2014 May 27: ASR meeting (SJR, JD, FRM, MS, PS, JK, PJB)

Systems we'll be working on over summer (coordinate them):

  • BBC (FRM to finish preprocessing LM data and classifying it by genre) - train on one week of data + English Euronews? [PJB] ... PS's CNN training should be easy enough to apply
  • Sky News - treat as a facet of the BBC task; aligned data is now 38.5 hours out of 280; alignment is hard because of non-speech sound and mismatch between speech and subtitles (to be improved by getting the raw subtitles and doing our own filtering of them) [JD]; JK is getting some of the data transcribed manually; JD is currently using either CED-filtered IWSLT LM data or Euronews, but can add BBC subtitles once FRM has finished processing them, as well as BBC and Sky website data [JD]
  • Chris Baume's Vamp plugin for speech/music discrimination may be worth trying, and we can talk with MF about his acoustic event detection
  • IWSLT - English, German and Italian (Italian using Euronews?)
  • inEvent - TED+AMI+ICSI training for wide coverage (try this also for IWSLT)
  • Advert detection might be an interesting task on Sky data; also repeated material detection.
  • SJR expects all our systems to be Kaldi-based soon. We should have a library of models trained on appropriate data sets.

Who's working on what?

  • IWSLT: JD (German and Italian - try to get a better set of training data than for last year's German [SJR to ask IWSLT organisers what's available]), FRM and PJB (English); need to register, and talk with AB this week about it

System releases:

  • NST: BBC project Comma: you can upload a service as a virtual machine in Amazon Cloud
  • Completely open source system desirable, using AMI data in a Kaldi recipe and avoiding LDC dependency - PS already working on such a system, using CMUDICT, with LM using Fisher data (need to check permissions)
  • what LM data are freely available? - TED (FRM's scripts to be released), AMI, Google 10^9 words; ask ICSI about their meeting transcripts, and talk with Tony Robinson about Google data [SJR]
  • licences: Apache for Google data, Creative Commons for AMI (Share Alike, upload any changes you make)
  • how many downloads of AMI data will there be? - SJR guesses 100 in first month and 1000 in first year; JK to check that our systems can cope and fix any remaining filename problems and extract text from files
  • for TED we can start with a manual segmentation of acoustic data in decoding (but automatic alignment in training: manual segmentation available only at 1s resolution);
  • using long segments will lead to search errors in Kaldi as in HTK, but MS has a possible hybrid segmentation solution (first pass with his segmenter, refined by decoder)

Need to combine above notes with the below...

ACTION / Kaldi AMI recipe

  • JK fix AMI corpus filenames
  • JK extract transcripts for LM purposes
  • check the ICSI transcripts for open LM use
  • ask TR (Tony Robinson) re background LM from Google billion words


  • JD - IT and DE models
  • PJB - NICT comparable acoustic models
  • inEvent compatible acoustic models
  • PJB/FRM - Kaldi TED setup
  • PJB build ICSI+AMI+TED AM
  • MS DNN-based acoustic model for segmenter
  • FRM etc contact Lexi to coordinate
  • mail IWSLT organisers about standard acoustic model training data


  • bin subtitle files
  • seg adverts
  • EN+BBC1wk+Weather AM
  • get Sky+BBC news web data?
  • FRM working on BBC LM data - classify by genre
  • CNN training for BBC
  • BBC plugins for Audacity
  • check Marc Ferras's acoustic event detection

Next meeting: Jun 5, 16:30

-- SteveRenals - 06 Jun 2014

Topic revision: r1 - 06 Jun 2014 - 17:40:31 - SteveRenals