2014 May 27: ASR meeting (SJR, JD, FRM, MS, PS, JK, PJB)
Systems we'll be working on over summer (coordinate them):
- BBC (FRM to finish preprocessing LM data and classify it by genre) - train on one week of data + English Euronews? [PJB] ... PS's CNN training should be easy enough to apply
- Sky News - treat as a facet of the BBC task; aligned data now 38.5 hours out of 280, hard because of non-speech sound and mismatch between speech and subtitles (to be improved by getting raw subtitles and doing our own filtering of them) [JD]; JK getting some of the data transcribed manually; JD currently using either CED-filtered IWSLT LM data or Euronews, but can add BBC subtitles when FRM has finished processing, and BBC and Sky website data [JD]
- Chris Baume's Vamp plugin for speech/music discrimination may be worth trying, and we can talk with MF about his acoustic event detection
- IWSLT - English, German and Italian (Italian using Euronews?)
- inEvent - TED+AMI+ICSI training for wide coverage (try this also for IWSLT)
- Advert detection might be an interesting task on Sky data; also repeated material detection.
- SJR expects all our systems to be Kaldi-based soon. We should have a library of models trained on appropriate data sets.
Who's working on what?
- IWSLT: JD (German and Italian - try to get a better set of training data than for last year's German [SJR to ask IWSLT organisers what's available]), FRM and PJB (English); need to register, and talk with AB this week about it
System releases:
- NST: BBC project Comma: you can upload a service as a virtual machine in Amazon Cloud
- Completely open source system desirable, using AMI data in a Kaldi recipe and avoiding LDC dependency - PS already working on such a system, using CMUDICT, with LM using Fisher data (need to check permissions)
- what LM data are freely available? - TED (FRM's scripts to be released), AMI, Google 10^9 words; ask ICSI about their meeting transcripts, and talk with Tony Robinson about Google data [SJR]
- licences: Apache for Google data, Creative Commons for AMI (Share Alike, upload any changes you make)
- how many downloads of AMI data will there be? - SJR guesses 100 in first month and 1000 in first year; JK to check that our systems can cope and fix any remaining filename problems and extract text from files
- for TED we can start with a manual segmentation of acoustic data in decoding (but automatic alignment in training: manual segmentation available only at 1s resolution)
- using long segments will lead to search errors in Kaldi as in HTK, but MS has a possible hybrid segmentation solution (first pass with his segmenter, refined by decoder)
Need to combine the notes above with those below...
ACTION / Kaldi AMI recipe
- JK fix AMI corpus filenames
- JK extract transcripts for LM purposes
- checking the ICSI transcripts for open LM use
- ask TR re background LM from Google billion words
IWSLT/TED
- JD - IT and DE models
- PB - NICT comparable acoustic models
- inEvent compatible acoustic models
- PB/FM - Kaldi TED setup
- PB build ICSI+AMI+TED AM
- MS DNN-based acoustic model for segmenter
- FM etc contact Lexi to coordinate
- mail IWSLT organisers about standard acoustic model training data
BROADCAST/SKY+BBC
- bin subtitle files
- seg adverts
- EN+BBC1wk+Weather AM
- get Sky+BBC news web data?
- FM working on BBC LM data - classify by genre
- CNN training for BBC
- BBC plugins for Audacity
- check Marc Ferras acoustic event detection
Next meeting: Jun 5, 16:30
-- SteveRenals - 06 Jun 2014