-- Main.matthewa - 05 Jul 2006

Weekending 7th July 2006

* Implemented phone/letter alignment algorithm (as specified in Damper et al., Aligning Letters and Phonemes for Speech Synthesis, Pittsburgh 2004)
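As a rough illustration of this kind of alignment (not the actual Damper et al. method, which iteratively re-estimates association scores), here is a minimal dynamic-programming sketch that aligns letters to phonemes, allowing nulls, scored by a toy association table:

```python
# Hypothetical sketch: Needleman-Wunsch-style alignment of letters to
# phonemes. 'assoc' is a toy (letter, phone) -> score table, not real
# Damper et al. association scores.

def align(letters, phones, assoc, gap=-1.0):
    """Return (score, list of (letter-or-None, phone-or-None) pairs)."""
    n, m = len(letters), len(phones)
    # score[i][j] = best score aligning letters[:i] with phones[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, 'up'
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = score[i-1][j-1] + assoc.get((letters[i-1], phones[j-1]), 0.0)
            skip_l = score[i-1][j] + gap   # letter aligned to null
            skip_p = score[i][j-1] + gap   # phone aligned to null
            score[i][j], back[i][j] = max(
                (pair, 'diag'), (skip_l, 'up'), (skip_p, 'left'))
    # trace back the best path
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == 'diag':
            pairs.append((letters[i-1], phones[j-1])); i -= 1; j -= 1
        elif move == 'up':
            pairs.append((letters[i-1], None)); i -= 1
        else:
            pairs.append((None, phones[j-1])); j -= 1
    return score[n][m], pairs[::-1]
```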

* Set up ephones CVS repository in local festival CVS

* Attended Speak meeting

Weekending 14th July 2006

* Tested alignment tool with CMU

* Generated PBA arc output

* Generated unit selection style output

* Began implementation of unit selection Viterbi engine to search both. Currently under test.

Weekending 21st July 2006

* Completed code for a unit selection engine to do a Viterbi search through PBA arcs and LTS units
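A minimal sketch of the Viterbi search pattern used by such an engine (the candidate lists, target costs and join costs below are toy stand-ins, not the real PBA/LTS costs):

```python
# Toy unit-selection Viterbi: each position has a list of candidate
# units with a target cost; a join cost links consecutive candidates.

def viterbi(candidates, target_cost, join_cost):
    """candidates: list of lists of (hashable) units, one per position.
    Returns (total cost, best path as list of units)."""
    # best[u] = (cost of best path ending in u, that path)
    best = {u: (target_cost(u), [u]) for u in candidates[0]}
    for step in candidates[1:]:
        new = {}
        for u in step:
            # pick the predecessor minimising accumulated + join cost
            prev, (c, path) = min(
                ((p, best[p]) for p in best),
                key=lambda kv: kv[1][0] + join_cost(kv[0], u))
            new[u] = (c + join_cost(prev, u) + target_cost(u), path + [u])
        best = new
    cost, path = min(best.values(), key=lambda cp: cp[0])
    return cost, path
```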

* Created test harness for code in Python and a standalone version for valgrind

* System under test

Weekending 28th July 2006

  • Unit selection LTS works (although it requires tuning)

  • Examined Fiona Kenny's scripts and looked at setting up an Ergodic HMM to initially cluster speech.
    • How about just starting with her clusterings and the corresponding words, and doing LTS learning on those two sequences? This will be a quicker way to start than learning how to do ergodic HMM learning with BIC for setting number of states -- Main.simonk - 31 Jul 2006

  • Had a look at possible improvements to the alignment process. Checked out some machine translation work, and some DNA segmentation work.

Weekending 4th August 2006

  • Implemented some scripts to generate n-grams from the lexicons with a view to making 'phrase' tables.
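A minimal sketch of the kind of n-gram counting involved (the real scripts and lexicon format are not shown; the pronunciations here are illustrative):

```python
# Count phone n-grams over lexicon pronunciations as candidate
# 'phrases'. Pronunciations are lists of phone symbols.

from collections import Counter

def ngrams(seq, n):
    """All contiguous n-symbol windows of seq, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def count_ngrams(pronunciations, n):
    counts = Counter()
    for phones in pronunciations:
        counts.update(ngrams(phones, n))
    return counts
```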

  • Read a bit more MT literature on the subject

  • Set up HTK to use Fiona Kenny's initial ergodic 100-state models. Tried training and wrote scripts to extract state alignments.

Weekending 11th August 2006

  • Implemented a baseline k-means clustering to compare with the ergodic HMM
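For reference, a toy version of the k-means (Lloyd's algorithm) baseline; real runs would cluster MFCC frames rather than 2-d points:

```python
# Toy k-means: alternate assignment to nearest centre and centre
# re-estimation. Points are tuples of floats.

import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to its nearest centre (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # recompute centres as cluster means (keep old centre if empty)
        for j, members in enumerate(clusters):
            if members:
                centres[j] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centres, clusters
```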

  • Tried using output from k-means to set the means and variances of the models

  • Explored the use of PyMPI to speed up model learning using the beowulf cluster

  • Looked at 13-dim output with 45 states

  • Read up a bit on BIC

Weekending 18th August 2006

  • Got SLPA tools running to find silence.

  • Tried using a silence model during alignment, trained using SLPA. The idea was to have more of the ergodic states cluster actual speech. This failed quite impressively, with the silence model's centre state labelling all speech.

  • Returned to the 13-dim ergodic model, which looks 'okay'. Carried out an evaluation of states against SLPA and HTK phone alignments.

Weekending 25th August 2006

  • Back to reading MT work on alignment. Perhaps the Marcu/Wong approach would suit

  • Implemented baseline discrete HMM LTS system using HTK.

Weekending 1st September

  • Set up initialisation for the discrete model

  • Meeting with Simon King to determine best strategy

  • Met with Phil Koehn to talk about using MT software

  • Reviewed Pharaoh documentation

Weekending 8th September 2006

  • Got Pharaoh running

  • Did evaluation

  • Started to sort out Festival LTS

Weekending 15th September 2006

  • Preparation for ICSLP

Weekending 22nd September 2006

  • ICSLP Pittsburgh

Weekending 29th September 2006

  • Built Festival LTS cart system
  • Project Meeting
  • Looked at Multigram work by Deligne and Bimbot

Weekending 6th October 2006

  • Continued on Joint Multigram approach
  • Meeting with Fiona Kenny to discuss project
  • Began implementation of joint multigrams

Weekending 13th October 2006

  • Completed code for generating multiple segmentations. Problem: word length has an exponential effect on the number of segmentations and their size

  • Briefly considered using syllabification to constrain segmentation
  • Switched to using sqlite with Python to allow any size of database
  • Started code on generating lattices from arcs
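The exponential blow-up noted above is easy to see in a minimal sketch that enumerates every segmentation with a bounded chunk size (the count grows like a generalised Fibonacci number in the word length):

```python
# Enumerate every way of splitting a sequence into chunks of
# 1..max_len symbols. Illustrates why long words explode.

def segmentations(seq, max_len):
    """Yield every split of seq into chunks of at most max_len symbols."""
    if not seq:
        yield []
        return
    for n in range(1, min(max_len, len(seq)) + 1):
        for rest in segmentations(seq[n:], max_len):
            yield [seq[:n]] + rest
```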

Weekending 20th October 2006

  • Lattice code complete
  • Forward/backwards implemented
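The forward/backward recursion has the usual shape; here is a generic sketch over a small discrete HMM (the real code runs over multigram lattices, but the alpha/beta pattern is the same):

```python
# Forward/backward over a discrete HMM, plain nested lists.
# init[s]: initial probs; trans[s][t]: transition probs;
# emit[s][o]: emission probs; obs: list of symbol indices.

def forward_backward(init, trans, emit, obs):
    """Return per-time state posteriors gamma[t][s]."""
    S, T = len(init), len(obs)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = emit[s][obs[t]] * sum(
                alpha[t-1][r] * trans[r][s] for r in range(S))
    for s in range(S):
        beta[T-1][s] = 1.0
    for t in range(T - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(trans[s][r] * emit[r][obs[t+1]] * beta[t+1][r]
                             for r in range(S))
    post = []
    for t in range(T):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        post.append([x / z for x in g])
    return post
```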

Weekending 27th October 2006

  • Meeting with Simon on topology. Decided we did not need to normalise arcs in nets unrolled in time.
  • embedded training complete
  • training code complete
  • Testing shows the net is unstable

Weekending 3rd November 2006

  • Fixed instability by using word probabilities in forward/backward
  • Tested system
  • Looked at constraints on joint arc length differences

Weekending 10th November 2006

  • Completed full pass on 20k lexicon with TEST/TRAIN split using 3-2, 2-3, 2-2, 1-2, 2-1, 1-1 arcs
  • Decided to ignore silence problem for now and use standard segmentation for silence detection
  • Wrote scripts to extract ergodic model sequences vs. canonical lexicon
  • Decided to implement rolling window to deal with long sequences.

Weekending 17th November 2006

  • Implemented rolling window
  • Debugged running joint multigrams for canonical phone/state input
  • Wrote scripts to merge segmentation with timings to examine segmentation
  • Set database to run in memory for speed over the test set
  • Expect to have segmentations of 100 utterances by Saturday.

From 17th November to 7th March

  • Multigram segmentation on ergodic states was unsuccessful.
  • Tried a DTW approach to segmentation. This looks a lot more promising. Basically it involves comparing every utterance with every other utterance and finding regions which are similar across utterances. The transitions into and out of these regions are then used to pick segmentations by peak-picking the frames with the most transitions in them. In addition I normalised the MFCCs before DTW and upped the amplitude component x10.
  • Presented a prototype of this against ergodic segmentation at a one-day workshop in Birmingham.
  • Implemented clustering code for categorising the segments found with DTW. Problem with 'stray' clusters with only very few members.
  • Back to joint-multigram alignment with DTW segments. Doesn't look too good. Problems with sparsity.
  • Tried to use duration information from clustering as a prior on the joint multigram alignment but found the system didn't converge. Basically used the duration distributions of actual clusters to assess distributions of letter classes and then P(ltrs|units) based on these distributions. This was then used to modify the model, and the idea was that it would then re-inform the duration distributions. Need to discuss this with someone who has tried this sort of thing before.
  • Looked at re-clustering to broad units to initially inform segmentation. Looked at clustering letters based on letter context in order to equally produce broad letter classes.
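The two core pieces of the DTW segmentation idea can be sketched as follows: plain DTW distance between two frame sequences, plus peak-picking over per-frame transition counts to propose boundaries. The distance function, features and window size here are toy stand-ins, not the actual parameters:

```python
# Piece 1: classic DTW alignment cost between two sequences.
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best monotonic alignment of a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i-1], b[j-1]) + min(
                cost[i-1][j], cost[i][j-1], cost[i-1][j-1])
    return cost[n][m]

# Piece 2: pick boundary frames where the per-frame transition count
# is a strict local maximum within a small window.
def pick_peaks(counts, window=2):
    peaks = []
    for i, c in enumerate(counts):
        lo, hi = max(0, i - window), min(len(counts), i + window + 1)
        neighbours = counts[lo:i] + counts[i+1:hi]
        if c > 0 and all(c > x for x in neighbours):
            peaks.append(i)
    return peaks
```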

Change of format for Progress reports

  • Instead of an end of week report I will add a Wednesday morning entry for my planned activities over the week. I will then add an end of week entry to report on progress against the plan.

Plan 07/03/07

  • Get DTWseg working on PyMPI on beowulf cluster

Result 09/03/07

  • Managed to crash Townhill. Changed to using qsub; job appears pending.
  • running again on gan with smaller length threshold.
  • Estimated time for segmenting 2000 files is about a year and a half.
  • Will re-estimate using an optimised version of dtwseg on both gan and Townhill.
  • Can reduce the comparison set to make it linear time if necessary.

Plan 09/04/07

  • Modify clustering to control max and min size of clusters and to use CLARA approach
  • generate full nin segmentation
  • play with cluster size re segmentation and evaluate against ground-truth classic segmentation (subset)

Plan 23/05/07

  • redo joint multigram with frequency only and straight EM
  • experiment with getting proportions of multigrams types predictable through a threshold
  • analyse joint multigram frequencies over co-segmented data

  • Did the above; results don't look good
  • wrote code to generate joint mgrams from phone/ephone files

  • NEXT WEEK: use this to set frequency and generate answers. (Analyse the extent to which multigram types affect the data.) i.e. if we learn the right parameters, how well should it do?

Result 31/10/07

  • Multigram problem solved by using orthographic segmentation to find silence and a prior for word boundaries.
  • Evaluation of units based on lexical verisimilitude implemented. Results poor for dtw segmentation. For Junichi's ergodic segmentation also poor. However we realised that he had used the wrong audio for this data, so it is being done again.
  • Lexical v unit alignment using multigrams appears generally robust. Input from this can be used in a unit selection LTS fragment system (or CART or PBA).
  • Attempt to use orthographic segmentation to condition dtw unit clustering. Results were not encouraging and highlighted the rather arbitrary initial parameters used in the dtw. i.e. normalisation, acoustic features, thresholds etc.
  • Presentation at conference.
  • In general the success of the orthographic segmentation of the data has highlighted flaws in the selection of the emerged units.

Plan 1/11/07

  • Junichi is redoing the ergodic segmentation with the correct acoustics. As soon as this is done I will re-evaluate. If the results look at all useful we can use the lexicon to retrain and I will have data to work on the LTS system. Expected to be ready next week.
  • Junichi could use orthographic segmentation to condition the ergodic model (certainly by removing silence and possibly by retraining on word segments).
  • I'm currently rewriting the dtw segmentation code so that it is more robust, better documented and a lot faster. This is in order to look at the selection of weights and acoustic parameters which are best for this approach. I intend to try a gradient ascent weight tuning on acoustic parameters based on the orthographic data (i.e. mismatching section should have a high distance and matching sections a low distance). This code could be released as a python module if the approach proves useful. Either way it is a good comparison with the ergodic approach.
  • Main objective is to have an end-to-end system in place ASAP.

Weekending 16th November

  • Junichi is running the ergodic system over the data with the help of the orthographic segmentation. (Not helped by a complete power cut in Buccleuch!) While I wait for these results (to move ahead on LTS etc.), I've returned to DTW to see if I can improve it. The first question is what space the dtw should be carried out in (before, I was arbitrarily normalising MFCCs and then increasing the energy by 10). The idea is to train dtw on other speech which does have phonetic differences, to tune the system to the perceptual space.
  • Completed rewrite of dtw code. Now more consistent, less buggy and about 5-10 times faster.
  • Put in framework for learning weight parameters in dtw
  • Sanity check comparing across and within phone categories suggests raw dtw is working

Weekending 30th November

  • Re-analysed Junichi's ergodic output for both silence-conditioned and word-boundary-conditioned models (using the original orthographic segmentation)
  • Helped put together paper for Language and Speech Technology 2008. Indirectly related to ephones in terms of looking at hybrid approaches to HTS and unit selection, sparsity etc.
  • Meeting with Junichi looking at comparison of orthographic segmentation and the ergodic data. Raises the question of retraining (or adding stricter initial constraints drawn from the orthographic segmentation). The current model could be used as input to a joint-multigram model, in turn used to produce an initial alignment for a standard system. We both decided to look at the data and reflect on the best way forward.

Weekending 21st December

  • Ran set of experiments to set dtw parameters: feature weights

Ideally we want the acoustic space to regard things which are perceptually similar as near and things which are different as far. The orthographic segmentation gives us a set of orthographic sub-units. A lexical prior suggests that the same units should generally (but not always) be more perceptually similar. Thus if we train our space to try to bring similar orthographic units together and separate different ones, we should have a space closer to our perceptual space.

Using basic hill climbing and a very simple Monte Carlo method, weights for the dtw parameters have been trained in this way. They appear fairly stable.
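A sketch of this weight training under stated assumptions: a toy weighted squared-distance objective that rewards keeping same-unit pairs near and different-unit pairs far, optimised by random perturbation of normalised weights. The objective, features and step sizes here are illustrative stand-ins for the real ones:

```python
# Stochastic hill climbing over feature weights: perturb at random,
# keep the change if the separation objective improves.

import random

def weighted_dist(x, y, w):
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y))

def separation(w, same_pairs, diff_pairs):
    """Larger is better: different-unit pairs far, same-unit pairs near."""
    return (sum(weighted_dist(x, y, w) for x, y in diff_pairs)
            - sum(weighted_dist(x, y, w) for x, y in same_pairs))

def hill_climb(w, same_pairs, diff_pairs, steps=200, scale=0.1, seed=0):
    rng = random.Random(seed)
    best = separation(w, same_pairs, diff_pairs)
    for _ in range(steps):
        # perturb, clip to positive, renormalise to sum to 1
        cand = [max(1e-6, wi + rng.gauss(0.0, scale)) for wi in w]
        z = sum(cand)
        cand = [c / z for c in cand]
        s = separation(cand, same_pairs, diff_pairs)
        if s > best:
            w, best = cand, s
    return w
```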

  • Maximum time distortion

The same applies, except we can just try a load of values with the learnt weights. A priori we want to allow time distortion. Beyond a width of 10 frames on the distortion diagonal it again settles down.

  • Segmentation values

Threshold and minimum match width, as well as the peak-picking window. Here we ideally want word boundaries in the orthographic segmentation to be maintained (well, often, not always). In addition we would like similar orthographic units to be modelled with a similar number of ephone units. Ran threshold 1500/2000/2500, min widths 10ms/50ms/90ms and peak-picking window size 1-20

No clear favourite, but 2000 5 2 seems a good choice.

Eyeballing a single file with these (and other) values supports this: it seems to do a reasonable job in terms of modelling the underlying acoustics.

  • Re run segmentation

New code and new parameters being used for rerun of segmentation on grid engine.

Plan 2008: January

  • Recluster new segmentation and do a lexical evaluation
  • Resegment using HTK and alternative prons
  • Redo lexical evaluation
  • Use multigram to generate prons
  • Redo lexical evaluation
  • decide how to map ephones onto ediphones: maybe standard midpoints? Issue about sparsity and how we decide to deal with it if we have many units.
  • Objective end to end system within speech database lexicon.

Weekending 11th January

  • Segmentation done over xmas looks okay
  • tried the Python cluster library but it is too slow for our purposes
  • refactored label comparison code (to compare orthographic segmentation with dtw boundaries)
  • fixed some minor problems with word orthographic segmentation


  • use comparison code to generate 'forced boundary' dtw label files (where silent locations or word boundaries can be forced onto the dtw segmentation)
  • add code to calculate orthographic context of units (i.e. s2/+s+0)
  • add clustering using memory-loaded feature frames
  • redo output lexicon

Weekending 18th January

  • label comparison code refactored. Now a general tool for comparing two unit sequences, applicable both to wavesurfer-style label files and to loaded feature frames from dtw.

  • label boundary forcing code written and tested. This allows boundaries from one label sequence to be forced onto a second, either by moving close boundaries or, if necessary, inserting boundaries. Using the current dtw segmentation we can ensure word-boundary felicity by inserting 1% of boundaries and moving the others with an MSE of less than 0.001ms.
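A sketch of the boundary-forcing step as described (times in seconds; the tolerance value is illustrative, not the actual one used):

```python
# Force reference boundaries (e.g. word boundaries) onto a second
# segmentation: snap the nearest existing boundary if one lies
# within a tolerance, otherwise insert a new boundary.

def force_boundaries(boundaries, refs, tol=0.02):
    out = sorted(boundaries)
    for r in refs:
        nearest = min(out, key=lambda b: abs(b - r)) if out else None
        if nearest is not None and abs(nearest - r) <= tol:
            out[out.index(nearest)] = r   # move the close boundary
        else:
            out.append(r)                 # no boundary nearby: insert
    return sorted(out)
```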

  • shell script to run python with renewable tickets to allow long jobs within screen.

  • kmeans sample-style cluster code rewritten for in-memory feature frames. Now 3-4 times faster. Orthographic context functionality built in. Currently clustering the reference set with an octx weight of 1.0. Should be done by tomorrow afternoon.


  • check out lexicon generation for new clusters

  • produce evaluation metric on lexicon generated.

  • try different ctx weights and cluster sizes.

  • use new lexicon for ephone segmentation using cereproc voice building kit.

Weekending 25th January

  • tried various clusterings with different orthographic weights. Higher weights don't converge.

The average number of pronunciations per word does drop slightly with higher context weights, but the average is 4, compared with about 1 in a normal lexicon.

  • Got cereproc segmenter and voice build to work at CSTR

  • segmenting based on cluster00. Who knows, maybe the segmentation process will help remove extra pronunciations and help the system converge on decent units. On the other hand...


  • Evaluate the segmentations with the dtw units

  • if it's lousy, decide what to do next...

Weekending 8th February

  • Segmentation using dtw is surprisingly good, both for zero and high lexical context biasing. Using this lexicon to redo segmentation (using cereproc orthographic segmentation)

  • In addition, the average number of prons per word also drops.

  • Over these two weeks got output into a form to load into cerevoice. Voice created for high and low lexical context. Encouraging but far from acceptable.


  • Use joint multigrams to add a constant multigram-generated lexicon for all words to homogenise the pronunciations (use zero lexical context). Then use this to resegment. Hopefully common multigrams will win out and we will have a better pronunciation representation which is lexically connected. Will review the pba unit selection code written last year to get this running.

Weekending 15th February

  • Done as above. Voice building will be finished by Saturday PM.

Weekending 22nd February

  • Above didn't work. Results worse than basic cluster00 segmentation.

  • new plan see below


  • cluster dtw units within orthographic groups (multi symbols -> new category)

  • cluster quite narrowly (depending on number of items)

  • implement cluster merging algorithm: 1. estimate distance between clusters. For each item in c1 get the distance to items in c2. If this is less than that item's distance to c1's medoid, close = True.

  • based on a single threshold, decide heuristically whether clusters can be joined.

  • This should result in at least a sensible sym->eph mapping which will produce a more homogeneous lexicon

  • Hopefully we can also merge across syms to reduce number of ephones
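The merging heuristic above can be sketched like this, with a toy squared distance and an illustrative threshold (the real distance operates on dtw unit frames):

```python
# Medoid-based cluster merging: an item in c1 is 'close' to c2 if its
# nearest member of c2 is nearer than c1's own medoid; merge when
# enough items in both directions are close.

def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def medoid(cluster):
    # member minimising total distance to the rest of the cluster
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))

def close_fraction(c1, c2):
    m = medoid(c1)
    close = sum(1 for x in c1
                if min(dist(x, y) for y in c2) < dist(x, m))
    return close / len(c1)

def should_merge(c1, c2, threshold=0.5):
    # symmetric test with a single threshold on the close fraction
    return (close_fraction(c1, c2) >= threshold
            and close_fraction(c2, c1) >= threshold)
```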

Topic revision: r34 - 23 Feb 2008 - 09:54:50 - Main.matthewa
CSTRprojects.EphonesMatthewAylettProgress moved from CSTRprojects.MatthewAylettProgress on 03 Aug 2006 - 09:49 by Main.matthewa