-- Main.dwang2 - 30 Sep 2007


In the past two weeks I mainly focused on two things. One was rearranging some results needed for the Speech Communication paper and the ICASSP paper. According to the comments, we need to provide more results and clarify several things for the ICASSP paper. The other was the annoying performance degradation with tandem features in English, for which we have not yet found the answer. More depressing still, I find that the STD performance does not increase with tandem features even for the grapheme system, although the WER is reduced by 3% absolute! This observation differs from the Spanish case, where both ASR and STD performance improved with tandem features.

I also encountered a strange phenomenon: when the decoder is invoked twice, different results may be obtained, all of which seem different from what I achieved a month ago... Need to track down why.


The focus of this week was the two papers. With Joe's help we finally completed the Speech Communication paper and the STD paper for ICASSP08. One promising result is that combining the grapheme system and the phoneme system yields some improvement, especially for INV words, where the improvement with the simple voting scheme is significant. Significance testing shows that for FOM values the improvement is more promising, while for OCC the increase is less obvious but still significant at the 0.05 level. This suggests an interesting direction: rather than pitting grapheme systems against phoneme ones, a combination could be more persuasive, since our analysis shows that graphemes describe the acoustic-linguistic relationship from a perspective which is, in essence, different from phonemes.
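To make the voting idea concrete, here is a minimal sketch of how such a score-level combination might look. This is an illustration, not our actual scoring code: I assume each system emits detections as (keyword, time, score) triples, and that detections of the same keyword within a small time tolerance are merged with their scores added, so putative hits found by both systems get boosted.

```python
# Hypothetical voting combination of two STD systems' detection lists.
# A detection is (keyword, start_time, score); matching detections from
# the two systems are merged and their scores summed (a simple additive vote).

def vote_combine(dets_a, dets_b, tol=0.5):
    """Merge two detection lists; detections of the same keyword within
    `tol` seconds are treated as one hit and their scores are added."""
    merged = list(dets_a)
    for kw, t, s in dets_b:
        for i, (kw2, t2, s2) in enumerate(merged):
            if kw == kw2 and abs(t - t2) <= tol:   # same putative hit
                merged[i] = (kw2, t2, s2 + s)       # boost agreed hits
                break
        else:
            merged.append((kw, t, s))               # hit unique to system B
    return sorted(merged, key=lambda d: -d[2])

# Toy example with made-up detections:
phone = [("meeting", 1.2, 0.9), ("agenda", 7.4, 0.4)]
graph = [("meeting", 1.3, 0.7), ("budget", 3.0, 0.6)]
combined = vote_combine(phone, graph)
```

In this toy data the "meeting" hit found by both systems ends up with the largest combined score, which is the behaviour that helps the INV words.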


Last week I kept on with the tandem-feature-based rt04 STD experiments. It was shown before that tandem features improved performance for the grapheme-based system but degraded the phone-based system, which is a strange result. Some analysis suggests I may have done something wrong in the training process (actually I cannot reproduce the exact WER as before!). After lots of attempts to uncover the problem, it now seems pinned down to a buggy update in the training script. I have also been reading some papers, which suggest that brute-force concatenation of MFCC and tandem features in the feature domain is not a good idea. A multi-stream HMM with properly trainable weights for the MFCC and tandem streams is more appropriate, since it at least ensures the combined system is never worse than the MFCC-based system alone.
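As a sketch of the multi-stream idea (one common formulation, not necessarily the exact one in those papers): the state emission log-likelihood becomes a weighted sum of per-stream log-likelihoods, and with the tandem weight at zero the model falls back to MFCC-only, which is why a tuned weight cannot do worse than MFCC alone on the tuning set. The function names here are illustrative.

```python
# Illustrative multi-stream emission score: a weighted combination of the
# MFCC-stream and tandem-stream log-likelihoods. With w_tandem = 0 this
# reduces exactly to the MFCC-only model.

import math

def gauss_loglik(x, mean, var):
    """Diagonal-Gaussian log-likelihood of a feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def stream_loglik(o_mfcc, o_tandem, mfcc_pdf, tandem_pdf, w_tandem):
    """Weighted two-stream log-likelihood; each pdf is a (mean, var) pair."""
    ll_m = gauss_loglik(o_mfcc, *mfcc_pdf)
    ll_t = gauss_loglik(o_tandem, *tandem_pdf)
    return (1.0 - w_tandem) * ll_m + w_tandem * ll_t
```

The stream weight could be tuned on the dev set or trained; either way it interpolates between the two single-stream systems rather than forcing a fixed feature-domain concatenation.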


1. Joe has generated the new tandem features, and the new HMM models are being trained on townhill; not done yet.

2. Why I cannot reproduce the previous results is still a mystery: even from the first step, with the same training script, the same training MLF and the same training file list, the final HMM differs to some extent (although not significantly). I rolled back to the original HTK 3.4, but no consistent results can be observed. Compared with the old systems, the newly trained MFCC-based systems show about a 1% absolute increase in WER. I now have to base the comparison with the tandem systems on the current results, to ensure all systems come from the same procedure except for the features.

3. Still working on a flexible threshold for STD.


Haven't updated the progress for long, damn. We (Joe and I) finally got consistent results for tandem features. In previous experiments, the phone-based system always achieved worse results with tandem features than with standard MFCCs, although the grapheme-based system showed significant improvement. We tracked the problem down via the raw MFCCs and the speaker-based CMN/CVN, and have now found that tandem features do provide a slight improvement for the phone-based system, although much less significant than for the grapheme-based system. The results are published on the webpage http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=129

There is now a concern with these results: although the grapheme system benefits greatly from tandem features, the MLP used to extract them was trained with phone knowledge. This means extra lexicon information was applied, which presumably gives the grapheme system an additional advantage, and is not quite consistent with the grapheme system's assumption of being lexicon-free.

On the other hand, what if we used grapheme-based tandem features for the phone-based system? Could some extra knowledge be provided so that its performance improves? If so, we will be lucky.

I'm now reading Bishop's new book, "Pattern Recognition and Machine Learning", hoping to borrow some methodologies from machine learning for a deeper understanding of the confusion between graphemes and phones.


1. Joe found some possible problems in the current tandem feature extraction, so I have performed experiments based on the MFCC systems first. These demonstrated that CMN/CVN based on the whole vector (39-dim) and on userID (without siteID and meetingID) achieved the best performance,


This is a comfortable result, as we do not need to manipulate the 64-dim tandem features as supposed; we can just normalize the whole vector.
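For clarity, the per-speaker CMN/CVN described above amounts to the following: for each speaker (userID), every feature dimension is shifted to zero mean and scaled to unit variance over all of that speaker's frames. A pure-Python toy sketch (the real features are 39-dim MFCC vectors, and HTK does this internally):

```python
# Per-speaker cepstral mean and variance normalisation over the whole
# feature vector: pool all frames of one speaker, then normalise each
# dimension to zero mean and unit variance.

def cmn_cvn(frames):
    """Normalise one speaker's frames (a list of equal-length lists)."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [v ** 0.5 if v > 0 else 1.0 for v in var]   # guard constant dims
    return [[(f[d] - means[d]) / std[d] for d in range(dim)] for f in frames]

# Toy 2-dim, 2-frame example for one speaker:
normed = cmn_cvn([[1.0, 2.0], [3.0, 6.0]])
```

The point of the result above is that this same whole-vector treatment works for the concatenated MFCC+tandem vector too, with no special handling of the tandem dimensions.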

2. I then performed the STD experiments with the new user-based normalization, for the phoneme system and the grapheme system respectively,

http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=131 http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=132

The result shows the new normalization does improve performance for both the grapheme and phoneme systems; however, the phoneme system seems to benefit more from the normalization.

Another result is that the system based on the LM from the training transcript performs no worse than one based on a bi-gram trained from a large text corpus, indicating that the meeting domain has its own characteristic word sequences. For higher-order LMs, however, the large-corpus-based LMs are probably better, considering the data sparsity issue when using only the training transcription (as demonstrated before on higher-order LMs with the old phoneme/grapheme systems).

3. It was also found that HDecode bi-gram decoding is much slower than HVite and generates larger lattices. This is more serious for the grapheme system. I'm checking the reason, since we need HDecode to generate lattices with higher-order LMs (although with higher-order LMs the lattices will be quite compact).

4. Speaker adaptation is also being performed. Tuning has been kicked off to find the optimum operating point for the adaptation stage (HVite-based, grapheme decoding).


This week I mainly focused on on-line adaptation, to observe the behavior of adaptation for the grapheme system. One product is that, by running the performance test, I found the operating point for HVite sub-word unit decoding, as well as its performance. http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=133 http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=134

At the found operating point, I performed on-line adaptation, first using HVite as the sub-word unit decoder. The unfortunate result I now observe is no obvious improvement, which does not accord with common sense. In the experiment I selected one base class, without a regression tree. I will try a regression tree (trained on the training data), although even the single base class should give an improvement.

On the other hand, the tandem features that Joe recalculated are being used to build new systems. The systems are ready and are now being tuned. Results on the dev set seem better than before.


After lots of experiments, we have finally reached results that I believe are reliable enough, for the STD tasks with long-span language models in English.

In our ICASSP paper, it was demonstrated that the grapheme-based STD system performs worse than the corresponding phoneme one with a bi-gram LM trained from the training transcript. To test how wider linguistic context affects these two systems, I applied 2-9 gram LMs to the grapheme system and 2-7 gram LMs to the phoneme system, with more reliable tuning and testing.

1. Tuning

The tuning process is more accurate.

First, I believe using the evaluation word list for tuning is improper. There are three problems: (1) some words, e.g., unusual named entities, do not appear in the dev set; (2) the evaluation words are not always known at the tuning step; (3) tuning on the evaluation word list lacks generalization. To prevent this bias, this time I used many more words for tuning, including almost all words appearing more than 6 times in the dev transcript, except some function words,


Second, it was found that lattice size also affects final performance, and a larger lattice does not always improve accuracy. The maximum number of models in histogram pruning (-u in HDecode) seems a more convenient factor for controlling lattice size than the beam width (-t), so this is tuned with respect to the FOM value. Together with the insertion penalty and the LM scale factor, the tuning process is more reliable, with the help of random search.
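The random-search loop described above can be sketched as follows. This is only an outline of the tuning logic, not the actual pipeline: `run_std_fom` is a hypothetical stand-in for the real decode-and-score step (it would run HDecode with the sampled settings and score FOM on the dev keyword list), and the sampling ranges are made up for illustration.

```python
# Random search over decoding parameters (-u histogram pruning limit,
# LM scale factor, insertion penalty), keeping the setting with the best FOM.

import random

def run_std_fom(max_models, lm_scale, ins_penalty):
    """Hypothetical objective: in reality this decodes the dev set and
    scores FOM. Here, a dummy smooth function with an interior optimum."""
    return (-(max_models - 3000) ** 2 / 1e6
            - (lm_scale - 12.0) ** 2
            - (ins_penalty + 4.0) ** 2 / 10.0)

def random_search(n_trials=200, seed=0):
    rng = random.Random(seed)
    best, best_fom = None, float("-inf")
    for _ in range(n_trials):
        params = (rng.randrange(1000, 6000),   # -u histogram pruning limit
                  rng.uniform(5.0, 20.0),       # LM scale factor
                  rng.uniform(-10.0, 0.0))      # insertion penalty
        fom = run_std_fom(*params)
        if fom > best_fom:                      # keep the best triple so far
            best, best_fom = params, fom
    return best, best_fom
```

Random search is convenient here because the three parameters interact and each FOM evaluation is expensive, so a fixed grid would waste decodes on uninteresting regions.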

The tuning results based on MFCC features are shown in the first table on the following page,


We observe consistent performance improvements for both the grapheme and phoneme systems as the LM order increases; the storage is reduced as well, as expected.

The best LM order is 8 for the grapheme system and 7 for the phoneme one, with a similar number of LM entries.

Comparing the two systems, we conclude that the phoneme system is significantly better than the grapheme one, with more HITs and fewer FAs.

The same conclusion can be drawn for the tandem-feature-based systems; the results are in the second table further down the page.

2. Evaluation

The evaluation results are shown on the following page; please focus on the second table (with TH=2000),


The evaluation results are consistent with what we observed in the tuning.

3. Tandem features

Joe and I have worked on the tandem features for some time. The tables on the pages above also show the results from the tandem-feature-based systems. They show that, although tandem features help improve both the phoneme and grapheme systems, and even more so the grapheme system, the grapheme system still cannot compete with its phoneme counterpart.

To check the ASR results with tandem features, please look at the second table (TH=2000) on the following page,


Again, we see that tandem features improve performance more for the grapheme system than for the phoneme one; however, the latter is still much better than the former.


4. VTLN

I applied VTLN to the wsjcam0 tasks (as they are faster for drawing conclusions). The results show that VTLN consistently improves performance for both the grapheme and phoneme systems, but with no obvious preference for the grapheme system. Here are the results; please focus on the red figures in the table, which show the contribution of VTLN.


5. On-line adaptation

Again I applied on-line adaptation to the wsjcam0 tasks. To avoid confusion, I'll not provide the link. The results show that adaptation helps the phoneme system more than the grapheme system. This can be expected from the recognition results on the training set, where we see great degradation using graphemes, and from the much smaller number of context-dependent states in the grapheme system. Please look at the second table on the following page, focusing especially on the third column, named 'tied states'.


The results indicate that the grapheme system itself is much less accurate than the phoneme one; even if we adapt the system as if the speaker were in the training set, the grapheme system is still worse, or worse still.

6. Summary

As we have seen, the current simple grapheme-based system cannot compete with a phoneme-based one, intrinsically because it provides a less accurate representation of the acoustic space. On STD tasks the degradation appears smaller, but is still significant.

To make it succeed, my simple and tentative thoughts include the following two:

(1) As proposed in my thesis proposal, design a better surface layer for acoustic representation, considering wider letter context.

(2) Try to combine the grapheme system with the phoneme one, to help the base phoneme system improve its accuracy. At least on STD tasks this has been shown to be feasible, and I would like to design and try more complex integration approaches.

I'll keep working throughout the Christmas holiday, focusing on this combination work.


Over the holiday I focused on two things: first, MPE training; second, single-tree clustering.

1. MPE training

For MPE training, the consideration is that the underlying HMMs, which represent the acoustic implementation, suffer from unwanted ambiguity, and discriminative training might help reduce it. I have finished testing on the wsjcam0 task and the ami task. In both cases, MPE training does help the grapheme system more, although the improvement is not as significant as expected. The ASR results are shown here,

wsjcam0 task: http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=146 ami task: http://homepages.inf.ed.ac.uk/cgi/dwang2/cvss_request.pl?account=dwang2&step=view_request&cvssid=151

We see that in the wsjcam0 task, the phoneme system has an accuracy of 91.22 with MPE training, compared to 91.19 with MLE training; for the grapheme system, accuracy improves from 86.27 to 86.60.

In the ami task, the accuracy of the phoneme system reaches 51.62, up from 51.42; for the grapheme system, the figure improves from 42.94 to 44.15.

Tests on STD tasks are going on. The current observation is encouraging.
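For reference, the standard MPE criterion (in the usual formulation due to Povey; I assume the HTK implementation used here follows it) maximises the expected raw phone, or here grapheme, accuracy of the hypotheses in the lattice:

```latex
F_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R}
  \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}
       {\sum_{s'} p_{\lambda}(O_r \mid s')^{\kappa}\, P(s')}
```

where $O_r$ is the $r$-th training utterance, $s$ ranges over lattice hypotheses, $\kappa$ is the acoustic scale, and $A(s, s_r)$ is the raw accuracy of hypothesis $s$ against the reference $s_r$. Pushing probability mass towards high-accuracy paths is exactly what should reduce the ambiguity mentioned above.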

2. Single-tree clustering

The reason for considering a single-tree clustering over all graphemes, instead of the conventional individual tree for each grapheme, is that the HMMs of some different graphemes actually share similar pronunciations in the acoustic space, so letting them share acoustic models might improve the modeling power.

Another reason we want single-tree clustering is that, to build a fuzzy mapping from canonical units (graphemes) to acoustic units, it is simpler if we have those acoustic units ready; then we just need to train the mapping between the two layers of units. A single-tree clustering actually builds a set of acoustic units which can be used as the initial set, from which a mapping applying long-span context can be implemented easily. HSM uses the same strategy, although there no single tree is necessary, because each phoneme is assumed to be separated in the acoustic space.
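The core difference from per-grapheme trees can be shown with a toy example. This is purely illustrative (one-dimensional "states" and a squared-error split instead of real question-based likelihood splits): by pooling states of all graphemes before splitting, acoustically similar states of different graphemes, say 'c' and 'k', can land in the same cluster and share one model, which per-grapheme trees can never do.

```python
# Toy single-tree split: pool (grapheme, mean) states of ALL graphemes and
# split at the threshold minimising within-cluster squared error, so states
# of different graphemes may share a cluster.

def split_once(states):
    """One greedy binary split of pooled states; returns (low, high) sets."""
    xs = sorted(s[1] for s in states)
    best_t, best_cost = None, float("inf")
    for i in range(1, len(xs)):
        t = 0.5 * (xs[i - 1] + xs[i])          # candidate threshold
        left = [x for x in xs if x <= t]
        right = [x for x in xs if x > t]
        cost = (sum((x - sum(left) / len(left)) ** 2 for x in left)
                + sum((x - sum(right) / len(right)) ** 2 for x in right))
        if cost < best_cost:
            best_t, best_cost = t, cost
    lo = [s for s in states if s[1] <= best_t]
    hi = [s for s in states if s[1] > best_t]
    return lo, hi

# 'c' and 'k' states are acoustically close; 'a' and 'e' are close:
states = [("c", 1.0), ("k", 1.1), ("a", 5.0), ("e", 5.2)]
lo, hi = split_once(states)
```

In the real system the splits are driven by context questions and model likelihood, but the effect is the same: leaves of the single tree become grapheme-independent acoustic units.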

Now the training process is going on.


Now the single-tree clustering is done for the wsjcam0 task. No significant improvement on the eval set, though it does improve performance on the training set.


Two other pieces of progress concern the grapheme-phoneme fusion system. One is a variable fusion factor, using the dev set to determine the optimal factor for combining the phoneme and grapheme systems. This gives a little improvement for the fusion system.
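A minimal sketch of how such a dev-set-tuned fusion factor might work, assuming simple linear score interpolation (the actual fusion scheme may differ, and the metric here is a stand-in accuracy rather than FOM/OCC):

```python
# Tune a fusion factor alpha on dev data: the combined score is
# alpha * phoneme_score + (1 - alpha) * grapheme_score, and alpha is
# grid-searched to maximise a dev-set metric.

def combine(p_score, g_score, alpha):
    return alpha * p_score + (1.0 - alpha) * g_score

def tune_alpha(dev_pairs, labels, steps=101):
    """dev_pairs: (phoneme, grapheme) scores per candidate detection;
    labels: 1 for true hits, 0 for false alarms. Metric (illustrative):
    accuracy of thresholding the combined score at 0.5."""
    best_alpha, best_acc = 0.0, -1.0
    for i in range(steps):
        alpha = i / (steps - 1)
        preds = [1 if combine(p, g, alpha) >= 0.5 else 0
                 for p, g in dev_pairs]
        acc = sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc

# Toy dev data where the phoneme scores are the more reliable ones:
dev_pairs = [(0.9, 0.2), (0.8, 0.1), (0.2, 0.9), (0.1, 0.6)]
labels = [1, 1, 0, 0]
best_alpha, best_acc = tune_alpha(dev_pairs, labels)
```

On real data one would sweep alpha against FOM on the dev keyword list, then fix it for the eval run.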


The second is some intuitive evidence for the system fusion. By checking the relative performance of the two systems, we get a vivid impression that fusion is good. However, which factors determine the relative performance has not yet been found, although I have tried testing LM probabilities and word length.


Topic revision: r10 - 14 Jan 2008 - 18:39:00 - Main.dwang2