-- Main.jyamagis - 09 Nov 2007

9th Nov 2007

Using CSTR's Nina speech data, BIC values (strictly speaking, minus BIC) were calculated for ergodic HMMs. Configurations with 10 to 150 ergodic states and 1 to 10 interconnected states were investigated. The speech data consisted of 1,000 utterances, and silences were excluded from the training of the ergodic HMMs. The optimal condition in this experiment was 130 ergodic states with 2 interconnected states. However, since this optimum lies near the upper end of the range tested, it would be better to also calculate these values for 160 to 200 ergodic states for confirmation.
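The model-selection step above can be sketched as follows. This is only an illustration, not the code used in the experiment: it assumes minus BIC is 2*logL - k*ln(N), with logL the training log-likelihood, k the number of free parameters, and N the number of frames. The function names and all numbers in the toy grid are hypothetical.

```python
import math

def minus_bic(log_likelihood, n_params, n_frames):
    """Return minus BIC = 2*logL - k*ln(N); larger is better."""
    return 2.0 * log_likelihood - n_params * math.log(n_frames)

def best_configuration(results, n_frames):
    """Pick the configuration with the largest minus BIC.

    results maps (n_ergodic_states, n_inner_states) to a tuple
    (log_likelihood, n_params) obtained after training that model.
    """
    scores = {cfg: minus_bic(ll, k, n_frames) for cfg, (ll, k) in results.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up log-likelihoods and parameter counts, purely for illustration.
toy = {
    (10, 2): (-5000.0, 100),
    (130, 2): (-4000.0, 200),
    (150, 2): (-3990.0, 400),
}
best, scores = best_configuration(toy, n_frames=10000)
print(best)
```

In this toy grid the largest model has the highest likelihood, but the penalty term outweighs its gain, which is the behaviour the BIC search relies on.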

12th Nov 2007

Since the above data was not identical to Matthew's, the ergodic HMM was calculated once again using Matthew's data. In Matthew's version of Nina's speech data, the sampling rate was modified to change her voice quality to Matthew's personal preference! The CSTR speech data which I used had also been trimmed in the silence regions. The ergodic HMM has been trained using the orthographic segmentation results which Matthew created. In the training, silence/short-pause boundaries and/or word boundaries were kept (i.e., two kinds of HMM were trained). Although I have not evaluated the accuracy of the orthographic segmentation, it is probably not too bad. Keeping the boundaries made the training of the ergodic HMMs faster, since the width of the trellis in the embedded training can be narrower.

Then I polished up a new idea on unsupervised training for HMM-based speech synthesis. If we build TTS completely from scratch, Hierarchical Dirichlet Process + HMM would be the best idea, since the hierarchical Dirichlet process can easily determine the appropriate number of clusters. (This is called iHMM, which is a very bad name: it is not a Markov process anymore, so I think "hidden Dirichlet model" would be a more rigorous and better name.) In the BIC-based approach, we have to calculate all the possible patterns.

However, it is not easy either to create a lexicon automatically or to convert an existing lexicon.

Thus, it would be good for me to start from a simple idea. The following is a rough outline of unsupervised training for HMM-based speech synthesis.

  • Orthographic segmentation / Word-based segmentation: Obtain word boundaries from speech with transcriptions.

  • AF (articulatory feature) recognition and calculation of its posterior features: Optimize the number of HMM units in each word using the posterior features of the AF recognizer, STRAIGHT mel-cepstrum, and logF0. Here the number of states in each unit is fixed. We start with a single 5-state HMM and calculate its BIC. Then we increase the number of HMMs (5*2 states) and calculate the BIC again. We repeat this procedure until the length of the HMM exceeds the number of frames in the word. We may use the # of graphemes in each word, since the number of units in each word would satisfy #graphemes-C < #units < #graphemes+C. Here a unit is a combination of recognized AF features.

  • Extract main articulatory features: Determine a combination of typical articulatory features in each optimized unit of each word as explanatory variables for the units (PCA?? This requires more research.)

  • Extract language-independent explanatory variables:
    • Word-level variables: # of units in the current word, # of graphemes in the current word, distance from the beginning silence, & distance from the ending silence.
    • Phrase-level (pause-to-pause region) variables: # of units in the current phrase, # of graphemes in the current phrase, # of words in the current phrase, distance from the beginning silence, & distance from the ending silence.
    • Sentence-level variables: # of units in the sentence, # of graphemes in the sentence, # of words in the sentence, & # of phrases in the sentence.

  • Generate questions on articulatory features and language-independent explanatory variables.

  • Train HMMs and conduct clustering of the distributions using the above questions.

  • Convert existing lexicon: Once we define the relation between AF and the current phone set, it is easy to convert a phone sequence to an AF sequence.

  • Add new lexicon entries: Use the AF recognizer and extract the AF sequence from speech.

  • Lexicon for unknown words: Train a grapheme-to-AF mapping model or treat them as missing values.

  • Synthesize speech: Extract the language-independent explanatory variables from the given text. Pick up the AF sequence from the AF lexicon. If the AF features include missing values, select all possible HMMs from the context-clustering decision trees, optimize them simultaneously, and choose the best one in the sense of the ML criterion.
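The search over the number of units per word, described in the second step of the list above, might look like the following sketch. It assumes one unit is one 5-state HMM, that the search stops once the total number of states exceeds the number of frames, and that the optional constraint is #graphemes-C < #units < #graphemes+C. The function names and `score_fn` (a callable that trains a model with the given number of units and returns its minus BIC) are hypothetical placeholders.

```python
def candidate_unit_counts(n_frames, states_per_unit=5, n_graphemes=None, c=None):
    """List the unit counts to try for one word.

    Starts at one unit and keeps adding units while the total number of
    states (units * states_per_unit) does not exceed the number of frames.
    If n_graphemes and c are both given, only counts satisfying
    n_graphemes - c < units < n_graphemes + c are kept.
    """
    counts = []
    units = 1
    while units * states_per_unit <= n_frames:
        unconstrained = n_graphemes is None or c is None
        if unconstrained or (n_graphemes - c < units < n_graphemes + c):
            counts.append(units)
        units += 1
    return counts

def select_unit_count(n_frames, score_fn, **kwargs):
    """Return the candidate with the largest score, or None if there is none.

    score_fn is a hypothetical callable mapping a unit count to minus BIC.
    """
    candidates = candidate_unit_counts(n_frames, **kwargs)
    if not candidates:
        return None
    return max(candidates, key=score_fn)

# Toy score peaking at 5 units, standing in for real BIC values.
chosen = select_unit_count(42, lambda u: -(u - 5) ** 2, n_graphemes=5, c=2)
print(chosen)
```

The grapheme constraint is what keeps this cheap: for a 42-frame word with 5 graphemes and C = 2, only three models need training instead of eight.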

Issues: There are no labels for pitch patterns, whereas F0 tilt would be modeled by the language-independent variables such as the # of units/graphemes in a phrase.
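The language-independent variables (word-, phrase-, and sentence-level counts) can be computed directly from a segmented utterance. The sketch below makes two assumptions that the notes leave open: word-level distances are counted in words from the silences bounding the phrase, and phrase-level distances are counted in phrases from the sentence edges. The data layout and all names are hypothetical.

```python
def explanatory_variables(sentence):
    """Compute count-based variables for every word in a sentence.

    sentence is a list of phrases (pause-to-pause regions); each phrase is a
    list of (graphemes, n_units) pairs, one pair per word.
    """
    n_phrases = len(sentence)
    sent = {
        "units": sum(u for phrase in sentence for _, u in phrase),
        "graphemes": sum(len(g) for phrase in sentence for g, _ in phrase),
        "words": sum(len(phrase) for phrase in sentence),
        "phrases": n_phrases,
    }
    rows = []
    for p_idx, phrase in enumerate(sentence):
        phr = {
            "units": sum(u for _, u in phrase),
            "graphemes": sum(len(g) for g, _ in phrase),
            "words": len(phrase),
            # Assumption: distance counted in phrases from the sentence edges.
            "dist_from_start": p_idx,
            "dist_to_end": n_phrases - 1 - p_idx,
        }
        for w_idx, (graphemes, n_units) in enumerate(phrase):
            word = {
                "units": n_units,
                "graphemes": len(graphemes),
                # Assumption: distance counted in words from the bounding silences.
                "dist_from_start": w_idx,
                "dist_to_end": len(phrase) - 1 - w_idx,
            }
            rows.append({"word": word, "phrase": phr, "sentence": sent})
    return rows

# Toy utterance: two phrases, with made-up unit counts per word.
toy = [[("the", 2), ("cat", 3)], [("sat", 3)]]
rows = explanatory_variables(toy)
print(rows[0]["phrase"])
```

Each row could then be turned into yes/no questions (for example "# of units in the phrase <= 5?") for the decision-tree clustering step.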

Except for the pitch pattern, this seems likely to work well. As an AF recognizer, Joe's would be good. Eventually, however, a language-independent AF recognizer would be required.

Topic attachments
| Attachment | Size | Date | Who | Comment |
| Ergodic-BIC-2D.tiff | 24.5 K | 12 Nov 2007 - 17:46 | Main.jyamagis | Minus BIC values for Ergodic HMM |
| Ergodic-BIC-3D.tiff | 27.5 K | 12 Nov 2007 - 17:47 | Main.jyamagis | Minus BIC values for Ergodic HMM (3D view) |
| ergo.tar.gz | 8540.1 K | 12 Nov 2007 - 18:21 | Main.jyamagis | Ergo samples |
| ergo10-3.wav | 192.4 K | 12 Nov 2007 - 18:22 | Main.jyamagis | Ergodic 10*3 |
| ergo120-3.wav | 192.7 K | 12 Nov 2007 - 18:24 | Main.jyamagis | Ergodic 120*3 |
| ergo40-3.wav | 192.5 K | 12 Nov 2007 - 18:23 | Main.jyamagis | Ergodic 40*3 |
| ergo80-3.wav | 192.7 K | 12 Nov 2007 - 18:23 | Main.jyamagis | Ergodic 80*3 |
| nina3_107_001.wav | 192.5 K | 12 Nov 2007 - 18:18 | Main.jyamagis | Ergodic 40*3 |
| nina_01_001.wav | 411.3 K | 12 Nov 2007 - 18:10 | Main.jyamagis | Ergodic 20*3 |
Topic revision: r2 - 12 Nov 2007 - 18:52:48 - Main.jyamagis