EphonesPharaoh (06 Sep 2006, Main.matthewa)

-- Main.matthewa - 06 Sep 2006 Project Home

Pharaoh

Scripts to build input for Philipp Koehn's Pharaoh MT system, used here for letter-to-sound (LTS) conversion, are in ephones/mt/scripts.

Training on the 20k most frequent words and testing on the rest of the cereproc CMU lexicon gave about 16% fully correct phone sequences (with stress removed, and excluding all words containing anything other than letters, hyphens, apostrophes or underscores).
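The normalisation described above could be sketched as follows. This is only an illustration, not the code in ephones/mt/scripts: it assumes CMU-style phone symbols that carry stress as a trailing digit (e.g. AH0), and the names strip_stress and is_allowed are hypothetical.

```python
import re

# Assumption: stress is marked by a trailing digit on the phone symbol.
_STRESS = re.compile(r"\d+$")
# Assumption: a word is kept only if it consists entirely of ASCII letters,
# hyphens, apostrophes or underscores.
_ALLOWED = re.compile(r"[A-Za-z'_-]+")


def strip_stress(phones):
    """Return the phone list with trailing stress digits removed."""
    return [_STRESS.sub("", p) for p in phones]


def is_allowed(word):
    """True if the word contains only the permitted characters."""
    return _ALLOWED.fullmatch(word) is not None
```

For example, strip_stress(["HH", "AH0", "L", "OW1"]) yields the stress-free phones, while is_allowed("a.m.") rejects the word because of the full stops.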

Accuracy rises to approximately 20% if we change the phrase penalty from 2.718 to 0.1 and the language model weight from 0.5 to 0.1.

The system is run with -monotone to prevent reordering.

The main difference is that the number of transcriptions with vowel errors only drops from 16643 to 12527 (out of 109356 words).

Crib for running it

  1. cd ephones/mt/scripts; mkdir ../corpus ../corpus_test ../faro
  2. python
  3. import make_data
  4. make_data.make_data('../../python_pba/cereproc_cmu_0.6_20k.lex', '../corpus')
  5. make_data.make_data('../../python_pba/cereproc_cmu_0.6_rare.lex', '../corpus_test')
  6. exit python; cd ..
  7. pharaoh/train-phrase-model.perl.2005-08-04 --root-dir faro --e pn --f lt --corpus corpus/words >& LOG
  8. cd faro; edit model/pharaoh.ini as required
  9. gunzip model/phrase_table.gz
  10. cat ../corpus_test/words.lt | ../pharaoh/pharaoh.2006-05-17 -monotone -f model/pharaoh.ini > ../eval
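Steps 4 and 5 produce the parallel files that training and decoding read (words.lt for letters, words.pn for phones). A minimal sketch of how such line-aligned files could be written is given here. It assumes one word per line with space-separated letters on the source side and space-separated phones on the target side; the real make_data.make_data may behave differently, and to_parallel_lines and write_parallel are hypothetical names.

```python
import os


def to_parallel_lines(word, phones):
    """Return (letter line, phone line) for one lexicon entry."""
    return " ".join(word), " ".join(phones)


def write_parallel(entries, out_dir, stem="words"):
    """Write [(word, [phones]), ...] to <out_dir>/<stem>.lt and <stem>.pn.

    Line i of the .lt file corresponds to line i of the .pn file.
    """
    os.makedirs(out_dir, exist_ok=True)
    lt_path = os.path.join(out_dir, stem + ".lt")
    pn_path = os.path.join(out_dir, stem + ".pn")
    with open(lt_path, "w") as lt, open(pn_path, "w") as pn:
        for word, phones in entries:
            letters, phone_line = to_parallel_lines(word, phones)
            lt.write(letters + "\n")
            pn.write(phone_line + "\n")
    return lt_path, pn_path
```

Treating each letter and each phone as a separate token is what lets a phrase-based MT system learn letter-sequence to phone-sequence mappings.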

Results

  1. cd ephones/mt/scripts
  2. python
  3. import make_data
  4. make_data.eval('../eval', '../corpus_test/words.pn', '../corpus_test/words.lt')
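The evaluation compares each decoder output line with the corresponding reference phone line. One way to produce the seven reported figures is sketched here. It is not the actual make_data.eval: the ARPAbet vowel set, the rule that a same-length mismatch counts as a vowel error only when every differing position has a vowel on both sides, and the names classify and score are all assumptions.

```python
# Assumption: ARPAbet vowel symbols with stress already removed.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}


def classify(hyp, ref):
    """Classify one hypothesis against its reference (lists of phones)."""
    if hyp == ref:
        return "right"
    if len(hyp) != len(ref):
        return "length"
    # Same length: inspect only the positions that disagree.
    diffs = [(h, r) for h, r in zip(hyp, ref) if h != r]
    if all(h in VOWELS and r in VOWELS for h, r in diffs):
        return "vowels"
    return "consonants"


def score(hyp_lines, ref_lines):
    """Return (wrong, right, total, ratio, lengths, vowels, consonants)."""
    counts = {"right": 0, "length": 0, "vowels": 0, "consonants": 0}
    for hyp, ref in zip(hyp_lines, ref_lines):
        counts[classify(hyp.split(), ref.split())] += 1
    total = sum(counts.values())
    wrong = total - counts["right"]
    ratio = counts["right"] / total if total else 0.0
    return (wrong, counts["right"], total, ratio,
            counts["length"], counts["vowels"], counts["consonants"])
```

Under this scheme every wrong transcription falls into exactly one of the three error categories, so the category counts add up to the number of wrong words.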

| Setting | Wrong | Right | Total | Ratio correct | Different lengths | Different vowels | Different consonants |
| lm 0.1, phrase penalty 0.1 | 87571 | 21785 | 109356 | 0.199211748784 | 45804 | 29240 | 12527 |
| lm 0.5, phrase penalty 0.5 | 91721 | 17635 | 109356 | 0.161262299279 | 44559 | 30519 | 16643 |
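The figures above can be cross-checked arithmetically: Wrong plus Right should equal Total, Ratio correct should equal Right divided by Total, and the three error categories should sum to Wrong. A small check, with the rows copied from the results above (the function name check and the tolerance are illustrative choices):

```python
# Columns: wrong, right, total, ratio correct,
#          different lengths, different vowels, different consonants
rows = {
    "lm 0.1, phrase penalty 0.1": (87571, 21785, 109356, 0.199211748784, 45804, 29240, 12527),
    "lm 0.5, phrase penalty 0.5": (91721, 17635, 109356, 0.161262299279, 44559, 30519, 16643),
}


def check(row, tol=1e-6):
    """Return True if the counts in one result row are mutually consistent."""
    wrong, right, total, ratio, lengths, vowels, consonants = row
    return (wrong + right == total
            and abs(right / total - ratio) < tol
            and lengths + vowels + consonants == wrong)


for name, row in rows.items():
    print(name, "consistent" if check(row) else "inconsistent")
```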

Topic revision: r1 - 06 Sep 2006 - 14:46:12 - Main.matthewa
 