Use this page to track issues with the GlobalPhone corpus, e.g. silent audio files, missing transcriptions etc. Some of these problems don't matter but others do. If you can turn this into one of those clever tables you see on Wikipedia where you can sort by different columns, that'd be great (one column could be language, so you could focus on the language you're working on) - not sure if twiki can do that.

1. The docs mention a transcription convention of having non-verbal sounds and stuttered words written within angle brackets. In (at least) Arabic and Swedish angle brackets are also used to indicate numbers e.g.

; 16:
wa Altazamtu bi+kulli il+bunUdi il+QAnUnEyaB wa amDaytu il+caQda bi+tArEx <1996/3/30>
; 89:
wa tuQaddar HAjiyAti il+ttaclEmi il+cAlE xilAla il+ssabca sanawAti il+QAdimaB bi <3500> ustAV jAmicE

; 81:
I dag finns det drygt tolvtusen <#12_000> svenska åkerier registrerade för yrkesmässig trafik

; 101:
marknadsandelar Mercedes fyrtioen <#41> komma fyra <#4> procent MAN tjugonio <#29> procent Volvo åtta <#8> komma fem <#5> procent Scania sju <#7> komma sju <#7> procent Iveco sex <#6> komma sju <#7> procent och övriga märken sex <#6> komma sju <#7> procent

In Swedish, numbers seem to be written in digits after being written in words - is that a convention people stuck to? Is the same true of Arabic? The same notation seems to be used for foreign words:

; 56:
<lC> <manaCl> <dC> <fransaCs> <GrammaCrC> <fransaCsC> <sCCnsCs> <fCysCQaCs> <t> <dakatCCn> <tCknCQaC> <bCshCttC> <dC> <tb>

However, the following utterances contain foreign words outside angle brackets (and/or characters outside the set specified in the Character Chart):
AR003.rmn 5 & 6, AR035.rmn 9 & 10, AR055.rmn 25-33
Would it be safest just to ignore text within angle brackets?
And what does <#> signify in the Russian transcriptions, in e.g. RU068.rmn?
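If ignoring bracketed text does turn out to be the safest option, it could be done locally along these lines. This is only a rough sketch: the name `strip_angle_tokens` is made up for illustration, and it assumes angle-bracket tokens never nest, which matches the examples above but hasn't been checked for every language.

```python
import re

# Matches one non-nested angle-bracket token such as <3500> or <#12_000>.
ANGLE_TOKEN = re.compile(r"<[^<>]*>")


def strip_angle_tokens(line):
    """Remove angle-bracket tokens and collapse the leftover whitespace."""
    without_tokens = ANGLE_TOKEN.sub(" ", line)
    return " ".join(without_tokens.split())


print(strip_angle_tokens("sju <#7> komma sju <#7> procent"))
```

For the Swedish lines this keeps the spelled-out number. For the Arabic lines it drops the number altogether, because there is no spelled-out form next to the digits, so the text would no longer match the audio.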

2. The Mandarin corpus partitioning seems to have some articles appearing in more than one set (looking at .spk files). The table below shows the speakers speaking each duplicated article in each set:

article eval dev train
a0602.004 032 033
a0620.004 089 063
b0620.002 089 063,090
e0531.004 080 079
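A check like the one behind this table could be sketched as follows. The .spk file format isn't parsed here; the sketch assumes you have already extracted two mappings (both names are hypothetical): `articles_by_speaker`, from speaker ID to the articles they read, and `set_of_speaker`, from speaker ID to `train`, `dev` or `eval`. The IDs in the example are made up.

```python
from collections import defaultdict


def articles_in_multiple_sets(articles_by_speaker, set_of_speaker):
    """Return {article: {set_name: [speakers]}} for articles read in more than one set."""
    seen = defaultdict(lambda: defaultdict(list))
    for speaker, articles in articles_by_speaker.items():
        for article in articles:
            seen[article][set_of_speaker[speaker]].append(speaker)
    return {
        article: {name: sorted(speakers) for name, speakers in by_set.items()}
        for article, by_set in seen.items()
        if len(by_set) > 1
    }


# Toy input with made-up speaker and article IDs.
toy_articles = {"s1": {"artA"}, "s2": {"artA", "artB"}, "s3": {"artB"}}
toy_sets = {"s1": "eval", "s2": "train", "s3": "train"}
print(articles_in_multiple_sets(toy_articles, toy_sets))
```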

3. The audio for speakers CR021 and CR091 in v2.1 of the Croatian database seems to be missing.

4. CR088.rmn and CR088.trl have the comment indicating utterance 7 as part of the transcription for utterance 6, not on a new line (this just makes life difficult for one of my scripts - I could fix this locally)

5. I can't see any romanised transcriptions for the Czech data - are there any?

6. Trying to decompress utterances GE003_2, GE004_34 and GE004_181 results in a "premature EOF on compressed stream" error with shorten.

7. The Portuguese development set contains two speakers, PO065 and PO073, who don't exist (no audio or .spk file).

8. Files PO136_1.adc.shn through to PO136_50.adc.shn inclusive cause shorten to give a "No magic number" error when decompressing.

9. PO136 has 73 utterances transcribed but 73 audio files - the first 50 are empty (giving the "No magic number" error mentioned earlier on) while the remainder are valid .shn files but empty. ...There were problems extracting features from seven other Portuguese speakers. The DVD was noisier than for the other languages, if that means anything, so I'm wondering if there's a problem with the disc? The problematic speakers were 14, 15, 17, 22, 26, 31, 36, 58 and 136. Problem utterances were at least PO014_125, PO014_126, PO015_100, PO017_160, PO022_168, PO026_21, PO031_52, PO036_27, PO058_16, PO058_18 and PO136_10. I haven't looked at this any further, in case it just turns out to be a scratched DVD. Need to sort out PO136 before reporting results since it's an eval-set speaker.
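To separate files that are completely empty from files that merely lack a Shorten header, without running shorten on each one, something like the sketch below might help. It assumes that Shorten files begin with the four bytes `ajkg`; that is my understanding of the format rather than something taken from the corpus docs, so verify it against a file that is known to decompress. The function names are made up.

```python
SHORTEN_MAGIC = b"ajkg"  # assumption: Shorten's 4-byte file signature


def classify_header(head):
    """Classify the first bytes of a file as 'empty', 'no_magic' or 'has_magic'."""
    if not head:
        return "empty"
    if head[: len(SHORTEN_MAGIC)] == SHORTEN_MAGIC:
        return "has_magic"
    return "no_magic"


def classify_shn(path):
    """Read only the first few bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(len(SHORTEN_MAGIC)))
```

A `has_magic` result only means the header looks right. It says nothing about whether the file contains any audio, so it would not catch the "valid but empty" files.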

10. There's another (minor) problem with table 9 - the number of Russian utterances is one less than stated. Utterance RU005_9 seems to be repeated in the corpus, once as RU005_9.adc.ori.shn and again as RU005_9.adc.shn (the file contents are identical).

11. The 30th utterance in SW005.trl contains a nbsp character.

12. The following audio files have no transcription:
AR005_76 silent
AR006_54 silent
AR006_58 silent
AR006_63 silent
AR006_70 silent
AR029_1 silent
AR104_1 silent
CH046 33 utterances transcribed, 33 non-empty audio files but 77 audio files in total.
CH064_128 recording exists, but no transcription
CH073_117 recording exists, but no transcription
CH073_118 recording exists, but no transcription
CH073_119 recording exists, but no transcription
CH073_62 recording exists, but no transcription
CH073_63 recording exists, but no transcription
CH076_103 recording exists, but no transcription
CH076_116 recording exists, but no transcription
CH076_117 recording exists, but no transcription
CH091_43 recording exists, but no transcription
CR062_42 silent
CR072_15 recording exists, but no transcription
PO003_61 silent
PO014_117 silent
PO015 66 utterances transcribed, 88 non-empty audio files and 126 audio files in total.
PO021_172 silent
PO031_36 silent
PO031_52 silent
RU006_121 silent
RU114_13 silent
SP007_15 recording exists, but no transcription
SP007_16 recording exists, but no transcription
SP007_18 recording exists, but no transcription
SP035 first 57 utterances are untranscribed
SP040 utterances 56 and 57 are untranscribed
SP041 first 87 utterances are untranscribed
SP058 utterances 45 till 73 are untranscribed
SP078 first 53 utterances are untranscribed
SP077_56 recording exists, but no transcription
SW015_20 recording exists, but no transcription (ignoring for now, replace when fixed)
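A list like this can be regenerated by comparing the utterance IDs found in the audio directories with those found in the transcriptions. A minimal sketch, assuming IDs of the form SPEAKER_NUMBER and that both ID lists have already been collected (the function names and the IDs in the example are made up):

```python
def utterance_key(utt_id):
    """Sort key: 'XX073_62' -> ('XX073', 62), so _62 sorts before _117."""
    speaker, _, number = utt_id.rpartition("_")
    return (speaker, int(number))


def compare_ids(audio_ids, transcribed_ids):
    """Return the IDs that appear on only one side, in speaker/number order."""
    audio, transcribed = set(audio_ids), set(transcribed_ids)
    return {
        "audio_only": sorted(audio - transcribed, key=utterance_key),
        "transcript_only": sorted(transcribed - audio, key=utterance_key),
    }


print(compare_ids(["XX001_1", "XX001_117", "XX001_62"], ["XX001_1", "XX001_5"]))
```

The same comparison covers the opposite case as well (transcription but no recording). It can't tell a silent recording from a real one, so that still needs a separate check on the audio.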

22. Edit utterance 60 in files GE051.trl and GE051.rmn such that "3D-Methode" reads "3 D Methode" - this would avoid an OOV

24. Edit utterance 15 in files GE077.trl and GE077.rmn such that "15-jahres-frist" reads "15 jahres frist" - this would avoid an OOV


13. The documentation for the dictionaries seems to have a typo, I'd put:

< Portuguese: 54162
< well documented in the corresponding pdfs, such as CZ, FR, PO.
< The phone set for CH seems undocumented but self-explanatory.
> well documented in the corresponding pdfs, such as CZ, FR, PO, SP.

14. Utterance 171 of GE030 ends with the word petitessen, which doesn't appear in the dictionary

15. Some languages seem to have lots of words without pronunciations (the Arabic figure seems to be a mistake on my part and should be ignored):
lang #oov
ar 20121
ch 3
cr 89
cz 778
fr 673
ge 0
po 2162
ru 619
sp 1246
sw 0
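For reference, one way to produce counts like these is sketched below. It counts distinct word types rather than tokens, and it assumes transcriptions are split on whitespace and compared against the dictionary headwords exactly as written; whether the figures above were computed that way is not stated. The function name and toy data are made up.

```python
def count_oov_types(transcriptions, vocabulary):
    """Count distinct whitespace-separated words that are absent from the vocabulary."""
    words = set()
    for line in transcriptions:
        words.update(line.split())
    return len(words - set(vocabulary))


toy_lines = ["alpha beta gamma", "beta delta"]
toy_vocab = {"alpha", "beta"}
print(count_oov_types(toy_lines, toy_vocab))
```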

26. Replace commas and dots in the cz corpus (trl) with spaces. There aren't any words in the dictionary containing either of those characters so this won't turn an in-vocabulary word into an oov. (this can be done locally at Edinburgh (27))
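A sketch of that local clean-up, assuming it is applied to the transcription text only (not to comment or header lines in the .trl files). The function name is made up, and collapsing the resulting runs of spaces is an extra choice on top of what the item asks for.

```python
# Map both punctuation characters to a space.
PUNCT_TO_SPACE = str.maketrans({",": " ", ".": " "})


def replace_commas_and_dots(line):
    """Replace commas and dots with spaces, then collapse repeated whitespace."""
    return " ".join(line.translate(PUNCT_TO_SPACE).split())


print(replace_commas_and_dots("a, b.c"))
```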

28. Turn French words beginning with l' or d' into two separate words e.g. l'accident -> l' accident (this can be done locally (29))
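This split could be done with a regular expression, roughly as below. The sketch assumes the apostrophe in the transcriptions is a plain ASCII `'` and only touches l' and d' at the start of a word, so a word-internal d' is left alone. The function name is made up.

```python
import re

# l' or d' at the start of a word, directly followed by another word character.
ELISION = re.compile(r"\b([ld]')(?=\w)", re.IGNORECASE)


def split_elisions(line):
    """Insert a space after a word-initial l' or d'."""
    return ELISION.sub(r"\1 ", line)


print(split_elisions("l'accident d'hier"))
```

Lines that have already been split are left unchanged, so running it twice is harmless.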

30. What's the difference between M_+QK and M_+hGH ?

31. The German dictionary contains two identical pronunciations for achtundzwanzig

32. The speaker in GE005_144 only says half of what the transcription says he says

33. The speaker in SW086_84 only says half of what the transcription says she says

34. The speaker in GE011_122 doesn't seem to read the last few words of the transcribed utterance

35. Utterance SW050_94 causes problems when I try to realign it. I can't confirm that the speaker is saying what is in the transcription (but I don't speak Swedish)

36. The following utterances have transcriptions but not recordings:


37. The audio keeps cutting in and out on utterances GE025_64 and GE040_94

38. Utterances SP001_18, SP005_24, SP014_72, SP016_25, SP018_7, SP054_27 and SP079_1 cause problems when I try to realign - the speaker seems to be reading only the first half of the transcribed utterance

39. Utterances CR005_0, CR002_77 and CR030_25 have no transcription

40. The recording for utterance CR060_16 doesn't seem to be the same as the one transcribed (but I don't speak Croatian)

41. The speaker seems to be speaking only the end of utterance CR069_11

42. Utterances CR026_17 and CR048_36 break the forced alignment script, don't know why. Also CR047_16 CR047_17...CR047_44, CR051_30...CR051_53, CR054_47... maybe this is down to those mysterious killings - fix that and then come back to this - that seems fixed, waiting to retest this

43. The recording of utterance SW067_113 is cut off early

44. In Croatian, what IPA phones do M_cp, M_dp and M_zj map to?

45. In Swedish, what IPA phones do M_uxl, M_cr, M_lr, M_nr, M_tr, and M_ox map to?

46. Does an l at the end of a phone, e.g. M_uel, denote vowel lengthening?

47. The final word in SP042_20 is cut off halfway

48. The transcription for utterance SP046_150 reads "...llegan a cerca de los de presupuesto que..." when in fact the speaker says "llegan a cerca de presupuesto que"

49. The transcription for SP086_50 states "Poli+tica" is said twice but it's only said once.

50. Where the transcription for SP094_13 says 1995 the speaker actually said 95

51. The end of the audio for utterance SW009_125 seems to be cut off

52. The end of the audio for utterance CR083_58 seems to be cut off (there might also be an untranscribed word fragment before "Medical")

53. I can't see the GlobalPhone phones pes, tes or kes (seen on that annotated IPA chart) appearing in any of the dictionaries - are they used somewhere?

54. Is there a chart mapping the Arabic, Mandarin, Czech and Russian phones to IPA phones? For example, what does a tilde signify in the Russian dictionary?

55. The speaker appears to say international rather than national, in GE007_22

56. Missing transcription: CH025_76, CH051_81, CH063_121, CH084_103, CH084_117, CH091_43

57. In the Russian dictionary, I'm assuming the tilde after a consonant means it's a palatalized phone but what are jscH jscH~ ?

58. SP049_1 was missing the word "Nicaragua" in transcription

59. PO111_63, PO111_66, PO112_20 excluded because of dictionary problems - fix your dictionary scripts and re-include

60. Local issue: removed dollar sign from SP014_77 and put in a "do+lares" at the end to make things easier. Also, SP022_29: 2/3 --> dos tercios, SP002_62: removed /

61. CH068_53 has OOV word tong3yi1zhan4xian4ta1. tong3yi1zhan4xian4 is in the lexicon but I don't know if that's what was intended, so excluding that utterance.

62. Excluded RU006_41 RU008_9 RU008_10 RU013_92 RU013_130 RU015_75 RU032_12 RU037_56 RU037_57 RU050_73 RU050_76 RU052_37 RU073_42 RU086_21 RU093_12 RU099_13 - put them back once you know how to treat the % sign

63. CH025_53, CH032_9, CH035_50, CH038_39, CH042_13, CH044_55, CH046_33, CH047_67, CH055_63, CH059_26, CH063_102, CH064_73, CH073_112, CH073_113, CH076_11, CH076_100, CH081_101, CH084_110, CH084_111, CH084_113, CH088_103, CH107_19, CH109_36, CH119_30, CH126_62, CH132_10 don't align - ask Songfang to listen to them, exclude from corpus for now

64. PO101_1 - excluding because of / in date - find out what is actually said in PO101_1 and PO101_2 and put back in

65. Confirm transcription of PO115_59, PO116_22 (SANT'ANNA?) and then put back

66. RU006_121 and RU114_13 have no transcriptions, so excluded. (find out what the audio says)

67. PO113_1, PO113_6 and PO142_16 - transcription edited to remove ` and '

68. removed SIL from the provided Portuguese dictionary data/globalphone-po/lex/phoneme-po/Portuguese-GPDict.txt (it doesn't appear in transcriptions)

69. Changed the word SIL to SILENTPHONE in the Russian original dictionary

70. Removed {ge1bei5} {{g WB} e1 b {ei5 WB}} and {mei4mei5} {{m WB} ei4 m {ei5 WB}} from the Mandarin dictionary because they used the phone ei5 which doesn't appear in the data

71. Removed {Nejwif~yuel~} {{n WB} e j w i f~ yu e {l~ WB}} from the Russian dictionary because it contains the only word with a palatalized f, of which there is only one instance in the corpus. Also excluded that instance, RU104_94.

72. CH096_60 excluded because of alignment problems

73. Removed {di4xiong5} {{d WB} i4 x io5 {ng WB}} from the Mandarin dictionary because it contains the only instance of io5, which doesn't appear in the data.

74. Removed PO003_86, PO004_45, PO139_34 because of alignment problems

75. Removed {tong4kuai5} {{t WB} o4 ng k {uai5 WB}} and {tong4kuai5lin4li2} {{t WB} o4 ng k uai5 l i4 n l {i2 WB}} from the Mandarin dictionary because they used phone uai5 which doesn't appear in the data

76. Removed {lou4ma3jiao5} {{l WB} ou4 m a3 j {iao5 WB}} from the Mandarin dictionary because it used phone iao5 which doesn't appear in the data

77. Removed {tun5} {{t WB} ue5 {n WB}} from the Mandarin dictionary because it used ue5 which doesn't appear in the data
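Dictionary entries like the ones removed in items 70, 73 and 75-77 can be found automatically. The sketch below assumes every entry looks like the ones quoted on this page, i.e. `{word} {{p WB} ... {q WB}}`, with `WB` as a word-boundary tag rather than a phone; entries in any other layout raise an error instead of being guessed at. The function names are made up, and `phones_in_data` stands for whatever set of phones you have collected from the aligned data.

```python
import re

ENTRY = re.compile(r"\s*\{([^{}]*)\}\s+(.*)")


def parse_entry(line):
    """Split '{word} {{p WB} a {q WB}}' into ('word', ('p', 'a', 'q'))."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"unrecognised dictionary line: {line!r}")
    word, pronunciation = match.groups()
    tokens = re.sub(r"[{}]", " ", pronunciation).split()
    return word, tuple(token for token in tokens if token != "WB")


def entries_with_unseen_phones(dictionary_lines, phones_in_data):
    """Yield (word, unseen_phones) for entries that use phones absent from the data."""
    seen = set(phones_in_data)
    for line in dictionary_lines:
        if not line.strip():
            continue  # skip blank lines
        word, phones = parse_entry(line)
        unseen = sorted(set(phones) - seen)
        if unseen:
            yield word, unseen


lines = ["{ge1bei5} {{g WB} e1 b {ei5 WB}}"]
print(list(entries_with_unseen_phones(lines, {"g", "e1", "b"})))
```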

78. RU006_98 and RU059_51 failed to align - excluding

79. RU065_34 is much longer than other utterances (6x median length for that speaker) - listen to and possibly exclude
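The same length check can be run over the whole corpus to catch similar cases. A minimal sketch, assuming a hypothetical `durations` mapping from utterance ID (SPEAKER_NUMBER) to length in seconds has already been built; the IDs and durations in the example are made up.

```python
from statistics import median


def speaker_of(utt_id):
    """'XX001_3' -> 'XX001'."""
    return utt_id.rpartition("_")[0]


def long_utterances(durations, factor=6.0):
    """Return IDs whose duration is more than `factor` times their speaker's median."""
    by_speaker = {}
    for utt_id, seconds in durations.items():
        by_speaker.setdefault(speaker_of(utt_id), []).append(seconds)
    medians = {speaker: median(values) for speaker, values in by_speaker.items()}
    return sorted(
        utt_id
        for utt_id, seconds in durations.items()
        if seconds > factor * medians[speaker_of(utt_id)]
    )


toy = {"XX001_1": 5.0, "XX001_2": 6.0, "XX001_3": 40.0, "XX002_1": 4.0}
print(long_utterances(toy))
```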

-- Main.s0565860 - 15 Oct 2009

