Common Voice - Pronunciation Dictionaries (raw data)

tovaremeterio · June 2, 2022, 09:43

The Mozilla Foundation is developing the “Common Voice” Project.

“Common Voice” is an excellent repository of audio recordings with transcriptions. The data can be downloaded freely and is available in more than 50 languages.

Recordings by Language (June 2022):

English: 2500 hours
German: 1200 hours
French: 1200 hours

The raw data comprises .mp3 audio files and a .csv file that maps them in two columns (audio filename + transcription). For example: GTS1681768.mp3 / The dog is sleeping.

Using the .csv file mappings, it is possible to batch-rename the audio files to make “Sound Libraries for GoldenDict”, or to compile .mdx dictionaries. Here is a demo video for German:

demo video of a Sound Library on GoldenDict (German).zip (2.3 MB)
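
The batch-rename step could be sketched roughly as follows. This is only a minimal illustration, assuming a two-column .csv without quoting surprises (audio filename, transcription) as described above, and assuming that GoldenDict takes the headword from the file name in a sound directory. The function names `safe_name` and `build_sound_dir` and the folder layout are hypothetical, and the real Common Voice download may be laid out differently.

```python
import csv
import re
import shutil
from pathlib import Path


def safe_name(sentence: str, max_len: int = 120) -> str:
    """Turn a transcription into a filename-safe headword (hypothetical helper)."""
    # Drop characters that are invalid in file names on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", sentence).strip().rstrip(".")
    # Truncate long sentences so the file name stays within typical limits.
    return cleaned[:max_len].strip()


def build_sound_dir(csv_path, audio_dir, out_dir):
    """Copy each audio file into out_dir, named after its transcription."""
    audio_dir, out_dir = Path(audio_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            filename, sentence = row[0], row[1]
            src = audio_dir / filename
            name = safe_name(sentence)
            if not src.is_file() or not name:
                continue  # skip header rows and missing recordings
            dst = out_dir / f"{name}{src.suffix}"
            if dst.exists():
                continue  # keep the first recording of a repeated sentence
            shutil.copy2(src, dst)
            copied.append(dst.name)
    return copied
```

With the example row above, `GTS1681768.mp3` would be copied as `The dog is sleeping.mp3`. Copying rather than renaming in place keeps the original download untouched, which makes it easier to rerun the script after changing the naming rules.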

I am making a “Sound Library” for GoldenDict from the German Common Voice data: a total of 1200 hours of recordings in short sentences (up to 14 words / 14 seconds). The number of sentences would be >1 million, and the size of the dictionary would be around 10 GiB in .opus format.
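
As a rough sanity check on these figures, the average clip length, file size, and bitrate they imply can be worked out as below. This is only back-of-the-envelope arithmetic, assuming exactly 1200 hours, exactly 1 million sentences (the lower bound given above), and exactly 10 GiB; the function name `estimate` is hypothetical.

```python
def estimate(hours: float, sentences: int, size_gib: float) -> dict:
    """Derive average per-clip figures from the totals quoted in the post."""
    total_seconds = hours * 3600
    total_bytes = size_gib * 1024**3
    return {
        # Average duration of one sentence recording, in seconds.
        "avg_clip_seconds": total_seconds / sentences,
        # Average size of one .opus file, in KiB.
        "avg_file_kib": total_bytes / 1024 / sentences,
        # Average bitrate implied by total size over total duration, in kbit/s.
        "avg_bitrate_kbps": total_bytes * 8 / total_seconds / 1000,
    }


figures = estimate(hours=1200, sentences=1_000_000, size_gib=10)
for key, value in figures.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions the average clip is a little over 4 seconds and the implied bitrate is roughly 20 kbit/s. With more than 1 million sentences, the per-clip averages would be correspondingly smaller.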

Any feedback or ideas regarding a German Sound Library are welcome. I will finish the project this week. Please see the video above. I will publish the data as a “Sound Directory” for GD instead of an .mdx/.mdd.

Feedback is welcome from all members of this forum: @dedict, @Existentialismus, @shiruxue, @GDictFan, @xiaoyifang, @hua

xiaoyifang · June 2, 2022, 10:56

well done