Common Voice - Pronunciation Dictionaries (raw data)

The Mozilla Foundation is developing the “Common Voice” Project.

“Common Voice” is an excellent repository of audio recordings with transcriptions. The data can be freely downloaded and is available in more than 50 languages.

Recordings by Language (June 2022):

  • English: 2500 hours
  • German: 1200 hours
  • French: 1200 hours

The raw data comprises .mp3 audio files and a .csv file that maps them in 2 columns (audio filename + transcription). For example: GTS1681768.mp3 / The dog is sleeping.

Using the mappings in the .csv file, the audio files can be batch-renamed to build “Sound Libraries for GoldenDict”, or compiled into .mdx dictionaries. Here is a demo video for German:
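The batch renaming could be sketched roughly as follows. This is only a minimal illustration, not the script used for this project: it assumes a headerless, comma-separated .csv with exactly the two columns described above, and the names `safe_name` and `build_sound_dir` as well as the paths in the usage comment are hypothetical. It copies each audio file to a new file named after its transcription, so the originals stay untouched.

```python
import csv
import re
import shutil
from pathlib import Path


def safe_name(text, max_len=120):
    """Turn a transcription into a string that is usable as a file name."""
    # Drop characters that are not allowed in file names on common systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", text).strip().rstrip(".")
    # Collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned[:max_len].strip()


def build_sound_dir(csv_path, audio_dir, out_dir):
    """Copy each audio file to out_dir, named after its transcription."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip malformed lines
            filename, transcription = row[0], row[1]
            src = Path(audio_dir) / filename
            name = safe_name(transcription)
            if not src.is_file() or not name:
                continue  # skip missing audio or empty transcriptions
            target = out / f"{name}{src.suffix}"
            n = 2
            # The same sentence may be recorded by several speakers.
            while target.exists():
                target = out / f"{name} ({n}){src.suffix}"
                n += 1
            shutil.copy2(src, target)
            copied.append(target.name)
    return copied


# Hypothetical usage:
# build_sound_dir("mappings.csv", "clips", "sound_library_de")
```

With the example row above, GTS1681768.mp3 would be copied to a file named "The dog is sleeping.mp3". Repeated sentences get a numeric suffix here; whether that suits a GoldenDict sound directory is a design choice to check against your own setup.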

demo video of a Sound Library on GoldenDict (German).zip (2.3 MB)


I am making a “Sound Library” for GoldenDict from the German Common Voice data: a total of 1200 hours of recordings of short sentences (up to 14 words / 14 seconds each). It would contain more than 1 million sentences, and the dictionary would be around 10 GiB in .opus format.
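As a quick sanity check on these figures, the implied average clip length and average bitrate can be worked out from the numbers above. This is only back-of-the-envelope arithmetic: it takes 1 million sentences as a lower bound, so the real average clip is somewhat shorter, and the function names are made up for this sketch.

```python
def average_clip_seconds(hours, sentences):
    """Average clip length in seconds for a given total duration."""
    return hours * 3600 / sentences


def average_bitrate_kbps(size_gib, hours):
    """Average bitrate in kbit/s implied by total size and total duration."""
    total_bits = size_gib * 2**30 * 8
    return total_bits / (hours * 3600) / 1000


# Figures from the post: 1200 hours, >1 million sentences, about 10 GiB.
print(round(average_clip_seconds(1200, 1_000_000), 2))
print(round(average_bitrate_kbps(10, 1200), 1))
```

This gives an average of at most about 4.3 seconds per sentence and an average of roughly 20 kbit/s for the .opus files.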

Any feedback or ideas regarding a German Sound Library are welcome. I will finish the project this week. Please see the video above. I will publish the data as a “Sound Directory” for GD rather than as .mdx/.mdd files.

Feedback is welcome from all members of this forum :smiley: @dedict, @Existentialismus, @shiruxue, @GDictFan, @xiaoyifang, @hua


well done
