A script-kiddie Python script that scrapes wordreference.com bilingual dictionaries and converts them to MDX format.
This script was tested on Linux.
The script requires the requests and BeautifulSoup Python libraries to work.
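A quick way to confirm that both libraries are installed before running the script is sketched below. It assumes the usual import names, requests and bs4 (the module that the BeautifulSoup package installs). The helper name missing_modules is made up for this example and is not part of wr-scraper.py.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    # "bs4" is the import name assumed for BeautifulSoup.
    missing = missing_modules(["requests", "bs4"])
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```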
How to use the script:
python wr-scraper.py DICTIONARY_CODE wordlist outputfile
where DICTIONARY_CODE is the code for the desired dictionary, for example enzh for the English-Chinese dictionary and enar for the English-Arabic dictionary.
To get a list of all available dictionary codes:
python wr-scraper.py -l
For example, assuming you have a wordlist file named wordlist.txt and you want to scrape the English-Chinese dictionary:
python wr-scraper.py enzh wordlist.txt EnZh-dict.txt
To download audio files too, create a folder named sound next to the script and use the -a option:
python wr-scraper.py enzh -a wordlist.txt EnZh-dict.txt
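To illustrate how the dictionary code and the wordlist fit together, here is a minimal sketch of turning a wordlist into lookup URLs. It assumes the wordlist holds one word per line and that entries live at https://www.wordreference.com/DICTIONARY_CODE/word. The function names are hypothetical, and the real wr-scraper.py may do this differently.

```python
from urllib.parse import quote

BASE_URL = "https://www.wordreference.com"  # assumed URL layout

def read_wordlist(lines):
    """Return the stripped, non-empty words from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]

def build_url(dictionary_code, word):
    """Build the assumed lookup URL, percent-encoding the word."""
    return f"{BASE_URL}/{dictionary_code}/{quote(word)}"

if __name__ == "__main__":
    # Inline sample standing in for the contents of wordlist.txt.
    sample = ["hello\n", "\n", "ice cream\n"]
    for word in read_wordlist(sample):
        print(build_url("enzh", word))
```

Blank lines are skipped so that an empty line in the wordlist does not trigger a pointless request.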
Important Notes:
- Words that are not available on WordReference will be written to a file named errors.txt.
- The WordReference website shows a captcha after a lot of words have been scraped (between 1500 and 2000 words in my case), so the script warns you about that and asks you to go to the website, solve the captcha, and then press any key to continue.
- Audio file names don’t always reflect what they actually are. For example, the audio file for UK English is named word-general on the website, so the script uses general for the audio link in the dictionary. You might need to do a simple search-and-replace to fix that.
- The script checks whether the audio file exists in the sound folder before downloading it, so if you want to make the process a little faster and reduce the load on the website's server, you can unpack audio files from other dictionaries' .mdd files if available. I made an English-Arabic dictionary with audio, so you can get it here and unpack the mdd file to get English sounds if you want to make an English dictionary.
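The check-before-download behaviour from the last note can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the function name fetch_audio_if_missing and the download callback are hypothetical, and creating the sound folder automatically is a convenience added here.

```python
import os

def fetch_audio_if_missing(filename, download, sound_dir="sound"):
    """Save an audio file into sound_dir unless it is already there.

    `download` is a caller-supplied function that takes the file name
    and returns the file's bytes (hypothetical, for illustration).
    Returns True if a download happened, False if the file was reused.
    """
    os.makedirs(sound_dir, exist_ok=True)  # convenience, not in the original
    path = os.path.join(sound_dir, filename)
    if os.path.exists(path):
        return False  # already unpacked or downloaded earlier
    data = download(filename)
    with open(path, "wb") as f:
        f.write(data)
    return True
```

Any file that was unpacked into the sound folder beforehand is treated as already downloaded, so no request is made for it.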
wr-scraper.py (8.6 KB)