A quick-and-dirty Python script that scrapes wordreference.com bilingual dictionaries and converts them to MDX format.
This script was tested on Linux.
The script requires the BeautifulSoup Python library to work.
How to use the script:
python wr-scraper.py DICTIONARY_CODE wordlist outputfile
DICTIONARY_CODE is the code of the desired dictionary, for example
enzh for the English-Chinese dictionary and
enar for the English-Arabic dictionary.
To get a list of all available dictionary codes:
python wr-scraper.py -l
For example, assuming you have a wordlist file named
wordlist.txt and you want to scrape the English-Chinese dictionary:
python wr-scraper.py enzh wordlist.txt EnZh-dict.txt
To download audio files too, create a folder named
sound next to the script and use the -a option:
python wr-scraper.py enzh -a wordlist.txt EnZh-dict.txt
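The command-line shape described above (three positional arguments, -l to list codes, -a for audio) could be parsed roughly as follows. This is only a sketch with argparse, not the code in wr-scraper.py: the names SAMPLE_CODES, ListCodes and build_parser are made up for illustration, and the code table holds only the two codes mentioned in this text.

```python
import argparse

# Assumption: only the two codes named in this README; the real script knows more.
SAMPLE_CODES = {"enzh": "English-Chinese", "enar": "English-Arabic"}


class ListCodes(argparse.Action):
    """Hypothetical action: print the known codes and exit, like -h does for help."""

    def __init__(self, option_strings, dest, **kwargs):
        # nargs=0 makes -l a plain flag that takes no value.
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        for code, name in SAMPLE_CODES.items():
            print(f"{code}\t{name}")
        # Exit before argparse complains about the missing positionals.
        parser.exit()


def build_parser():
    parser = argparse.ArgumentParser(prog="wr-scraper.py")
    parser.add_argument("-l", action=ListCodes,
                        help="list available dictionary codes and exit")
    parser.add_argument("-a", dest="audio", action="store_true",
                        help="also download audio files into the sound folder")
    parser.add_argument("dictionary_code", help="for example enzh or enar")
    parser.add_argument("wordlist", help="file containing the words to look up")
    parser.add_argument("outputfile", help="where the dictionary text is written")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dictionary_code, args.wordlist, args.outputfile, args.audio)
```

Making -l an exiting action lets "python wr-scraper.py -l" work without the three positional arguments, while -a can still sit between them as in the example above.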
- Words that are not available on WordReference will be written to a file named
- The WordReference website shows a captcha after a lot of words have been scraped (between 1500 and 2000 words in my case). The script warns you when this happens and asks you to go to the website, solve the captcha, and then press any key to continue.
- Audio file names don’t always reflect what they actually are. For example, the audio file for UK English is named
word-general on the website, so the script uses
general for the audio link in the dictionary. You might need to do a simple search-and-replace to fix that.
- The script checks whether the audio file exists in the
sound folder before downloading it. To make the process a little faster and reduce the load on the website's server, you can unpack audio files from other dictionaries'
.mdd files if available. I made an English-Arabic dictionary with audio, so you can get it here and unpack the .mdd file to get the English sounds if you want to make an English dictionary.
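The check-before-download behaviour from the last note can be sketched like this. It is an illustration under assumptions, not the script's actual code: ensure_audio, fetch_with_urllib, the .mp3 file name and the example URL are all hypothetical, and the sketch creates the sound folder itself if it is missing, which the real script may not do.

```python
from pathlib import Path
from typing import Callable
import urllib.request


def fetch_with_urllib(url: str) -> bytes:
    """One possible downloader; the real script may fetch audio differently."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def ensure_audio(name: str, url: str,
                 fetch: Callable[[str], bytes] = fetch_with_urllib,
                 sound_dir: Path = Path("sound")) -> bool:
    """Download url to sound_dir/name unless that file already exists.

    Returns True if a download happened, False if the existing file was reused.
    """
    target = sound_dir / name
    if target.exists():
        # Already present, e.g. unpacked from another dictionary's .mdd file.
        return False
    # Assumption: create the folder if needed instead of failing.
    sound_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(url))
    return True


if __name__ == "__main__":
    # Hypothetical file name and URL, for illustration only.
    downloaded = ensure_audio("hello.mp3", "https://example.invalid/hello.mp3")
    print("downloaded" if downloaded else "already in sound folder")
```

Because existing files are skipped, anything you unpack into the sound folder beforehand is never requested from the server.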
wr-scraper.py (8.6 KB)