WordReference Python scraper

Wissam · November 20, 2023, 14:47

A script-kiddie Python script to scrape wordreference.com bilingual dictionaries and convert them to mdx format.
The script was tested on Linux.
It requires the requests and BeautifulSoup Python libraries to work.

How to use the script:

python wr-scraper.py DICTIONARY_CODE wordlist outputfile

where DICTIONARY_CODE is the code for the desired dictionary, for example enzh for the English-Chinese dictionary and enar for the English-Arabic dictionary.
To get a list of all available dictionary codes:

python wr-scraper.py -l
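
As a rough illustration of what a dictionary code refers to: on the website, the code appears to be the first path segment of a lookup URL (the code, then the word). The helper below is a minimal sketch under that assumption. `build_lookup_url` and `BASE_URL` are hypothetical names, and this is not necessarily how wr-scraper.py builds its requests.

```python
from urllib.parse import quote

# Assumed URL layout: https://www.wordreference.com/<code>/<word>
BASE_URL = "https://www.wordreference.com"


def build_lookup_url(dictionary_code: str, word: str) -> str:
    """Return the assumed lookup URL for one word in one bilingual dictionary."""
    # Percent-encode the word so spaces and non-ASCII characters are URL-safe.
    return f"{BASE_URL}/{dictionary_code}/{quote(word.strip())}"
```

For instance, `build_lookup_url("enzh", "hello")` would point at the English-Chinese entry for "hello", if the assumed layout holds.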

For example, assuming you have a wordlist file named wordlist.txt and you want to scrape the English-Chinese dictionary:

python wr-scraper.py enzh wordlist.txt EnZh-dict.txt

To download audio files as well, create a folder named sound next to the script and use the -a option:

python wr-scraper.py enzh -a wordlist.txt EnZh-dict.txt

Important Notes:

Words that are not available on WordReference are written to a file named errors.txt.
The WordReference website shows a captcha after a lot of words have been scraped (between 1500 and 2000 words in my case). The script warns you when that happens and asks you to go to the website, solve the captcha, and then press any key to continue.
Audio file names don’t always reflect what they actually are. For example, the audio file for UK English is named word-general on the website, so the script uses general for the audio link in the dictionary. You might need to do a simple search-and-replace to fix that.
The script checks whether an audio file already exists in the sound folder before downloading it. So, to make the process a little faster and reduce the load on the website’s server, you can unpack audio files from other dictionaries’ .mdd files if available. I made an English-Arabic dictionary with audio, so you can get it here and unpack the mdd file to get the English sounds if you want to make an English dictionary.
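
The notes above can be sketched roughly as three small helpers: one for logging missing words, one for the "already downloaded" check, and one for the search-and-replace. This is only an illustration of the ideas, not the code in wr-scraper.py. The function names are hypothetical, and the exact strings to replace depend on what your output file actually contains.

```python
from pathlib import Path


def record_missing(word: str, errors_path: str = "errors.txt") -> None:
    """Append a word that was not found to the errors file, one word per line."""
    with open(errors_path, "a", encoding="utf-8") as f:
        f.write(word + "\n")


def audio_already_present(filename: str, sound_dir: str = "sound") -> bool:
    """Return True if the audio file is already in the sound folder (skip download)."""
    return (Path(sound_dir) / filename).is_file()


def relabel_audio(dictionary_path: str, old: str, new: str) -> int:
    """Replace every occurrence of `old` with `new` in the output file.

    Returns the number of replacements made. This is a plain text replace,
    so `old` should be specific enough not to match ordinary entry text.
    """
    path = Path(dictionary_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(old)
    path.write_text(text.replace(old, new), encoding="utf-8")
    return count
```

Since the replace is literal, searching for the bare word general would also change that word inside definitions. It is safer to search for the full audio link text as it appears in your output file.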

wr-scraper.py (8.6 KB)

Howie · February 2, 2024, 02:49

I tried, but it doesn’t work. How do I get esen and enes wordlists?

Wissam · February 4, 2024, 07:59

You should provide your own wordlist. You can google that; there are multiple wordlists for multiple languages.
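
If a downloaded wordlist needs tidying before use, something like the sketch below may help. It assumes the wordlist is a plain text file with one word per line, which is an assumption and not confirmed from the script. `clean_wordlist` is a hypothetical helper name.

```python
def clean_wordlist(src_path: str, dst_path: str) -> int:
    """Strip whitespace, drop blank lines and duplicates, keep first-seen order.

    Returns the number of words written to dst_path.
    """
    seen = set()
    words = []
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                words.append(word)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # One word per line, with a trailing newline when the list is not empty.
        dst.write("\n".join(words) + ("\n" if words else ""))
    return len(words)
```

Removing duplicates up front also avoids requesting the same page twice, which keeps the load on the website down.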

gutocwb97 · March 2, 2024, 12:28

How do I clean up the scraped data and assemble the dictionary after scraping, my friend?

Wissam · March 2, 2024, 19:36

Sorry, can you be more specific please?
Did you mean converting to mdx and mdd?
Because there is nothing to clean.

ppxia · March 4, 2024, 05:56

Any chance of adding some code to get the Random House related dictionaries here? Though I’m not sure whether they have updated the dictionaries since 2021.

Oh, sorry, please ignore my previous request. The RH page structure is so different from that of the bilingual pages.

user3 · October 3, 2024, 15:38

“Did you mean converting to mdx and mdd?” Yes, that’s my question. How did you do that?

user3 · October 5, 2024, 14:39

I was a bit lost at first, so I’ve added some steps on GitHub to help others through the process. Check it out: GitHub - anatolepain/wr2mdx

liuk · October 31, 2024, 20:03

It would be a little better if the script created a separate CSS class for the word class, because right now we cannot style it on its own, since the translation and the word class are lumped together.