Wikimedia-MDX – Convert Wikimedia HTML Dumps into MDX

Hi everyone! :waving_hand:

I’ve been working on a Python tool that converts Wikipedia and Wiktionary HTML dumps into MDX dictionary files.

:link: GitHub: leanhdung1994/Wikimedia-MDX (converts Wikipedia and Wiktionary HTML dumps into offline MDX dictionaries)


What it does

It takes the official .tar.gz HTML dumps from Wikimedia and processes them into .mdx files — complete with the original Wikipedia/Wiktionary styling (CSS & JS included). The result (see e.g. here) looks very close to the real website, just offline.

It supports:

  • Wikipedia: English, French, Japanese, Chinese
  • Wiktionary: English, French

Adding support for other languages and Wikimedia projects is straightforward.


Is it fast?

Surprisingly, yes! The full English Wikipedia dump (~500 GB uncompressed) processes in about 5 hours on a 16-core machine with 32 GB RAM. The pipeline uses Python multiprocessing, DuckDB for deduplication, and indexed gzip seeking so it never needs to fully extract the archive to disk.
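
To illustrate the "never extract to disk" idea, here is a minimal standard-library sketch. It is not the tool's actual code. It assumes each archive member is a newline-delimited JSON file whose records carry hypothetical `name` and `html` fields, reads members straight out of the .tar.gz, and uses a plain dict as a stand-in for the DuckDB deduplication step (keeping the first record per name is also an assumption). Indexed gzip seeking and multiprocessing are left out.

```python
import json
import tarfile


def iter_records(archive_path):
    """Yield one JSON record per non-empty line of every file in a .tar.gz.

    Members are read directly from the compressed archive, so nothing is
    extracted to disk.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file members
            member_file = archive.extractfile(member)
            for raw_line in member_file:
                raw_line = raw_line.strip()
                if raw_line:
                    yield json.loads(raw_line)


def deduplicate(records, key="name"):
    """Keep the first record seen for each key (stand-in for a DuckDB pass)."""
    unique = {}
    for record in records:
        unique.setdefault(record[key], record)
    return list(unique.values())
```

A call such as `deduplicate(iter_records("/path/to/dumps/some_dump.tar.gz"))` would then return one record per title. The in-memory dict only works for small inputs, which is presumably why a real pipeline at Wikipedia scale hands this step to a database.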


Basic usage

# --proj: wiki | wiktionary
# --lang: en | fr
python main.py \
  --proj wiki \
  --lang en \
  --input-dir  /path/to/dumps \
  --output-dir /path/to/output

There are also optional flags to control the number of CPU cores used, RAM buffer size, and pruning depth (whether to keep sections like “derived terms” and “translations” in Wiktionary entries).


One important note

Because the resulting MDict .txt file is very large, I’d recommend using my multithreaded fork of mdict-utils to pack it:
:link: GitHub: leanhdung1994/mdict-utils (MDict pack/unpack/list/info tool)
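
For anyone unfamiliar with what gets packed, here is a small sketch of writing entries in the usual MDict source layout: a headword line, the definition HTML, then a line containing only `</>`. The function name and the choice to collapse each definition onto a single line are assumptions for illustration, not something taken from Wikimedia-MDX.

```python
def write_mdict_source(entries, output_path):
    """Write (headword, html) pairs as MDict source text.

    Each entry becomes three lines: headword, definition HTML, "</>".
    """
    with open(output_path, "w", encoding="utf-8") as out:
        for headword, html in entries:
            # Collapse the HTML onto one line so every entry is exactly
            # three lines long (a simplification for this sketch).
            one_line_html = " ".join(html.splitlines())
            out.write(f"{headword}\n{one_line_html}\n</>\n")
```

With millions of entries, a file written this way easily reaches many gigabytes, which is where a multithreaded packer pays off.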

