MdxScraper: Extract content from MDX dictionaries and output it as PDF, HTML, or JPG

Vim · January 13, 2024, 10:58

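I often need to batch-export and print the lookup results for a specific set of entries, so I reworked and extended MdxConverter.

Download:

Downloads are under Releases on the right side of the page: GitHub - VimWei/MdxScraper: Extract specific words from an MDX dictionary and generate HTML, PDF, or JPG files with ease.

Improvements: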

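Improved dictionary compatibility:
- Bundles and upgrades mdict-query, with support for dictionaries that have multiple mdd files.
- Works with dictionaries with or without a CSS file.
- Handles the various ways img tags are written in HTML.
- Supports common image formats such as png, jpg, jpeg, and gif.
- Handles pages that reference the same image more than once, such as pronunciation icons.
Improved cross-platform compatibility:
- File paths: accepts the various ways they are written across platforms.
- wkhtmltopdf install directory: handles the various cases across platforms.
Refactored to be more convenient, easier to use, more robust, and friendlier:
- Uses a configuration file instead of command-line arguments; combined with conda, output takes a single step. More convenient.
- Richer configuration options, including input and output files, dictionary file, PDF layout, and CSS. More powerful.
- Output now reports program status, lookup statistics, output location, and elapsed time. A friendlier experience.
- Backs up the original word list and stores it alongside the output files for easy archiving and reference. Keeps data safe.
- Adds a timestamp to the output file name so that all output files are easy to archive and find. Easier file management.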
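
As a rough illustration of two of the dictionary-compatibility items above (img tags written in different ways, and one image referenced many times on a page), here is a minimal sketch. It is not MdxScraper's actual code: `embed_images`, `load_resource`, and the regular expression are hypothetical names introduced for illustration, and the sketch assumes images are inlined as base64 data URIs.

```python
import base64
import os
import re

# Matches src="x", src='x', and unquoted src=x inside <img ...> tags.
IMG_SRC = re.compile(
    r"""(<img\b[^>]*?\bsrc\s*=\s*)(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)

MIME_TYPES = {".png": "image/png", ".jpg": "image/jpeg",
              ".jpeg": "image/jpeg", ".gif": "image/gif"}


def embed_images(html, load_resource):
    """Replace each img src with a base64 data URI.

    load_resource is a caller-supplied function (an assumption of this sketch)
    that returns the image bytes for a given src, or None if it is missing.
    """
    cache = {}  # src -> replacement, so a repeated icon is loaded and encoded once

    def replace(match):
        prefix = match.group(1)
        # Exactly one of the three alternatives matched.
        src = next(g for g in match.group(2, 3, 4) if g is not None)
        if src not in cache:
            mime = MIME_TYPES.get(os.path.splitext(src)[1].lower())
            data = load_resource(src) if mime else None
            if data is None:
                cache[src] = src  # leave unknown or missing images untouched
            else:
                encoded = base64.b64encode(data).decode("ascii")
                cache[src] = f"data:{mime};base64,{encoded}"
        return f'{prefix}"{cache[src]}"'

    return IMG_SRC.sub(replace, html)
```

In a real run, `load_resource` would presumably wrap a lookup into the dictionary's mdd data; the cache is what keeps a pronunciation icon that appears many times from being fetched and encoded repeatedly.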

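last_idol · January 13, 2024, 11:16

Multi-mdd support actually lives in the mdict-query.py file, and getting it right means working through the original author's logic from scratch. I'd still suggest simply merging the mdd files; that's easier.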

Vim · January 14, 2024, 11:32

Update: Release MdxScraper v1.1 · VimWei/MdxScraper · GitHub

Enhancing Compatibility for Windows, Linux, and Mac.
Implement ‘utf-8’ encoding for file handling to enhance compatibility.
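
The encoding change matters because, on most current Python versions, `open()` without an `encoding` argument falls back to a platform-dependent default (often a legacy code page on Windows), so the same word list can read differently from one system to another. A minimal sketch of the idea follows; `read_words`, `write_words`, and the one-entry-per-line layout are assumptions for illustration, not MdxScraper's actual code.

```python
from pathlib import Path


def read_words(path):
    """Read one entry per line, ignoring blank lines (assumed file layout)."""
    # Passing encoding explicitly avoids relying on the platform default.
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_words(path, words):
    """Write the entries back out with the same explicit encoding."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
```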

demo · February 28, 2024, 12:18

I'm learning Python, and along the way I changed mdict-query to support lookups across multiple mdd files.
Here is the test code:

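from pathlib import Path

import mdict_query  # the modified, multi-mdd version attached below

def multi_mdd_test():
    mdx_name = Path('說文解字.mdx')
    dictionary = mdict_query.IndexBuilder(mdx_name)
    # Raw string, so that '\C' is not read as an escape sequence.
    resource_key = dictionary.get_mdd_keys(r'\C0001*.png')[0]
    # mdd_lookup returns a list of matches; take the first one.
    resource = dictionary.mdd_lookup(resource_key)[0]
    print(resource_key)
    print(resource)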

The modified file:
mdict_query.zip (3.5 KB)
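
For readers wondering what "multi-mdd" means in practice: large dictionaries often split their resources across several files next to the .mdx, commonly named along the lines of `name.mdd`, `name.1.mdd`, `name.2.mdd`. The sketch below only shows how such companion files could be discovered. `find_mdd_files` is a hypothetical helper, and the naming pattern is an assumption, not necessarily what the attached mdict_query does.

```python
import re
from pathlib import Path


def find_mdd_files(mdx_path):
    """Return the .mdd files that sit next to an .mdx file.

    Assumes the naming pattern name.mdd, name.1.mdd, name.2.mdd, ...
    """
    mdx_path = Path(mdx_path)
    stem = re.escape(mdx_path.stem)
    pattern = re.compile(rf"^{stem}(?:\.(\d+))?\.mdd$", re.IGNORECASE)

    found = []
    for candidate in mdx_path.parent.iterdir():
        match = pattern.match(candidate.name)
        if match and candidate.is_file():
            # The unnumbered file comes first, then the rest in numeric order.
            order = int(match.group(1)) if match.group(1) else 0
            found.append((order, candidate))
    return [path for _, path in sorted(found, key=lambda item: item[0])]
```

Sorting on the parsed number rather than on the file name keeps `name.10.mdd` after `name.2.mdd`.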

Vim · February 29, 2024, 06:25

Thanks for adding such a major feature! I've released a new version, v2.0, based on it.

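Vim · March 3, 2024, 18:50

Refactored the program and improved the overall experience; now at v3.4.
- Uses a configuration file instead of command-line arguments; combined with conda, output takes a single step. More convenient.
- Richer configuration options, including input and output files, dictionary file, PDF layout, and CSS. More powerful.
- Output now reports program status, lookup statistics, output location, and elapsed time. A friendlier experience.
- Backs up the original word list and stores it alongside the output files for easy archiving and reference. Keeps data safe.
- Adds a timestamp to the output file name so that all output files are easy to archive and find. Easier file management.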
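
The timestamp, backup, and reporting items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual v3.4 code: `timestamped_name`, `run_with_backup`, the `convert` callback, and the timestamp format are all hypothetical.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path


def timestamped_name(output_file, now=None):
    """Prefix the output file name with a timestamp (format is an assumption)."""
    now = now or datetime.now()
    output_file = Path(output_file)
    return output_file.with_name(f"{now:%Y%m%d-%H%M%S}_{output_file.name}")


def run_with_backup(input_file, output_file, convert):
    """Run convert(), keep a copy of the word list beside the output, report stats.

    convert is a caller-supplied function (hypothetical here) that writes the
    output file and returns a (found, not_found) pair of counts.
    """
    started = time.perf_counter()
    target = timestamped_name(output_file)
    target.parent.mkdir(parents=True, exist_ok=True)

    found, not_found = convert(Path(input_file), target)

    # Back up the original word list next to the output, sharing its timestamp.
    backup = target.with_name(f"{target.stem}_backup{Path(input_file).suffix}")
    shutil.copy2(input_file, backup)

    elapsed = time.perf_counter() - started
    print(f"Found {found}, not found {not_found}")
    print(f"Output: {target}")
    print(f"Elapsed: {elapsed:.2f} s")
    return target, backup
```

Because the backup shares the output's timestamp prefix, each run's word list and result sort next to each other in the output folder.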

hlmswift · July 18, 2024, 01:37

The barrier to entry is a bit high for a beginner. On Windows, isn't there an exe, msi, or other one-click installer available?

moran · October 24, 2024, 16:23

Running the main program gives the error below. How do I deal with it? Thanks!
Traceback (most recent call last):
  File "Desktop/MdxScraper3.6/MdxScraper.py", line 371, in <module>
    found, not_found = {
  File "Desktop/MdxScraper3.6/MdxScraper.py", line 209, in mdx2html
    right_soup = BeautifulSoup('

    ', 'lxml')
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/bs4/__init__.py", line 250, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

moran · October 24, 2024, 16:50

Found the problem: lxml wasn't installed. After installing it, everything runs fine. Thanks!
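
For anyone hitting the same `FeatureNotFound` error, the usual fix is the one above (`pip install lxml`). As a hedged alternative, a script can check whether lxml is importable and otherwise fall back to Python's built-in `html.parser`, which BeautifulSoup also accepts. `pick_parser` below is a hypothetical helper, not part of MdxScraper, and the two parsers may build slightly different trees from the same markup.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return the preferred parser name if its package is importable, else the fallback."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback
```

The result would then be passed as the second argument, for example `BeautifulSoup(markup, pick_parser())`.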

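xiphiroc · May 17, 2025, 06:54

Could someone upload a zip of the latest version to this site? GitHub often takes too long to respond for me and the page never loads completely. Thanks!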

zhu1234 · May 17, 2025, 09:26

Traceback (most recent call last):
  File "C:\MdxScraper-3.6\MdxScraper.py", line 371, in <module>
    found, not_found = {
                       ~
    ...<2 lines>...
        'jpg': mdx2jpg,
        ~~~~~~~~~~~~~~~
    }[output_type](mdx_file, input_file, output_file, INVALID_ACTION)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\MdxScraper-3.6\MdxScraper.py", line 206, in mdx2html
    dictionary = mdict_query.IndexBuilder(mdx_file)
  File "C:\MdxScraper-3.6\lib\mdict-query\mdict_query.py", line 41, in __init__
    assert(os.path.isfile(fname))
           ~~~~~~~~~~~~~~^^^^^^^
AssertionError

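OP and experts: I'm getting the error above. How do I fix it?
Also, this one would not install: pip install base64
So I installed this instead: pip install base
Could that be what's causing it?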
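
Judging only from the traceback above, the failing line is `assert(os.path.isfile(fname))`, which fails when the mdx path handed to `IndexBuilder` is not an existing file; a wrong path or file name in the configuration would be one possible cause. Separately, `base64` is part of Python's standard library, so it needs no `pip install`. Below is a small hypothetical pre-check (`check_mdx_path` is not part of MdxScraper) that would turn the bare assertion into a readable message.

```python
from pathlib import Path


def check_mdx_path(mdx_file):
    """Raise a descriptive error if the dictionary path is not an existing file."""
    path = Path(mdx_file).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"MDX file not found: {path.resolve()} "
            f"(current directory: {Path.cwd()})"
        )
    return path
```

Printing the resolved path and the current directory makes it easier to spot a relative path that is being resolved from an unexpected folder.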