Help wanted: extracting MDX entries and converting them to HTML/PDF

Basic usage:

MdxConverter some_dictionary.mdx input.xlsx output.pdf

It worked after I changed the encoding of the txt file to ANSI.
Test files:
2.txt (14 bytes) BtD.css (5.7 KB) 1.mdx (1.5 MB)

If you want to run the .py script directly, install the modules it needs beforehand.

import mdict_query
import openpyxl
import json
import argparse
import os
import re
import pdfkit
from bs4 import BeautifulSoup
import sys
from enum import IntEnum
from collections import OrderedDict
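
To see which of these modules still need installing before you run the script, a small check like the one below may help. It is only a sketch: the helper name missing_modules and the THIRD_PARTY list are my own, and the list simply repeats the non-standard-library names from the import lines above.

```python
import importlib.util

# Third-party import names copied from the list above
# (json, argparse, os, re, sys, enum and collections ship with Python).
THIRD_PARTY = ["mdict_query", "openpyxl", "pdfkit", "bs4"]


def missing_modules(names):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules(THIRD_PARTY):
        print("missing:", name)
```

Note that bs4 is installed with pip install beautifulsoup4, and that mdict_query is the one fetched from GitHub later in this thread rather than installed with pip.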


I'm trying out this tool. GitHub address:

I tested it with both txt and xlsx input, and both failed:
F:\Programs\Mdx\Converter
λ MdxConverter.exe 2.mdx 2.txt 2.pdf
Lesson 1
meanings
Traceback (most recent call last):
File "MdxConverter.py", line 258, in <module>
File "MdxConverter.py", line 221, in mdx2pdf
File "MdxConverter.py", line 206, in mdx2html
File "MdxConverter.py", line 91, in merge_css
File "MdxConverter.py", line 87, in get_css
AttributeError: 'str' object has no attribute 'decode'
[26348] Failed to execute script MdxConverter
How can I fix this error? And if it can't be fixed, is there another tool that works?

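Extract specified entries from an mdx and pack them into an mdx


(Source: 掌上百科 - PDAWIKI)
This tool works, but it can only extract to txt; the result still has to be converted to a web page or PDF.

My current workaround is to extract HTML with Mdict Editor Tool and then convert it to PDF with wkhtmltopdf.

Mdict Editor Tool v2.0.35 – a multi-purpose tool for building customized dictionaries


(Source: 掌上百科 - PDAWIKI)

Link: https://pan.baidu.com/s/1qYO3aA4
Password: 9xs8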

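This one is still the best option; you just used the wrong command. I suggest following the author's example step by step first, and only then moving on to your own use case.

noword/MdxConverter

MdxConverter some_dictionary.mdx input.xlsx output.pdf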

F:\Programs\Mdx\Converter
λ MdxConverter.exe 2.mdx 2.txt 2.pdf
Lesson 1
meanings
Traceback (most recent call last):
File "MdxConverter.py", line 258, in <module>
File "MdxConverter.py", line 221, in mdx2pdf
File "MdxConverter.py", line 206, in mdx2html
File "MdxConverter.py", line 91, in merge_css
File "MdxConverter.py", line 87, in get_css
AttributeError: 'str' object has no attribute 'decode'
[26348] Failed to execute script MdxConverter

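This time my command is correct and it still fails.

2.txt (19 bytes) 2.mdx (34.3 KB)
Please test with these.

The program itself probably has a bug.

What is MdxConverter.exe?

It's an exe packaged from the Python script.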

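Hi, I tested it. MdxConverter looks for ling.css; there is no external copy and it isn't inside the mdd either, so an empty string is returned, and that is what causes the error. After I changed line 85 of MdxConverter.py from css = '' to css = b'', the error no longer appeared in my test.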
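
To illustrate why that one-character change matters: in Python 3 a str has no .decode() method, but bytes does, so the fallback value has to be bytes if the code decodes it afterwards. The function below is a hypothetical stand-in, not the actual MdxConverter source; the lookup parameter is my own invention to keep the example self-contained.

```python
def get_css(lookup, name):
    """Hypothetical stand-in for the tool's CSS lookup.

    `lookup` is any callable that returns the stylesheet as bytes, or None
    when the file is neither on disk nor inside the .mdd.
    """
    css = b""  # bytes default, so the .decode() below is always valid
    data = lookup(name)
    if data:
        css = data
    return css.decode("utf-8")


# With a str default (css = '') the return line would raise
# AttributeError: 'str' object has no attribute 'decode' on Python 3.
print(repr(get_css(lambda name: None, "ling.css")))
```

Supplying a real ling.css next to the mdx, as done in the next post, avoids the fallback path altogether.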

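Thanks a lot!
After I added the css it still reports an error, but at least it produced a temp.html, which contains the extracted entry "meanings".
ling.css (198 bytes)

Now I've tested another file and hit a different error: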
F:\Programs\Mdx\Converter
λ MdxConverter.exe 1.mdx 2.txt 2.html
Traceback (most recent call last):
File "MdxConverter.py", line 258, in <module>
File "MdxConverter.py", line 150, in mdx2html
File "MdxConverter.py", line 43, in get_words
File "MdxConverter.py", line 52, in get_words_from_txt
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
[54672] Failed to execute script MdxConverter

1.mdx (1.5 MB)
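How can I fix this gbk decoding problem?

The tool from 掌上百科 (PDAWIKI) works fine: after extracting to txt, change the file extension to .html and use it together with the original CSS. The file is very large, though, and a browser can barely open it.

I tested with 1.mdx and had no problem.

Still the same:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
[76240] Failed to execute script MdxConverter

I can't reproduce the problem. Judging from the traceback, my guess is that it's the encoding of the txt file.

The txt is definitely UTF-8. Also, when I test the same txt against two mdx files, only the one containing Chinese gives the gbk error, so the problem should be in the mdx.

Try changing the encoding of the txt to ANSI, or change line 52 of MdxConverter.py from for line in open(name).readlines(): to for line in open(name, encoding='utf-8').readlines():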
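
As a sketch of the second suggestion: without an encoding argument, open() uses the system default, which is gbk on a Chinese Windows install, so a UTF-8 file can fail exactly as in the traceback. The byte 0xbf at position 2 might be the last byte of a UTF-8 byte-order mark (EF BB BF), but that is only a guess from the error message. The helper below is my own, not the tool's get_words_from_txt; it uses utf-8-sig, which reads plain UTF-8 and also drops a BOM if one is present.

```python
def read_words(path, encoding="utf-8-sig"):
    """Read one headword per line, skipping blank lines.

    utf-8-sig decodes ordinary UTF-8 and silently removes a leading BOM.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]
```

For example, read_words("2.txt") would return the list of headwords regardless of whether the editor saved the file with or without a BOM, as long as it is UTF-8.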

Thank you very much! Changing it to ANSI worked.

Trying to output a PDF directly still fails:
F:\Programs\Mdx\Converter
λ MdxConverter.exe 1.mdx 2.txt 3.pdf
Lesson 1
arm
Traceback (most recent call last):
File "site-packages\pdfkit\configuration.py", line 21, in __init__
OSError: [Errno 22] Invalid argument: b'F:\Programs\Mdx\Converter\wkhtmltopdf.exe\r\nF:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "MdxConverter.py", line 258, in <module>
File "MdxConverter.py", line 222, in mdx2pdf
File "site-packages\pdfkit\api.py", line 47, in from_file
File "site-packages\pdfkit\pdfkit.py", line 42, in __init__
File "site-packages\pdfkit\configuration.py", line 27, in __init__
OSError: No wkhtmltopdf executable found: "b'F:\Programs\Mdx\Converter\wkhtmltopdf.exe\r\nF:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
[94796] Failed to execute script MdxConverter

wkhtmltopdf is installed, but this process can't read the wkhtmltopdf.exe file.

I changed this line to for line in open(name, encoding='utf-8').readlines():
and ran it again, which gave:
F:\Programs\Mdx\Converter
λ MdxConverter.py 1.mdx 2.txt 4.html
Traceback (most recent call last):
File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 3, in <module>
import mdict_query
ModuleNotFoundError: No module named 'mdict_query'

F:\Programs\Mdx\Converter
λ MdxConverter.py 1.mdx 2.txt 4.html
Traceback (most recent call last):
File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 3, in <module>
import mdict_query
File "F:\Programs\Mdx\Converter\MdxConverter-master\mdict_query.py", line 13, in <module>
from .readmdict import MDD, MDX
ImportError: attempted relative import with no known parent package

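I traced the problem to line 14 of C:\Users\jiang\AppData\Roaming\Python\Python37\site-packages\pdfkit\configuration.py. The cause is that the environment variables contain more than one path to wkhtmltopdf.exe. The issue is described at:


The fix is to upgrade the old pdfkit version that pip installs by default to the newer version on GitHub.
Run the following command: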
pip install --upgrade git+https://github.com/JazzCore/python-pdfkit.git
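
If upgrading is not possible, one workaround that might also do the job is to hand pdfkit a single executable path yourself, so it never has to search the environment. This is a sketch under that assumption: first_path and make_pdf are hypothetical helper names of mine, while pdfkit.configuration(wkhtmltopdf=...) and pdfkit.from_file(..., configuration=...) are the calls documented in the pdfkit README.

```python
def first_path(raw):
    """Return the first non-empty line of a multi-line path listing."""
    text = raw.decode() if isinstance(raw, bytes) else raw
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def make_pdf(html_file, pdf_file, raw_paths):
    """Convert html_file to pdf_file using the first wkhtmltopdf path given."""
    import pdfkit  # imported lazily so first_path() works without pdfkit

    config = pdfkit.configuration(wkhtmltopdf=first_path(raw_paths))
    pdfkit.from_file(html_file, pdf_file, configuration=config)


# The two-line value from the traceback above, reduced to its first entry.
found = (b"F:\\Programs\\Mdx\\Converter\\wkhtmltopdf.exe\r\n"
         b"F:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
print(first_path(found))
```

This only helps when you run the .py script yourself; the packaged MdxConverter.exe builds its own pdfkit call.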

Download mdict_query:
git clone https://github.com/mmjang/mdict-query
Then copy the files inside it into the MdxConverter folder.