Help needed: extracting MDX entries and converting them to HTML/PDF

hkreporter · November 1, 2020, 23:01

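Basic usage:

MdxConverter 某某词典.mdx input.xlsx output.pdf

It worked after I changed the txt file's encoding to ANSI.
Test files:
2.txt (14 bytes) BtD.css (5.7 KB) 1.mdx (1.5 MB)

If you want to run the .py script directly, install the modules it needs beforehand.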

import mdict_query
import openpyxl
import json
import argparse
import os
import re
import pdfkit
from bs4 import BeautifulSoup
import sys
from enum import IntEnum
from collections import OrderedDict
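A quick way to see which of these modules still need installing is to ask Python whether it can locate each one. The sketch below is only an illustration: the list contains the non-standard-library names from the imports above, and the pip package name may differ from the import name (bs4, for example, is normally installed as beautifulsoup4).

```python
import importlib.util

# Non-stdlib modules taken from the import list above. These are import
# names, which are not always the same as the pip package names.
REQUIRED_MODULES = ["mdict_query", "openpyxl", "pdfkit", "bs4"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Anything reported as missing has to be installed, or copied next to MdxConverter.py, before the script will start.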

https://www.pdawiki.com/forum/thread-38923-1-1.html

I am trying out this tool. GitHub address:

I tested it with both txt and xlsx input, and both failed:
F:\Programs\Mdx\Converter
λ MdxConverter.exe 2.mdx 2.txt 2.pdf
Lesson 1
meanings
Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 221, in mdx2pdf
  File "MdxConverter.py", line 206, in mdx2html
  File "MdxConverter.py", line 91, in merge_css
  File "MdxConverter.py", line 87, in get_css
AttributeError: 'str' object has no attribute 'decode'
[26348] Failed to execute script MdxConverter
How can I fix this error? And if it cannot be fixed, is there another tool that works?

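Extract specified entries from an mdx and repack them as an mdx
https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=35433
(Source: 掌上百科 - PDAWIKI)
This tool works, but it can only extract to txt, so the output still has to be converted to a web page or a PDF.

My current workaround is to extract the HTML with Mdict Editor Tool and then convert it to PDF with wkhtmltopdf.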
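The second half of that workaround can be scripted. This is a minimal sketch, assuming wkhtmltopdf is on PATH and using its basic "input.html output.pdf" invocation; the file names are hypothetical and the helper names are made up for illustration.

```python
import subprocess


def build_command(html_path, pdf_path, exe="wkhtmltopdf"):
    """Build the argument list for: wkhtmltopdf <input.html> <output.pdf>."""
    return [exe, html_path, pdf_path]


def html_to_pdf(html_path, pdf_path, exe="wkhtmltopdf"):
    """Run wkhtmltopdf and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(html_path, pdf_path, exe), check=True)


if __name__ == "__main__":
    # Hypothetical file names; the command is only printed here, not run.
    print(build_command("entries.html", "entries.pdf"))
```

Passing the full path to wkhtmltopdf.exe as exe avoids depending on PATH at all.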

Mdict Editor Tool v2.0.35 – a multi-purpose tool for building personalized dictionaries
https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=18986
(Source: 掌上百科 - PDAWIKI)

Link: https://pan.baidu.com/s/1qYO3aA4
Password: 9xs8

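Vim · November 2, 2020, 01:10

This one is still the better tool; you just used the wrong command. I suggest working through the author's example exactly as written first, and only then moving on to your own use case.

noword/MdxConverter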

hkreporter · November 2, 2020, 01:17

MdxConverter 某某词典.mdx input.xlsx output.pdf

F:\Programs\Mdx\Converter
λ MdxConverter.exe 2.mdx 2.txt 2.pdf
Lesson 1
meanings
Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 221, in mdx2pdf
  File "MdxConverter.py", line 206, in mdx2html
  File "MdxConverter.py", line 91, in merge_css
  File "MdxConverter.py", line 87, in get_css
AttributeError: 'str' object has no attribute 'decode'
[26348] Failed to execute script MdxConverter

This time my command is correct and it still fails.

hkreporter · November 2, 2020, 01:25

2.txt (19 bytes) 2.mdx (34.3 KB)
Please try it with these files.

hua · November 2, 2020, 01:33

The program itself probably has a bug.

Vim · November 2, 2020, 01:49

What is MdxConverter.exe?

hkreporter · November 2, 2020, 01:50

It is an exe packaged from the Python script.

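jns · November 2, 2020, 02:03

Hi, I tested it. MdxConverter looks for ling.css; there is no external copy and it is not inside the mdd either, so the lookup returns an empty string, and that causes the error. After I changed line 85 of MdxConverter.py from css = '' to css = b'', it no longer fails in my test.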
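The reason that one-character change helps is that in Python 3 only bytes objects have a decode method; a str does not. The sketch below reproduces the pattern with a hypothetical load_css helper and a stand-in lookup function. It is not MdxConverter's actual code, only an illustration of the str versus bytes default.

```python
def load_css(lookup, name, default):
    """Hypothetical stand-in for the CSS lookup described above.

    `lookup` returns the stylesheet as bytes, or None when it is not found.
    """
    css = default
    found = lookup(name)
    if found:
        css = found
    # bytes has .decode(); str does not, which is what raised the error.
    return css.decode("utf-8")


if __name__ == "__main__":
    def not_found(name):
        return None

    # bytes default: a missing stylesheet simply yields an empty string.
    print(repr(load_css(not_found, "ling.css", b"")))

    # str default: the same missing stylesheet raises AttributeError.
    try:
        load_css(not_found, "ling.css", "")
    except AttributeError as exc:
        print("AttributeError:", exc)
```

With the bytes default the missing ling.css is treated as an empty stylesheet instead of crashing the conversion.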

hkreporter · November 2, 2020, 02:09

Thanks a lot!
After I added the css, it still reports an error, but at least it generated a temp.html, which contains the extracted entry "meanings".
ling.css (198 bytes)

I have now tested another file and got a different error:
F:\Programs\Mdx\Converter
λ MdxConverter.exe 1.mdx 2.txt 2.html
Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 150, in mdx2html
  File "MdxConverter.py", line 43, in get_words
  File "MdxConverter.py", line 52, in get_words_from_txt
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
[54672] Failed to execute script MdxConverter

1.mdx (1.5 MB)
How can I fix this gbk decoding problem?

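qujinzhi · November 2, 2020, 02:22

The tool from 掌上百科 (PDAWIKI) works fine: after extracting to txt, change the file extension to .html and use it together with the original CSS. The file is very large, though, and a browser can barely open it.

jns · November 2, 2020, 02:43

I tested with 1.mdx and had no problems.

hkreporter · November 2, 2020, 02:51

Still the same error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
[76240] Failed to execute script MdxConverter

jns · November 2, 2020, 03:03

I cannot reproduce the problem. Judging from the traceback, my guess is that it is an encoding issue with the txt file.

hkreporter · November 2, 2020, 03:12

The txt is definitely encoded as UTF-8. Also, when I test the same txt against two mdx files, only the one containing Chinese produces the gbk error, so the problem should be in the mdx.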

jns · November 2, 2020, 03:21

Try changing the txt file's encoding to ANSI, or change line 52 of MdxConverter.py from for line in open(name).readlines(): to for line in open(name, encoding='utf-8').readlines():
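The idea behind that one-line change is that open() without an encoding argument falls back to the system's locale encoding, which is typically gbk on a Simplified Chinese Windows install. Below is a minimal sketch of a word-list reader with an explicit encoding. The read_words name is hypothetical, and the use of utf-8-sig instead of utf-8 is my assumption: byte 0xbf at position 2 would be consistent with a UTF-8 byte order mark (EF BB BF) at the start of the file, and utf-8-sig skips that mark if it is present.

```python
import os
import tempfile


def read_words(path, encoding="utf-8-sig"):
    """Read one headword per line, skipping blank lines.

    utf-8-sig decodes UTF-8 and drops a leading byte order mark if present.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    # Build a small UTF-8 file with a BOM, similar to what some Windows
    # editors write, then read it back.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfLesson 1\r\nmeanings\r\n")
    print(read_words(path))
    os.remove(path)
```

Saving the file as ANSI works for the opposite reason: the file then matches the locale encoding that the unmodified open(name) call expects.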

hkreporter · November 2, 2020, 03:30

Thank you very much! Changing it to ANSI worked.

hkreporter · November 2, 2020, 03:33

Trying to output a PDF directly still fails:
F:\Programs\Mdx\Converter
λ MdxConverter.exe 1.mdx 2.txt 3.pdf
Lesson 1
arm
Traceback (most recent call last):
  File "site-packages\pdfkit\configuration.py", line 21, in __init__
OSError: [Errno 22] Invalid argument: b'F:\Programs\Mdx\Converter\wkhtmltopdf.exe\r\nF:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 222, in mdx2pdf
  File "site-packages\pdfkit\api.py", line 47, in from_file
  File "site-packages\pdfkit\pdfkit.py", line 42, in __init__
  File "site-packages\pdfkit\configuration.py", line 27, in __init__
OSError: No wkhtmltopdf executable found: "b'F:\Programs\Mdx\Converter\wkhtmltopdf.exe\r\nF:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - Installing wkhtmltopdf · JazzCore/python-pdfkit Wiki · GitHub
[94796] Failed to execute script MdxConverter

wkhtmltopdf is installed, but this process cannot read the wkhtmltopdf.exe file.

hkreporter · November 2, 2020, 04:56

I changed that line to for line in open(name, encoding='utf-8').readlines():
Running it again gives:
F:\Programs\Mdx\Converter
λ MdxConverter.py 1.mdx 2.txt 4.html
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 3, in <module>
    import mdict_query
ModuleNotFoundError: No module named 'mdict_query'

F:\Programs\Mdx\Converter
λ MdxConverter.py 1.mdx 2.txt 4.html
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 3, in <module>
    import mdict_query
  File "F:\Programs\Mdx\Converter\MdxConverter-master\mdict_query.py", line 13, in <module>
    from .readmdict import MDD, MDX
ImportError: attempted relative import with no known parent package

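jns · November 2, 2020, 04:59

I traced the problem to line 14 of C:\Users\jiang\AppData\Roaming\Python\Python37\site-packages\pdfkit\configuration.py. The cause is that the environment variables contain more than one path to wkhtmltopdf.exe. See this pull request: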

github.com/JazzCore/python-pdfkit
Fix configurator when where returns multiple lines
JazzCore:master ← rasa:patch-1
opened 01:19AM - 02 Jun 19 UTC · rasa · +4 -1

Fixes:
```
18:13:46.171 CRITI Uncaught exception:
18:13:46.176 CRITI Traceback (most recent call last):
18:13:46.177 CRITI File "C:\Python3\lib\site-packages\pdfkit\configuration.py", line 22, in __init__
18:13:46.178 CRITI with open(self.wkhtmltopdf) as f:
18:13:46.179 CRITI OSError: [Errno 22] Invalid argument: b'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe\r\nC:\\ProgramData\\chocolatey\\bin\\wkhtmltopdf.exe'
```

The fix is to upgrade the old pdfkit release that pip installs by default to the newer version on GitHub.
Run the following command:
pip install --upgrade git+https://github.com/JazzCore/python-pdfkit.git
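To illustrate what goes wrong, the failing value in the traceback is two paths joined by \r\n and treated as a single file name. The sketch below shows one way to pick a single path out of such output and hand it to pdfkit explicitly, which is an alternative when running the .py script rather than the packaged exe. The helper names first_path and html_to_pdf are hypothetical; pdfkit.configuration(wkhtmltopdf=...) and pdfkit.from_file(..., configuration=...) are the library's documented entry points.

```python
def first_path(where_output):
    """Return the first non-empty line of `where`-style output as text."""
    text = where_output.decode("utf-8", errors="replace")
    paths = [line.strip() for line in text.splitlines() if line.strip()]
    if not paths:
        raise ValueError("no wkhtmltopdf path found")
    return paths[0]


def html_to_pdf(html_path, pdf_path, wkhtmltopdf_path):
    """Convert with pdfkit, passing the executable path explicitly."""
    import pdfkit  # imported lazily so first_path() works without pdfkit

    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    pdfkit.from_file(html_path, pdf_path, configuration=config)


if __name__ == "__main__":
    # The two-line value from the traceback above, as bytes.
    raw = (b"F:\\Programs\\Mdx\\Converter\\wkhtmltopdf.exe\r\n"
           b"F:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
    print(first_path(raw))
```

Removing the duplicate wkhtmltopdf entry from PATH should have the same effect, since the lookup would then return only one line.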

jns · November 2, 2020, 05:03

Download mdict_query:
git clone https://github.com/mmjang/mdict-query
(mmjang/mdict-query: A python module for looking up mdict dictionary file (.mdx and .mdd).)
Then copy the files inside it into the MdxConverter folder.