Help request: extracting mdx entries and converting them to HTML/PDF

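hkreporter · November 2, 2020, 05:16

Thank you so much!
I kept retrying and installed whatever it said was missing, one package after another up to bs4 and lxml, and finally got it working...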

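hkreporter · November 2, 2020, 06:25

The problem of multiple wkhtmltopdf entries was resolved once I removed the wkhtmltopdf install path from the user environment variables. After installing it I had assumed the path had not been added, so I added it manually.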
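A quick way to check for this kind of duplication is to list every copy of the executable that PATH can see. The following is only a minimal sketch: `find_all_on_path` is a hypothetical helper name, and the two candidate file names are assumptions about how the binary is named on the machine.

```python
import os


def find_all_on_path(names=("wkhtmltopdf", "wkhtmltopdf.exe"), path_value=None):
    """Return every matching file reachable through PATH, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for name in names:
            candidate = os.path.join(directory, name)
            # Normalise so the same file listed twice on PATH is reported once.
            key = os.path.normcase(os.path.abspath(candidate))
            if os.path.isfile(candidate) and key not in seen:
                seen.add(key)
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    for hit in find_all_on_path():
        print(hit)
```

More than one line of output suggests that several installations, or several PATH entries, are competing.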

hkreporter · November 2, 2020, 12:22

λ MdxConverter.py 1.mdx 2.txt 3.html
Lesson 1
agency
WARNING: "agency" not found
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 255, in <module>
    {
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 206, in mdx2html
    right_soup = merge_css(right_soup, os.path.split(mdx_name)[0], dictionary, with_toc)
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 91, in merge_css
    css = get_css(soup, mdx_path, dictionary)
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 77, in get_css
    css_name = soup.head.link["href"]
AttributeError: 'NoneType' object has no attribute 'link'

Another error, this time about link.
How should I change the code to fix it?

jns · November 2, 2020, 12:44

1.mdx does not contain the headword "agency":
WARNING: "agency" not found

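hkreporter · November 2, 2020, 12:57

I get the same result after switching to a headword that does exist.
If I wanted to remove this link lookup, how should I change the code?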
def get_css(soup, mdx_path, dictionary):
    css_name = soup.head.link["href"]
    css_path = os.path.join(mdx_path, css_name)
    if os.path.exists(css_path):
        css = open(css_path, 'rb').read()
    elif hasattr(dictionary, '_mdd_db'):
        css_key = dictionary.get_mdd_keys('*' + css_name)[0]
        css = dictionary.mdd_lookup(css_key)[0]
    else:
        css = ''
    return css.decode('utf-8')
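One way to keep the function from crashing when the parsed page has no `<head>` or no stylesheet `<link>` is to check for both before indexing. This is a minimal sketch rather than the project's actual fix: `get_css_safe` is a hypothetical name, returning an empty string when nothing is found is an assumption, and it relies only on the `get_mdd_keys` and `mdd_lookup` calls visible in the quoted code.

```python
import os


def get_css_safe(soup, mdx_path, dictionary):
    """Like get_css, but return '' instead of raising when no stylesheet is found."""
    head = getattr(soup, "head", None)
    link = getattr(head, "link", None) if head is not None else None
    if link is None:
        return ""  # no <head> or no <link>: nothing to merge
    css_name = link.get("href")
    if not css_name:
        return ""
    css_path = os.path.join(mdx_path, css_name)
    if os.path.exists(css_path):
        with open(css_path, "rb") as f:
            css = f.read()
    elif hasattr(dictionary, "_mdd_db"):
        keys = dictionary.get_mdd_keys("*" + css_name)
        found = dictionary.mdd_lookup(keys[0]) if keys else []
        css = found[0] if found else b""
    else:
        css = b""  # bytes, so that .decode() below is always valid
    return css.decode("utf-8")
```

The fallback is `b""` rather than `''` because a Python 3 `str` has no `decode` method, so the original `else` branch would fail on the final line.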

jns · November 2, 2020, 13:06

What entries are in your txt file? It works fine here and I cannot reproduce the problem.

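hkreporter · November 2, 2020, 13:12

I think I roughly understand why now: every time I open the file in EmEditor, the encoding ends up as UTF-8. Re-saving it as ANSI with Notepad solved the problem.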
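If the word list may arrive in either encoding, one option is to try several when reading it. This is a minimal sketch, assuming that "ANSI" on this machine means GBK (code page 936) and that the list holds one headword per line; `read_word_list` is a hypothetical helper, not a function of MdxConverter.

```python
def read_word_list(path, encodings=("utf-8-sig", "gbk")):
    """Read one headword per line, trying each encoding in turn."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    else:
        raise ValueError("could not decode %r with any of %r" % (path, encodings))
    # Drop blank lines and surrounding whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

`utf-8-sig` is tried first because it also strips the byte order mark that some editors write at the start of UTF-8 files.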

hkreporter · November 2, 2020, 13:17

swlist.txt (8.4 KB) BtD.css (5.7 KB) 1.mdx (1.5 MB)

Could you take a look at where the problem is:
Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 221, in mdx2pdf
  File "MdxConverter.py", line 206, in mdx2html
  File "MdxConverter.py", line 91, in merge_css
  File "MdxConverter.py", line 77, in get_css
AttributeError: 'NoneType' object has no attribute 'link'
[299120] Failed to execute script MdxConverter

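Is it the case that as soon as an entry that is not in the mdx shows up, the link lookup breaks?
Later I ran an extraction with Mdict Editor Tool v2.0.36 to get the keys that can actually be extracted. When I extract using only the words contained in those keys, no error is raised.
out_keys.txt (953 bytes)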
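The workaround described here, keeping only the words that appear among the exported keys, can be sketched as below. `split_by_presence` is a hypothetical helper, and the case-insensitive comparison is an assumption about how the keys should be matched.

```python
def split_by_presence(words, keys):
    """Split words into (found, missing) according to the exported key list."""
    known = {key.strip().lower() for key in keys if key.strip()}
    found, missing = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue  # ignore blank lines in the word list
        if word.lower() in known:
            found.append(word)
        else:
            missing.append(word)
    return found, missing
```

Passing only `found` to the converter and reviewing `missing` by hand keeps absent headwords out of the run.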

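jns · November 2, 2020, 13:57

3.pdf (1.0 MB) MdxConverter.py (9.2 KB)
Fixed the garbled output. The error was caused by headwords that do not exist in the mdx: in that case the head of right_soup was never written, hence the exception.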

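hkreporter · November 2, 2020, 14:27

Thanks, the earlier error is gone this time.
The wkhtmltopdf problem is unchanged, though, even after I removed it from the environment variables.
The generated temp.html has no garbled characters.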

Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\Ref_subject_word_lists in 疑难用法手册\MdxConverter-master\MdxConverter.py", line 255, in <module>
    {
  File "F:\Programs\Mdx\Converter\Ref_subject_word_lists in 疑难用法手册\MdxConverter-master\MdxConverter.py", line 222, in mdx2pdf
    pdfkit.from_file(TEMP_FILE, output_name)
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\api.py", line 46, in from_file
    r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\pdfkit.py", line 45, in __init__
    self.configuration = (Configuration() if configuration is None
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\configuration.py", line 25, in __init__
    raise IOError('No wkhtmltopdf executable found: "%s"\n'
OSError: No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - Installing wkhtmltopdf · JazzCore/python-pdfkit Wiki · GitHub
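The OSError shows that pdfkit's automatic search came back empty (`b''`). One way to avoid depending on that search is to pass the executable's location explicitly. This is a minimal sketch, not the way MdxConverter itself is written: `locate_wkhtmltopdf` and `html_to_pdf` are hypothetical helper names, while `pdfkit.configuration(wkhtmltopdf=...)` and the `configuration=` argument of `pdfkit.from_file` belong to pdfkit.

```python
import os
import shutil


def locate_wkhtmltopdf(explicit_path=None):
    """Return a path to wkhtmltopdf, or None if it cannot be found."""
    if explicit_path:
        return explicit_path if os.path.isfile(explicit_path) else None
    return shutil.which("wkhtmltopdf")  # falls back to a PATH search


def html_to_pdf(html_path, pdf_path, wkhtmltopdf_path=None):
    """Convert an HTML file to PDF using an explicitly configured executable."""
    import pdfkit  # third-party; imported lazily so the locator works without it

    exe = locate_wkhtmltopdf(wkhtmltopdf_path)
    if exe is None:
        raise FileNotFoundError("wkhtmltopdf not found; pass its full path explicitly")
    config = pdfkit.configuration(wkhtmltopdf=exe)
    pdfkit.from_file(html_path, pdf_path, configuration=config)
```

A call might look like `html_to_pdf("temp.html", "3.pdf", r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")`, where the install path is an assumption and should be replaced with the real location.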

jns · November 3, 2020, 00:24

Then I'm not sure. Converting to both HTML and PDF works fine here.

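Vim · November 3, 2020, 01:53

Are you the developer of MdxConverter?

A question: does this tool have specific requirements for the mdx file? Why does the same word list extract fine from one mdx but fail with another?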

jns · November 3, 2020, 02:12

I'm not the author. I haven't run into the situation you describe, so it is hard to say what the cause is.

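Vim · November 3, 2020, 02:38

Using the MdxConverter.py you revised earlier, together with the original poster's word list, css and mdx, I can output HTML or PDF normally (the bookmarks still have a minor issue):

But with the same MdxConverter.py and word list, after swapping the mdx for abc.mdx (1.4 MB), I get the following message and do not know how to resolve it: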

jns · November 3, 2020, 03:09

5.pdf (680.6 KB) MdxConverter.py (9.5 KB)
5.pdf is the exported result. Replace your MdxConverter.py with the attached one.

Vim · November 3, 2020, 03:15

Awesome, perfect!

hkreporter · November 3, 2020, 03:37

Converting to HTML raises no error, but converting to PDF does:
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 266, in <module>
    {
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 233, in mdx2pdf
    pdfkit.from_file(TEMP_FILE, output_name)
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\api.py", line 46, in from_file
    r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\pdfkit.py", line 45, in __init__
    self.configuration = (Configuration() if configuration is None
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\configuration.py", line 22, in __init__
    with open(self.wkhtmltopdf) as f:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 52: invalid continuation byte
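One possible explanation, which nobody in the thread confirms, is that the path pdfkit received for wkhtmltopdf contains Chinese characters encoded in the system's legacy code page, and that those bytes are then decoded as UTF-8. The sketch below only shows how to test such a hypothesis on a given path: `first_utf8_failure` is a hypothetical helper, GBK is assumed as the legacy encoding, and the folder is the one shown in the earlier traceback.

```python
def first_utf8_failure(text, legacy_encoding="gbk"):
    """Encode text in a legacy code page and report where UTF-8 decoding fails.

    Returns (byte_offset, byte_value, reason), or None if the bytes are valid UTF-8.
    """
    raw = text.encode(legacy_encoding)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start], exc.reason
    return None


if __name__ == "__main__":
    folder = r"F:\Programs\Mdx\Converter\Ref_subject_word_lists in 疑难用法手册"
    print(first_utf8_failure(folder))
```

If the reported offset and byte match the ones in the traceback, moving wkhtmltopdf (or the working directory) to an ASCII-only path would be a reasonable thing to try.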

hkreporter · November 3, 2020, 03:38

x.mdx (1.5 MB) swlist 释义比例带词性.rar (35.7 KB)
This dictionary of sense-frequency ratios with part-of-speech tags is quite good.

hkreporter · November 3, 2020, 03:50

swyc.rar (456.9 KB)
I tried extracting from the 三维英词 mdx but only got the Chinese text. Could you give it a try?

Vim · November 3, 2020, 03:50

PDF output raises no error here.