Help request: extracting mdx entries and converting them to HTML/PDF

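Thank you so much!
I kept retrying and installed whatever it said was missing, one package after another, up to bs4 and lxml, and it finally works.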


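The problem of multiple wkhtmltopdf entries being found went away after I removed the wkhtmltopdf install path from the user environment variables. Right after installing I assumed it had not been added, so I had added it manually.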

λ MMdxConverter.py 1.mdx 2.txt 3.html
Lesson 1
agency
WARNING: "agency" not found
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 255, in <module>
    {
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 206, in mdx2html
    right_soup = merge_css(right_soup, os.path.split(mdx_name)[0], dictionary, with_toc)
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 91, in merge_css
    css = get_css(soup, mdx_path, dictionary)
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 77, in get_css
    css_name = soup.head.link["href"]
AttributeError: 'NoneType' object has no attribute 'link'

Another error, this time about link.
How should I change the code to fix it?

1.mdx does not contain the headword agency:
WARNING: "agency" not found

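Switching to a headword that does exist gives the same result.
If the link lookup has to go, what is the best way to change this function?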
def get_css(soup, mdx_path, dictionary):
    css_name = soup.head.link["href"]
    css_path = os.path.join(mdx_path, css_name)
    if os.path.exists(css_path):
        css = open(css_path, 'rb').read()
    elif hasattr(dictionary, '_mdd_db'):
        css_key = dictionary.get_mdd_keys('*' + css_name)[0]
        css = dictionary.mdd_lookup(css_key)[0]
    else:
        css = ''
    return css.decode('utf-8')
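
One possible way to make get_css tolerate a missing head or link is sketched below. This is not the fix from the revised MdxConverter.py attached later in the thread, which is not reproduced here. The name get_css_safe is hypothetical, and returning an empty string when no stylesheet can be found is an assumption. The get_mdd_keys and mdd_lookup calls are used exactly as in the function quoted above.

```python
import os


def get_css_safe(soup, mdx_path, dictionary):
    """Return the stylesheet text for an entry, or '' when none can be found."""
    # soup.head is None when the entry produced no <head>; head.link is None
    # when the head has no <link>. Check both before indexing.
    head = getattr(soup, "head", None)
    link = getattr(head, "link", None) if head is not None else None
    css_name = link.get("href") if link is not None else None
    if not css_name:
        return ""  # assumption: no stylesheet is better than a crash

    css_path = os.path.join(mdx_path, css_name)
    if os.path.exists(css_path):
        with open(css_path, "rb") as f:
            css = f.read()
    elif hasattr(dictionary, "_mdd_db"):
        keys = dictionary.get_mdd_keys("*" + css_name)
        found = dictionary.mdd_lookup(keys[0]) if keys else []
        css = found[0] if found else b""
    else:
        css = b""  # bytes, so that .decode() below works on every branch
    return css.decode("utf-8", errors="replace")
```

Besides the None checks, the fallback value is bytes rather than a str, because the original `css = ''` branch would itself fail on `.decode` under Python 3.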

What entries are in your txt file? It works fine here and I cannot reproduce the error.

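I think I roughly understand why: every time I opened the file in EmEditor, its encoding was changed to UTF-8. Saving it as ANSI with Notepad fixed the problem.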
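
If the word list may arrive as either UTF-8 or ANSI, a reader that tries several encodings avoids depending on which editor last saved the file. A minimal sketch, assuming that "ANSI" on this machine means GBK (cp936) and that the list holds one headword per line; read_word_list is a hypothetical helper, not part of MdxConverter.

```python
def read_word_list(path, encodings=("utf-8-sig", "gbk")):
    """Read one headword per line, trying each encoding in turn."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Nothing decoded cleanly; keep going with replacement characters.
        text = raw.decode("utf-8", errors="replace")
    # Drop blank lines and surrounding whitespace (including a stray '\r').
    return [line.strip() for line in text.splitlines() if line.strip()]
```

"utf-8-sig" is tried first because it also strips the byte order mark that some Windows editors write at the start of UTF-8 files.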

swlist.txt (8.4 KB) BtD.css (5.7 KB) 1.mdx (1.5 MB)

Could you please take a look at where the problem is:
Traceback (most recent call last):
  File "MdxConverter.py", line 258, in <module>
  File "MdxConverter.py", line 221, in mdx2pdf
  File "MdxConverter.py", line 206, in mdx2html
  File "MdxConverter.py", line 91, in merge_css
  File "MdxConverter.py", line 77, in get_css
AttributeError: 'NoneType' object has no attribute 'link'
[299120] Failed to execute script MdxConverter

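Is it the case that as soon as an entry that is not in the mdx shows up, the link lookup breaks?
Later I ran an extraction with Mdict Editor Tool v2.0.36 to get the keys that can actually be extracted. When I extract using only the words contained in those keys, no error is raised.
out_keys.txt (953 bytes)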
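
The workaround above, keeping only words that appear among the exported keys, can be sketched as follows. split_by_keys is a hypothetical helper, and exact, case-sensitive matching is an assumption; the mdx lookup itself may normalise keys differently.

```python
def split_by_keys(words, keys):
    """Split words into (found, missing) according to the exported key list."""
    available = {k.strip() for k in keys}
    found, missing = [], []
    for word in words:
        word = word.strip()
        (found if word in available else missing).append(word)
    return found, missing


found, missing = split_by_keys(["agency", "Lesson 1"], ["agency", "abandon"])
print("convert:", found)
print("skip:", missing)
```

Only the found list would then be passed to the converter; the missing list shows which lines of the word list need attention.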

3.pdf (1.0 MB) MdxConverter.py (9.2 KB)
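The garbled text is fixed. The error came from headwords that do not exist: in that case the head of right_soup was never written, hence the exception.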


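Thanks, no error this time.
The wkhtmltopdf problem is unchanged, even though I have already removed it from the environment variables.
The generated temp.html has no garbled text.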

Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\Ref_subject_word_lists in 疑难用法手册\MdxConverter-master\MdxConverter.py", line 255, in <module>
    {
  File "F:\Programs\Mdx\Converter\Ref_subject_word_lists in 疑难用法手册\MdxConverter-master\MdxConverter.py", line 222, in mdx2pdf
    pdfkit.from_file(TEMP_FILE, output_name)
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\api.py", line 46, in from_file
    r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\pdfkit.py", line 45, in __init__
    self.configuration = (Configuration() if configuration is None
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\configuration.py", line 25, in __init__
    raise IOError('No wkhtmltopdf executable found: "%s"\n'
OSError: No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - Installing wkhtmltopdf · JazzCore/python-pdfkit Wiki · GitHub
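
When pdfkit cannot locate wkhtmltopdf by itself, the path can be passed in explicitly through pdfkit.configuration. A minimal sketch: the candidate path below is only the usual Windows install location and is an assumption, and find_wkhtmltopdf and html_to_pdf are hypothetical helpers rather than MdxConverter functions.

```python
import os
import shutil

# Assumed default install location on Windows; adjust for your machine.
DEFAULT_CANDIDATES = [r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"]


def find_wkhtmltopdf(candidates=DEFAULT_CANDIDATES):
    """Return the first existing candidate, else whatever is on PATH, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("wkhtmltopdf")


def html_to_pdf(html_path, pdf_path, exe_path):
    """Convert an HTML file to PDF using an explicitly chosen wkhtmltopdf."""
    import pdfkit  # imported lazily so the lookup helper works without pdfkit

    config = pdfkit.configuration(wkhtmltopdf=exe_path)
    pdfkit.from_file(html_path, pdf_path, configuration=config)
```

A call such as `html_to_pdf("temp.html", "3.pdf", find_wkhtmltopdf())` would then skip pdfkit's own search, provided find_wkhtmltopdf returned a path rather than None.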

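Then I am not sure. Converting to both HTML and PDF works fine on my end.

Are you the developer of MdxConverter?

May I ask what specific requirements this tool places on an mdx: why does the same entry list extract correctly from some mdx files but not from another?

I am not the author. I have not run into the situation you describe, so I cannot say what the cause is.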

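Using the MdxConverter.py you revised earlier together with the original poster's entries, css, and mdx, I can output HTML or PDF normally (the bookmarks still have a minor problem):

With the same MdxConverter.py and entries, but with the mdx switched to abc.mdx (1.4 MB), I get the following message instead and do not know how to resolve it: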

5.pdf (680.6 KB) MdxConverter.py (9.5 KB)
5.pdf is the exported result; replace your copy of MdxConverter.py with the attached one.


Excellent, works perfectly!

Converting to HTML raises no error, but converting to PDF does:
Traceback (most recent call last):
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 266, in <module>
    {
  File "F:\Programs\Mdx\Converter\MdxConverter-master\MdxConverter.py", line 233, in mdx2pdf
    pdfkit.from_file(TEMP_FILE, output_name)
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\api.py", line 46, in from_file
    r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\pdfkit.py", line 45, in __init__
    self.configuration = (Configuration() if configuration is None
  File "C:\Program Files\Python3.8\lib\site-packages\pdfkit\configuration.py", line 22, in __init__
    with open(self.wkhtmltopdf) as f:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 52: invalid continuation byte
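
The byte 0xd2 at position 52 is consistent with a wkhtmltopdf path that contains Chinese characters, was emitted in GBK, and was then decoded as UTF-8. That reading is an inference from the message, not something confirmed in this thread. The sketch below reproduces this kind of failure with a made-up path and shows a tolerant decoder; decode_path is a hypothetical helper, and the GBK fallback is an assumption about the system code page.

```python
def decode_path(raw, encodings=("utf-8", "gbk")):
    """Decode a bytes path, trying each encoding before giving up."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


# A made-up path, encoded the way a GBK console would emit it.
raw = "F:\\疑难用法手册\\wkhtmltopdf.exe".encode("gbk")
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)
print(decode_path(raw))
```

If this is indeed the cause, moving wkhtmltopdf to an ASCII-only path, or passing its location explicitly through `pdfkit.configuration(wkhtmltopdf=...)`, may sidestep the automatic lookup.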

x.mdx (1.5 MB) swlist 释义比例 带词性.rar (35.7 KB)
This dictionary, which gives sense-frequency ratios along with parts of speech, is quite good.


swyc.rar (456.9 KB)
I tried to extract from the 三维英词 mdx but only got the Chinese text. Could you give it a try?

Exporting to PDF raises no error on my end.
