Help: how to handle word variants when converting stardict to txt and then to mdx

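hh花花 · June 30, 2025, 04:24

6lj6 · June 30, 2025, 08:43

Both a Python script and regular expressions can handle this. The question is how well you know them and how far you have gotten.

That determines how detailed the answer needs to be.

Alternatively, could you upload a sample of the text so that someone else can process it?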

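hh花花 · June 30, 2025, 09:43

I only installed Python yesterday so that I could use pyglossary… I edited the txt in EmEditor (and worked out the format by comparing it against a txt unpacked from an mdx file…). I can't upload the text; it keeps failing, probably because the file is too large, so I can only post the dictionary source: https://www.reader-dict.com/en. The current file is the txt unpacked from the stardict files, with line breaks and </> added.

last_idol · June 30, 2025, 10:46

You can upload just the first 10 lines of the text file.

hh花花 · June 30, 2025, 12:44

I've uploaded a dozen or so lines; basically every entry is in this format. 例.txt (2.4 KB)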


6lj6 · June 30, 2025, 13:01


def process_file(input_file, output_file):
    # No encoding is passed to open(), so Python uses the locale default
    # (for example GBK on a Chinese-language Windows system).
    with open(input_file, 'r') as f:
        lines = f.readlines()

    output = []  # collect entries here and join once at the end
    i = 0
    while i < len(lines):
        # First line of an entry: headwords separated by '|'
        headwords = lines[i].strip().split('|')
        i += 1

        # Remaining lines up to '</>' form the definition body
        body = []
        while i < len(lines) and lines[i].strip() != '</>':
            # Drop the trailing newline so join() does not double it
            body.append(lines[i].rstrip('\n'))
            i += 1

        # Skip the '</>' line
        if i < len(lines):
            i += 1

        # Write one full entry per non-empty headword
        definition = '\n'.join(body).strip()
        for headword in headwords:
            headword = headword.strip()
            if headword:
                output.append(f'{headword}\n{definition}\n</>\n')

    # Save output to file
    with open(output_file, 'w') as f:
        f.write(''.join(output))

# Example usage:
process_file('input.txt', 'output.txt')

hh花花 · July 1, 2025, 03:35

It keeps failing with UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 2315: illegal multibyte sequence
What should I do?

last_idol · July 1, 2025, 04:28

with open(input_file, 'r') as f
Change it to:
with open(input_file, 'r', encoding='utf-8') as f

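hh花花 · July 1, 2025, 07:18

Still not working; the earlier UnicodeEncodeError comes back. With encoding='gbk', errors='ignore' it does produce output, but with a lot of garbled characters. encoding='utf-8' with errors='ignore' doesn't convert either. On top of that, I have to split the file into 20 parts for it to convert; when I fed in the whole file, it ran for two hours without finishing.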
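The UnicodeEncodeError mentioned here would be consistent with the output file still being opened with the platform default encoding, since the suggested change only touched the read side. Below is a minimal sketch, assuming the input really is UTF-8 and follows the layout the script above expects (a headword line with '|' separators, body lines, then '</>'). It passes encoding='utf-8' to both open() calls and works line by line instead of loading the whole file, which may also help with the run time on the full file. The names expand_headwords and process_file_streaming are illustrative and do not come from the thread.

```python
import io


def expand_headwords(src, dst):
    """Read entries from src and write one full entry per '|'-separated headword."""
    headwords = None  # None means the next line read is a headword line
    body = []
    for line in src:
        line = line.rstrip('\n')
        if headwords is None:
            # First line of an entry: split into non-empty headwords
            headwords = [h.strip() for h in line.split('|') if h.strip()]
        elif line.strip() == '</>':
            # End of entry: repeat the body under every headword
            definition = '\n'.join(body).strip()
            for headword in headwords:
                dst.write(f'{headword}\n{definition}\n</>\n')
            headwords, body = None, []
        else:
            body.append(line)
    # Flush a final entry that has no closing '</>' line
    if headwords:
        definition = '\n'.join(body).strip()
        for headword in headwords:
            dst.write(f'{headword}\n{definition}\n</>\n')


def process_file_streaming(input_path, output_path):
    # Assumption: both files are UTF-8; the encoding is set on read AND write
    with open(input_path, 'r', encoding='utf-8') as src, \
         open(output_path, 'w', encoding='utf-8', newline='\n') as dst:
        expand_headwords(src, dst)


# Small in-memory demo with made-up sample data
sample = io.StringIO('run|ran|running\n<p>to move fast</p>\n</>\ncat\n<p>animal</p>\n</>\n')
out = io.StringIO()
expand_headwords(sample, out)
print(out.getvalue())

# For real files (paths are placeholders):
# process_file_streaming('input.txt', 'output.txt')
```

Because each entry is written as soon as its '</>' line is reached, memory use stays roughly constant regardless of file size, so splitting the file into parts should not be necessary with this approach.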

last_idol · July 1, 2025, 07:31

Make sure the txt file is encoded as UTF-8 without BOM; if you're not sure, check it in EmEditor.
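The same check can also be done from Python, which may be convenient since Python is already installed. This is only a rough sketch: it distinguishes "UTF-8 with BOM", "UTF-8 without BOM" and "not valid UTF-8", and does not try to guess which other encoding a file might use. The function names describe_encoding and describe_file are made up for this example.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw bytes as UTF-8 with BOM, UTF-8 without BOM, or not UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8 with BOM'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return 'not utf-8'
    return 'utf-8 without BOM'


def describe_file(path: str) -> str:
    # Reads the whole file as bytes so a multi-byte character is never cut in half
    with open(path, 'rb') as f:
        return describe_encoding(f.read())


# Quick checks on in-memory data
print(describe_encoding('词典'.encode('utf-8')))
print(describe_encoding(codecs.BOM_UTF8 + b'abc'))
print(describe_encoding('词典'.encode('gbk')))

# For a real file (path is a placeholder):
# print(describe_file('input.txt'))
```

If the file turns out to have a BOM, opening it with encoding='utf-8-sig' lets Python skip the BOM when reading.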

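Sunny1 · July 1, 2025, 09:25

Expanding the headwords in place in EmEditor should also work.

Find:

^([^<>|]+?) *\| *([^<>|]+?) *(\|.+)?$

Replace with:

\2\n@@@LINK=\1\n</>\n\1\3

(Repeat the replacement until there are no more matches.)

Result: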

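Sunny1 · July 1, 2025, 11:20

\3 refers to the contents of the third pair of parentheses; changing it to \9 loses that data.
Keep running "Replace All" until every headword has been expanded.
(You can test on a few entries first to see how they are expanded layer by layer.)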
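To see how the repeated replacement unfolds an entry, the find/replace pair can be tried in Python on a small sample. This is a sketch of the same idea, not a drop-in copy of the EmEditor behavior: it adds \n to the excluded characters so that a match cannot run across lines in Python, and it relies on Python's re module (3.5 or later) replacing an unmatched group such as \3 with an empty string. The name expand_links and the sample entry are invented for illustration.

```python
import re

# Same pattern as in the post, with '\n' excluded so a headword match stays on one line
PATTERN = re.compile(r'^([^<>|\n]+?) *\| *([^<>|\n]+?) *(\|.+)?$', re.MULTILINE)
REPLACEMENT = r'\2\n@@@LINK=\1\n</>\n\1\3'


def expand_links(text: str) -> str:
    """Apply the replacement repeatedly until no headword line contains '|'."""
    while True:
        # subn returns the new text and how many replacements were made
        text, count = PATTERN.subn(REPLACEMENT, text)
        if count == 0:
            return text


sample = 'run|ran|running\n<p>to move fast</p>\n</>\n'
print(expand_links(sample))
```

Each pass peels off the second headword as its own @@@LINK entry pointing at the first headword, and leaves the remaining headwords on the original line for the next pass. The definition is therefore stored only once, under the first headword.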

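hh花花 · July 1, 2025, 11:25

I later noticed that content was being lost, so I went back to \3 and repeated the replacement, which is why I deleted my earlier reply. But now every entry in the resulting file has two sets of translations.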