[Practice share] Websters1913.com [20240306]

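I chose to scrape the website again because, beyond the text content itself, I also wanted to reuse its tagging and ready-made layout.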

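The GCIDE data itself is open source and can indeed be downloaded directly. But the source data has all sorts of odd problems that I am not skilled enough to solve.
For example, stray bytes that do not decode cleanly are mixed in and have to be fixed by hand. I ran into read errors earlier, so I checked the source files and found that B, T, and U all have problems. The detection code and its output are at the end of this post.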

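I can't really follow the source code of the websterParser project, and I don't know how to use it to convert the GCIDE source files and then turn the result into mdx. If anyone here understands it, could you give me a few pointers?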

Detection code:

import string

# Try to read each GCIDE source file as strict ASCII and report any that fail.
for letter in string.ascii_uppercase:
    try:
        with open(f'gcide-0.53/CIDE.{letter}', 'r', encoding='ascii') as f:
            f.read()
        print(f'{letter} Done!')
    except UnicodeDecodeError as e:
        print(f'{letter} Skipped: {e}')

Output:

A Done!
B Skipped: 'ascii' codec can't decode byte 0x92 in position 1554656: ordinal not in range(128)
C Done!
D Done!
E Done!
F Done!
G Done!
H Done!
I Done!
J Done!
K Done!
L Done!
M Done!
N Done!
O Done!
P Done!
Q Done!
R Done!
S Done!
T Skipped: 'ascii' codec can't decode byte 0xe7 in position 224967: ordinal not in range(128)
U Skipped: 'ascii' codec can't decode byte 0xb9 in position 1027603: ordinal not in range(128)
V Done!
W Done!
X Done!
Y Done!
Z Done!
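
The check above stops at the first bad byte in each file, so it does not show how many problem spots there are or what surrounds them. A minimal sketch of a follow-up scan is below. It reads each file as raw bytes and lists every byte outside the ASCII range together with a little context, which might make the manual fixes easier. The helper names `find_non_ascii` and `scan_gcide` are my own, and the `gcide-0.53` directory layout is assumed to be the same as in the code above.

```python
import string
from pathlib import Path


def find_non_ascii(data: bytes, context: int = 20):
    """Yield (offset, byte_value, surrounding_bytes) for every byte >= 0x80."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            start = max(0, offset - context)
            yield offset, byte, data[start:offset + context + 1]


def scan_gcide(directory: str = "gcide-0.53"):
    """Print every non-ASCII byte found in CIDE.A .. CIDE.Z (hypothetical helper)."""
    for letter in string.ascii_uppercase:
        path = Path(directory) / f"CIDE.{letter}"
        if not path.exists():
            # Skip quietly if the file is not where we assume it is.
            continue
        for offset, byte, snippet in find_non_ascii(path.read_bytes()):
            print(f"{letter} @ {offset}: 0x{byte:02x} in {snippet!r}")


if __name__ == "__main__":
    scan_gcide()
```

Reading in binary mode avoids the decode error entirely, so the scan can continue past the first hit. What each stray byte was meant to be still has to be judged from the surrounding text.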