Chambers 20th Century Dictionary Data Processing

lurker · August 29, 2021 15:21

Source: Project Gutenberg

Goal: extract information from the HTML files and build an electronic dictionary that is easy to search.

dcurls.txt (232 bytes)

dc.7z (3.0 MB)

lurker · August 31, 2021 15:24

extract_cha_13_alpha2.py (526 bytes)

lurker · August 31, 2021 15:34

The alpha version does just one thing: wrap the main entries. Only once that passes testing does the work move on to the next stage.
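For readers who want a picture of what "wrapping the main entries" could look like, here is a minimal sketch. It is not the attached script. It assumes (hypothetically) that each entry in the Gutenberg HTML is a single `<p>` whose first child is a `<b>` holding the headword; the class names `entry` and `hw` are made up for illustration.

```python
import re

# Assumed input shape (hypothetical): one <p> per entry, headword in the leading <b>.
ENTRY_RE = re.compile(r"<p>\s*<b>(?P<head>[^<]+)</b>(?P<body>.*?)</p>", re.DOTALL)


def wrap_entries(html):
    """Wrap each paragraph that opens with a bold headword in a div.entry."""
    def repl(match):
        head = match.group("head").strip()
        body = match.group("body")
        return f'<div class="entry"><span class="hw">{head}</span>{body}</div>'

    return ENTRY_RE.sub(repl, html)


def count_entries(html):
    """Number of paragraphs that look like main entries under the assumption above."""
    return len(ENTRY_RE.findall(html))
```

Comparing `count_entries` on the raw file with the number of `div.entry` blocks in the output is one cheap way to check this stage before moving on. Paragraphs that do not open with `<b>` are left untouched.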

lurker · September 2, 2021 16:21

Considering "expanding" the headwords with American spellings. For example: Realise >> [Realise, Realize]

One option: Python NLP British English vs American English - Stack Overflow
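A minimal sketch of the expansion idea, assuming a hand-maintained British-to-American mapping is available. The `UK_TO_US` table below is a tiny hypothetical stand-in; real data would have to come from a curated word list, not from these three rows.

```python
# Hypothetical sample mapping; a real table would be much larger.
UK_TO_US = {
    "realise": "realize",
    "colour": "color",
    "centre": "center",
}


def expand_headword(head):
    """Return the headword followed by its American spelling, if one is known."""
    variants = [head]
    us = UK_TO_US.get(head.lower())
    if us:
        # Keep the capitalisation of the original headword.
        if head[0].isupper():
            us = us.capitalize()
        if us not in variants:
            variants.append(us)
    return variants
```

A lookup table is used here instead of suffix rules such as "-ise to -ize" because blanket rules would also rewrite words like "rise" or "wise".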

lurker · September 3, 2021 12:40

See whether Wiktionary's data on regional spellings and variants of words can be extracted.

hua · September 3, 2021 12:44

Why expand it like this? Wouldn't the dictionary software do that?

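lurker · September 3, 2021 13:03

I often look words up by their American spelling. A purely British dictionary returns nothing, so I have to look the word up again in its British spelling (which I can't always spell correctly).

Correction:

The spelling expansion can be left to the dictionary software. Show the queried entry first; if there is none, check whether the dictionary's headword list contains another regional spelling, and show that if so.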

last_idol · September 3, 2021 13:15

Put the expanded variant spellings of the headwords in a separate dictionary, so that every dictionary can use them, similar to The Little Dict.

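lurker · September 3, 2021 13:25

That works. But ideally the dictionary software redirects automatically and the user just sees the result. A redirected entry could carry a line of small print, "Redirected from … to …", to set it apart from a normal entry.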
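The lookup behaviour discussed in the last few posts could be sketched roughly as follows. This is an illustration only, not how any existing dictionary software works; `entries` (one dictionary's headword-to-definition map) and `variants` (a separate, shared spelling table) are hypothetical structures.

```python
def lookup(word, entries, variants):
    """Return (headword, definition, note), or None when nothing matches.

    entries:  headword -> definition for a single dictionary
    variants: spelling -> list of alternative regional spellings,
              kept separately so that several dictionaries can share it
    """
    # 1. Prefer the entry exactly as queried.
    if word in entries:
        return word, entries[word], None
    # 2. Otherwise fall back to another regional spelling present in this dictionary.
    for alt in variants.get(word, []):
        if alt in entries:
            return alt, entries[alt], f"Redirected from {word} to {alt}"
    return None
```

The `note` field is `None` for a direct hit, so the display layer can decide whether to print the small redirect line.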

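lurker · September 5, 2021 10:23

I no longer plan to pattern-match the raw text and then fix it up by hand. The plan to split the entries is shelved, though I will tidy up the layout a little. Computational thinking.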

王寒北 · September 8, 2021 01:19

Giving up at the first bit of difficulty?

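lurker · September 8, 2021 01:33

It isn't difficult. The proofreading part could be done by any pig that can read. But nobody is paying for it, and I don't want to waste dozens of working hours.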

lurker · September 8, 2021 04:01

“We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.”

lurker · September 14, 2021 04:14

Downloading the illustrations: create /images, drop imgurls.txt into it, cd there, and run:


wget -m -e robots=off -np -nd -R 'index.html*' -i imgurls.txt

imgurls.txt (216 bytes)

images.7z (7.7 MB)

lurker · September 20, 2021 02:50

extract_cha_13_beta.py (1.3 KB)

Got the main entries.

lurker · September 20, 2021 18:26

The hard part is tagging the derived words accurately. Tested again and again, revised again and again.
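To make the difficulty concrete, here is a naive first pass, not the logic of the attached script. It assumes (hypothetically) that within one entry the first `<b>` is the headword and every later `<b>` is a derived word; the class names `hw` and `deriv` are invented for the example.

```python
import re

BOLD_RE = re.compile(r"<b>([^<]+)</b>")


def tag_derivatives(entry_html):
    """Mark the first bold run as the headword and all later ones as derivatives."""
    seen = 0

    def repl(match):
        nonlocal seen
        seen += 1
        cls = "hw" if seen == 1 else "deriv"
        return f'<b class="{cls}">{match.group(1)}</b>'

    return BOLD_RE.sub(repl, entry_html)
```

Any bold text that is not a derived word gets mislabelled by this rule, which is presumably where the repeated testing and revising comes in.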

lurker · September 21, 2021 08:06

Will pick this up again over the National Day holiday. Taking a break.

[image: 002CAAL6gy1guo297g5ipg60m80ci0z002]

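王寒北 · September 23, 2021 03:48

The picture has an obvious logic problem.
If the final step of boxing the mooncakes can be fully automated, surely the simpler steps before it could be done by machine too?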

lurker · October 1, 2021 04:25

In the middle of a 10-hour coding marathon with no breaks.

extract_cha_13_beta6.py (7.7 KB)

lurker · October 1, 2021 06:11

Verb participles should be extracted. The b/i tags are nested inside each other.
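One possible way around the nested tags, sketched under assumptions: the labels `pr.p.`, `pa.t.` and `pa.p.` and the exact markup around them are guesses about the source HTML, not something confirmed from the files. The idea is to drop the `<b>`/`<i>` tags first, so that their nesting no longer matters, and then match label and form in the plain text.

```python
import re

# Strip only <b> and <i> tags, whichever way they happen to be nested.
TAG_RE = re.compile(r"</?[bi]>")
# Assumed shape: a label, then the inflected form up to the next ';', ',' or '.'.
PART_RE = re.compile(r"\b(pr\.p\.|pa\.t\.|pa\.p\.)\s*([^;,.]+)")


def extract_participles(entry_html):
    """Return (label, form) pairs found in one entry."""
    text = TAG_RE.sub("", entry_html)
    return [(label, form.strip()) for label, form in PART_RE.findall(text)]
```

This sketch does not handle combined labels such as "pa.t. and pa.p." sharing one form; those would need a separate rule.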