Oxford Advanced Learner's English-Chinese Dictionary, 4th Edition: Text Corrections (updated 2022-02-20)

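mixivivo · 11 Feb 2022, 08:10

Update of 11 Feb 2022

OX-7 makes the following revisions to the text:

1) Compared the headwords against what is probably the "yru version" OALD4 mdx and added nearly 200 headwords;

2) Compared against the standard headwords (the "standard" tag) in the EndNote version of OALD4 and corrected several dozen inaccurate French and Spanish headwords in the source text;

3) Compared against the OALD4 web document downloaded from the txt81 site and restored more than 400 headwords that belong to the derivatives sections (mainly affixes beginning with "-" and phrases beginning with "the"; these were evidently deleted on purpose by the original uploader);

4) Replaced the leftover traditional character "摺" with "折";

5) Fixed a few other small errors along the way, such as Π and π.

With that, the main problems I personally found in the original OALD4 txt have been fixed. Others remain: collateral damage or omissions from the phonetic revision, a few stray traditional characters, the ° symbol used for temperatures and angles, non-standard French spellings such as "fete" and "cafe", and so on. These are fairly trivial and can be fixed gradually later.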

shaozhe · 11 Feb 2022, 08:40

Please package an mdx; errors are easier to spot in actual use. Later changes can simply bump the version number.

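mixivivo · 11 Feb 2022, 09:07

I have uploaded a copy of the OALD4 text to GitHub (GitHub - mahavivo/OX4: Errata for OALD 4th Edition) under a deliberately low-key name (it infringes copyright, after all, so it is best not to advertise it). Few further changes are likely, and git makes version tracking convenient, so I will do the day-to-day revisions on GitHub and post an update here once enough errata have accumulated.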

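mixivivo · 11 Feb 2022, 09:14

What interests me is mainly the raw data. Once the txt is turned into an mdx, both have to be updated in parallel and display-format errors have to be dealt with as well, which is tedious. Anyone interested is welcome to build an mdx; it is simply not to my taste, nor how I use this dictionary.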

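Fince · 11 Feb 2022, 10:38

Is there any text for the new-words supplement in the enlarged edition? If not, the supplement runs to 99 pages in all, and if everyone pitched in we might be able to finish it. Just a thought.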

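Fince · 11 Feb 2022, 11:56

I did half a page of the supplement; I have no experience with this, so it is slow going.
OX4补编.txt (3.7 KB)
By the way, I found an error: the first phonetic transcription of "fossil" should be fɒsl, but the text version has fɔsl.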

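mixivivo · 11 Feb 2022, 12:42

The transcription of "fossil" was probably already wrong on the original CD-ROM, and the mdx versions all carry the same error; the print edition is correct.

I do not think adding the "new words supplement" is worth much: 1) it is not part of the 4th edition and had different translators, so it is not the work edited and translated under Li Beida (李北达); 2) these so-called new words only add bulk; they are not core vocabulary and are of little use to language learners; 3) anyone really interested in new words has OALD 8, 9 and 10, ODE and so on, and need not turn to OALD4. Then there are the technical difficulties: OCR of complex mixed Chinese-English text is troublesome and produces heaps of errors to proofread, and there are currently no clear scans good enough for OCR. Forcing OCR through would just be asking for trouble.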

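Fince · 11 Feb 2022, 12:45

So the supplement and the original edition had different people in charge. Then it really is not worth much; many other dictionaries include these supplementary words too.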

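mixivivo · 11 Feb 2022, 17:40

I looked into it a little. Writing the transcription of "fossil" as fɔsl cannot be called entirely wrong, because different versions of DJ notation use different characters; see DJ音标,DJ音标表，各版本DJ音标对照表_英语音标表图片版_巴士英语网

Separately, here is the Kingsoft phonetic conversion table circulating online (for converting phonetics set in the phonetic font):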

5=ˈ
7=ˌ
9=ˌ
A=æ
B=ɑ
C=ɔ
E=ə
F=ʃ
I=ɪ
J=ʊ
N=ŋ
Q=ʌ
R=ɔ
T=ð
U=u
V=ʒ
W=θ
\\=ɜ
^=ɡ

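This table appears to be flawed: for OALD4 at least, ʊ and u are swapped, and converting both C and R to ɔ cannot be right either. Material cobbled together online evidently cannot be trusted blindly; it does real harm. I will work out how to repair this myself.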

Fince · 11 Feb 2022, 18:05

I had no idea there were so many versions of DJ notation.

mixivivo · 12 Feb 2022, 12:54

To fix the Kingsoft phonetic font in OALD4, the script I originally used was:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

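"""
The original OALD4 document sets its phonetics in the "Kingsoft Phonetic Plain" font,
so they show up as garbage on any computer without that font. This script fixes them in bulk.
For the Kingsoft PowerWord phonetic font code table, see http://www.fmddlmyy.cn/text66.html

"""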

import re


def converter(match):
    phonetic_string = match.group()
    correct_symbol = phonetic_string.replace('5', 'ˈ')\
        .replace('7', 'ˌ').replace('9', 'ˌ')\
        .replace('A', 'æ').replace('B', 'ɑ')\
        .replace('C', 'ɔ').replace('E', 'ə')\
        .replace('F', 'ʃ').replace('I', 'ɪ')\
        .replace('J', 'ʊ').replace('N', 'ŋ')\
        .replace('Q', 'ʌ').replace('R', 'ɔ')\
        .replace('T', 'ð').replace('U', 'u')\
        .replace('V', 'ʒ').replace('W', 'θ')\
        .replace('Z', 'ɛ').replace('\\', 'ɜ')\
        .replace('^', 'ɡ').replace(':', 'ː')\
        .replace('[', 'ɜːr').replace('L', 'ər')\
        .replace('?@', 'US').replace('`', 'ˈ')

    return correct_symbol


def main():
    file_src = r'C:\Users\xxx\Desktop\oald.txt'
    file_dst = r'C:\Users\xxx\Desktop\oald-2.txt'

    with open(file_src, 'r', encoding='UTF-8') as f:
        text = f.read()

        p = re.compile('/.*?; .*?/ ')      # Better to start with "/ .{1,60}?; .{1,40}?/ " or similar and convert in stages

        result = re.sub(p, converter, text)

    with open(file_dst, 'w', encoding='UTF-8') as fout:
        fout.write(result)


if __name__ == '__main__':
    main()

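It now turns out that this mapping table is wrong for OALD4: it converts both "C" and "R" to "ɔ" and has the "J" and "U" conversions swapped. Converting "[" to "ɜːr" and "L" to "ər" is also non-standard, since those are not the characters KK notation uses; they should be "ɝ" and "ɚ" respectively. The corrected conversion function should be: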

def converter(match):
    phonetic_string = match.group()
    correct_symbol = phonetic_string.replace('5', 'ˈ')\
        .replace('7', 'ˌ').replace('9', 'ˌ')\
        .replace('A', 'æ').replace('B', 'ɑ')\
        .replace('C', 'ɒ').replace('E', 'ə')\
        .replace('F', 'ʃ').replace('I', 'ɪ')\
        .replace('J', 'u').replace('N', 'ŋ')\
        .replace('Q', 'ʌ').replace('R', 'ɔ')\
        .replace('T', 'ð').replace('U', 'ʊ')\
        .replace('V', 'ʒ').replace('W', 'θ')\
        .replace('Z', 'ɛ').replace('\\', 'ɜ')\
        .replace('^', 'ɡ').replace(':', 'ː')\
        .replace('[', 'ɝ').replace('L', 'ɚ')\
        .replace('?@', 'US').replace('`', 'ˈ')

    return correct_symbol
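
The same corrected mapping can also be written as a lookup table applied in a single pass, so that no replacement can ever feed into a later one. This is only a sketch, not the script actually used for OX: `KINGSOFT_TO_IPA` and `convert_phonetic` are hypothetical names, and the two raw sample strings are my own guess at how "watch" and "put" would look in the font-encoded text.

```python
import re

# Corrected Kingsoft Phonetic Plain -> Unicode table, copied from the
# converter above. The names used here are illustrative only.
KINGSOFT_TO_IPA = {
    '5': 'ˈ', '7': 'ˌ', '9': 'ˌ', 'A': 'æ', 'B': 'ɑ', 'C': 'ɒ',
    'E': 'ə', 'F': 'ʃ', 'I': 'ɪ', 'J': 'u', 'N': 'ŋ', 'Q': 'ʌ',
    'R': 'ɔ', 'T': 'ð', 'U': 'ʊ', 'V': 'ʒ', 'W': 'θ', 'Z': 'ɛ',
    '\\': 'ɜ', '^': 'ɡ', ':': 'ː', '[': 'ɝ', 'L': 'ɚ',
    '?@': 'US', '`': 'ˈ',
}

# Longest keys first, so the two-character code '?@' is tried before
# any single-character code.
_PATTERN = re.compile('|'.join(
    re.escape(k) for k in sorted(KINGSOFT_TO_IPA, key=len, reverse=True)))


def convert_phonetic(raw):
    """Map every font-encoded code in one pass over the string."""
    return _PATTERN.sub(lambda m: KINGSOFT_TO_IPA[m.group()], raw)


# Assumed raw forms, for illustration only.
print(convert_phonetic('/wCtF; wBtF/ '))
print(convert_phonetic('/pUt; pJt/ '))
```

Like the chained version, this is meant to be called only on the matched phonetic spans, never on the whole file, since ordinary capital letters would otherwise be converted too.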

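However, the corrected script cannot be applied directly to repair the mistakes already in OX-7, and simple search-and-replace will not work either, because mapping the phonetic characters back is a one-to-many relation.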
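
A small sketch of why the mapping cannot simply be run backwards: invert the old table and list every output symbol that has more than one possible source code. `OLD_TABLE` reproduces the table from the first script; `ambiguous_outputs` is a hypothetical helper name.

```python
from collections import defaultdict

# The old (flawed) table from the first script.
OLD_TABLE = {
    '5': 'ˈ', '7': 'ˌ', '9': 'ˌ', 'A': 'æ', 'B': 'ɑ', 'C': 'ɔ',
    'E': 'ə', 'F': 'ʃ', 'I': 'ɪ', 'J': 'ʊ', 'N': 'ŋ', 'Q': 'ʌ',
    'R': 'ɔ', 'T': 'ð', 'U': 'u', 'V': 'ʒ', 'W': 'θ', 'Z': 'ɛ',
    '\\': 'ɜ', '^': 'ɡ', ':': 'ː', '[': 'ɜːr', 'L': 'ər',
    '?@': 'US', '`': 'ˈ',
}


def ambiguous_outputs(table):
    """Return {output symbol: [source codes]} for outputs with several sources."""
    sources = defaultdict(list)
    for code, symbol in table.items():
        sources[symbol].append(code)
    return {sym: codes for sym, codes in sources.items() if len(codes) > 1}


for symbol, codes in ambiguous_outputs(OLD_TABLE).items():
    print(symbol, '<-', codes)
```

The two stress marks are harmless, since either source code yields the right symbol. The ɔ case is the real problem: a ɔ in OX-7 may stand for "C" (which should have become ɒ) or for "R" (a genuine ɔ), and the converted text no longer says which.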

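My solution is to regenerate a correct phonetics dictionary (dict) and then replace the transcription of each headword in OX-7 by table lookup. The raw form of the phonetics dictionary looks roughly like this: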

★zero | /ˈzɪərəʊ; ˈzɪro/
★zest | /zest; zɛst/
★zestful | /-fʊl; -fəl/
★zestfully | /-fʊlɪ; -fəlɪ/
★zigzag | /ˈzɪgzæg; ˈzɪɡzæɡ/
★zillion | /ˈzɪlɪən; ˈzɪljən/
★zinc | /zɪŋk; zɪŋk/
★zing | /zɪŋ; zɪŋ/
★Zion | /ˈzaɪən; ˈzaɪən/
★Zionism | /ˈzaɪənɪzəm; ˈzaɪənˌɪzəm/
★Zionist | /ˈzaɪənɪst; ˈzaɪənɪst/
★zip | /zɪp; zɪp/
★Zip code | /ˈzɪp kəʊd; ˈzɪpˌkod/
★zircon | /ˈzɜːkɒn; ˈzɝˌkɑn/
★zither | /ˈzɪðə(r); ˈzɪðɚ/
★zodiac | /ˈzəʊdɪæk; ˈzodɪˌæk/
★zodiacal | /zəʊˈdaɪəkl; zoˈdaɪəkl/
★zombie | /ˈzɒmbɪ; ˈzɑmbɪ/
★zone | /zəʊn; zon/
★zonal | /ˈzəʊnl; ˈzonl/
★zonked | /zɒŋkt; zɑŋkt/
★zoo | /zuː; zu/
★zoology | /zəʊˈɒlədʒɪ; zoˈɑlədʒɪ/
★zoological | /ˌzəʊəˈlɒdʒɪkl; ˌzoəˈlɑdʒɪkl/
★zoologically | /-klɪ; -klɪ/
★zoologist | /zəʊˈɒlədʒɪst; zoˈɑlədʒɪst/
★zoom | /zuːm; zum/
★zoophyte | /ˈzəʊəfaɪt; ˈzoəˌfaɪt/
★zucchini | /zʊˈkiːnɪ; zuˈkinɪ/
★Zulu | /ˈzuːluː; ˈzulu/

The lookup replacement uses the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import re


def load_phonetics(path):
    """Read 'headword | phonetics' rows into a dict.

    Keys and values are stripped so that '★zero | /.../' yields the key '★zero'.
    If a headword occurs more than once, the last row wins.
    """
    ph_dict = {}
    with open(path, 'r', encoding='UTF-8') as fe:
        for row in fe:
            if '|' not in row:
                continue
            headword, phonetic = row.split('|', 1)
            ph_dict[headword.strip()] = phonetic.strip()
    return ph_dict


def main():
    file_erratum = r'C:\Users\xxx\Desktop\ph.txt'
    file_src = r'C:\Users\xxx\Desktop\OX-7.txt'
    file_dst = r'C:\Users\xxx\Desktop\OX-8.txt'

    # Load the table once, not once per match.
    ph_dict = load_phonetics(file_erratum)

    def converter(match):
        headword = match.group(1)

        if headword in ph_dict:
            correct_symbol = headword + '\n' + ph_dict[headword] + ' '
            print(correct_symbol)
        else:
            correct_symbol = match.group()

        return correct_symbol

    with open(file_src, 'r', encoding='UTF-8') as f:
        text = f.read()

    # A headword line, then a phonetic pair at the start of the next line.
    p = re.compile(r'(★.*?)\n(/.{1,60}?; .{1,40}?/ )')
    result = p.sub(converter, text)

    with open(file_dst, 'w', encoding='UTF-8') as fo:
        fo.write(result)


if __name__ == '__main__':
    main()

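The program itself works and fixed the great majority of the non-standard transcriptions; efficiency is a separate matter. The question now is how to assess the collateral damage such a batch change may cause. With plain-text data that has never been formally normalized, I have always been wary of batch changes by regex or program and use them as little as possible. But with this much data, fixing entries by hand one at a time is not realistic either, so a compromise it has to be.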
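
One possible way to gauge the damage is to compare the file before and after the batch run and review only the lines that changed. This is a rough sketch that assumes the replacement never adds or removes lines, so the two versions can be compared line by line; `changed_lines` is a hypothetical helper and the sample strings are invented.

```python
def changed_lines(before, after):
    """Yield (line number, old line, new line) for every line that differs."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    if len(old_lines) != len(new_lines):
        raise ValueError('line counts differ; use a real diff tool instead')
    for number, (old, new) in enumerate(zip(old_lines, new_lines), start=1):
        if old != new:
            yield number, old, new


# Invented sample: only the second line was touched by the batch run.
before = '★zombie\n/ˈzɔmbɪ; ˈzɑmbɪ/ n\n★zone\n/zəʊn; zon/ n'
after = '★zombie\n/ˈzɒmbɪ; ˈzɑmbɪ/ n\n★zone\n/zəʊn; zon/ n'

for number, old, new in changed_lines(before, after):
    print(number, old, '=>', new)
```

Sampling a few hundred of the reported pairs by eye would give a feel for the error rate without reading the whole file.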

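Searching OX-7 with the regex "/.{1,60}?; .{1,40}?/" returns 29177 results; "/.{1,60}?; .{1,40}?/ " (with a trailing space) returns 27394; and "^/.{1,60}?; .{1,40}?/ " (anchored to start with "/") returns 26935. So OALD4 contains roughly 28000 transcriptions. The lookup script above uses the strictest of these patterns and will modify only 26935 of them; the remaining one to two thousand can only be checked and corrected by hand.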
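
Counts like these can be reproduced with a few lines of Python. This is a sketch only: the `^` anchor needs `re.MULTILINE` to mean "start of line" here, which is my assumption about how the search was run; `count_patterns` is a hypothetical name and the sample text is invented.

```python
import re

PATTERNS = {
    'loose': r'/.{1,60}?; .{1,40}?/',
    'trailing space': r'/.{1,60}?; .{1,40}?/ ',
    'line start': r'^/.{1,60}?; .{1,40}?/ ',
}


def count_patterns(text):
    """Count matches of each pattern; '^' applies per line via re.MULTILINE."""
    return {name: len(re.findall(pattern, text, flags=re.MULTILINE))
            for name, pattern in PATTERNS.items()}


# Invented sample: one pair at line start, one mid-line, one without
# a trailing space.
sample = (
    '★zest\n'
    '/zest; zɛst/ n\n'
    '★zestful\n'
    'adj /-fʊl; -fəl/ full of zest\n'
    '★zinc\n'
    '/zɪŋk; zɪŋk/\n'
)
print(count_patterns(sample))
```

The matches found by the loose pattern but missed by the strict one are exactly the ones left for manual checking.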

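mixivivo · 12 Feb 2022, 13:11

That is not to say the remaining one to two thousand transcriptions are all faulty. Only those whose DJ part contains "ɔ" or "u" (raw forms "C", "J"), or whose KK part contains "ʊ", "ɜːr" or "ər" (raw forms "U", "[", "L"), are wrong and need further manual correction.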
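
A sketch of how those candidates could be pulled out for manual review, assuming the DJ part is whatever precedes "; " and the KK part whatever follows it. `needs_review` is a hypothetical helper; the sample lines reuse transcriptions quoted in this thread, in an invented layout.

```python
import re

# DJ part before '; ', KK part after it.
PHONETIC = re.compile(r'/(.{1,60}?); (.{1,40}?)/')
DJ_SUSPECT = ('ɔ', 'u')
KK_SUSPECT = ('ʊ', 'ɜːr', 'ər')


def needs_review(line):
    """Return True if any phonetic pair on the line contains a suspect symbol."""
    for dj, kk in PHONETIC.findall(line):
        if any(s in dj for s in DJ_SUSPECT) or any(s in kk for s in KK_SUSPECT):
            return True
    return False


sample = [
    'watch /wɔtʃ; wɑtʃ/ v',
    'zest /zest; zɛst/ n',
    'put /put; pʊt/ v',
]
for line in sample:
    print(needs_review(line), line)
```

This only narrows down what to check by hand. It also flags legitimate cases, for instance a DJ transcription with a long "uː" as in "zoo", so every hit still needs a look.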

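匿名1217 · 12 Feb 2022, 13:28

Here is one idea for the original text: first wrap the phonetic parts in html tags, render them with Kingsoft Phonetic, then save the result from the browser. Quick, easy and 100% accurate.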

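mixivivo · 12 Feb 2022, 13:44

The result saved that way is wrong; it is still "garbage". Try it and you will see. Font-specific private codes have to be fixed by substitution with a (correct) mapping table; the browser only takes care of rendering and does not change the underlying data. Besides, tagging the phonetic parts of the original text works on the same principle as modifying them directly with regex or a program, and suffers the same collateral damage and omissions.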

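mixivivo · 12 Feb 2022, 14:18

Also, some of the transcriptions that the Kingsoft phonetic font renders for OALD4 are themselves wrong, or rather non-standard: they do not match the print edition. For "watch", the Kingsoft font renders /wɔtʃ; wɑtʃ/, whereas the printed book has /wɒtʃ; wɑtʃ/; for "put", the Kingsoft rendering is /put; pʊt/ while the book has /pʊt; put/. I do not know how this came about. My principle in correcting the text is to match the print edition as closely as possible.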

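mixivivo · 12 Feb 2022, 15:05

The lookup program here hides a trap: an English word can have more than one pronunciation, while keys in a Python dictionary (dict) are unique. Every occurrence of such a word therefore ends up with the same pronunciation, namely the one listed last. Fortunately such words are fairly rare in English and can be fixed by hand afterwards.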
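
A sketch of how the affected headwords could be listed before running the replacement, so they can be excluded or fixed by hand. It assumes the same "headword | phonetics" row format as the table above; `multi_pronunciation` is a hypothetical helper and the phonetic values are placeholders, not real transcriptions.

```python
from collections import defaultdict


def multi_pronunciation(rows):
    """Map each headword to its distinct pronunciations; keep those with >1."""
    seen = defaultdict(list)
    for row in rows:
        if '|' not in row:
            continue
        headword, phonetic = (part.strip() for part in row.split('|', 1))
        if phonetic not in seen[headword]:
            seen[headword].append(phonetic)
    return {hw: prons for hw, prons in seen.items() if len(prons) > 1}


# Placeholder rows; '/PRON-A/' etc. stand in for real transcriptions.
rows = [
    '★record | /PRON-A/',
    '★zest | /PRON-B/',
    '★record | /PRON-C/',
]
for headword, prons in multi_pronunciation(rows).items():
    print(headword, '=>', len(prons))
```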

mixivivo · 13 Feb 2022, 10:40

I ran a quick count, and OALD4 has more words with multiple pronunciations than I expected.

abstract => 3 | essay => 3 | for => 3 | outside => 3

absent => 2 | abuse => 2 | accent => 2 | advocate => 2 | affiliate => 2 | affix => 2 | agglomerate => 2 | aggregate => 2 | alloy => 2 | ally => 2 | alternate => 2 | animate => 2 | appropriate => 2 | approximate => 2 | articulate => 2 | aspirate => 2 | associate => 2 | attribute => 2 | bass => 2 | bayonet => 2 | belay => 2 | bow => 2 | buffet => 2 | bully => 2 | but => 2 | can => 2 | char => 2 | chink => 2 | cleanly => 2 | clerk => 2 | close => 2 | co-ordinate, coordinate => 2 | co-star => 2 | collect => 2 | combine => 2 | commune => 2 | compact => 2 | complement => 2 | complex => 2 | compound => 2 | compress => 2 | condition => 2 | conduct => 2 | confederate => 2 | conflict => 2 | conjure => 2 | conscript => 2 | conserve => 2 | console => 2 | consort => 2 | consummate => 2 | contact => 2 | content => 2 | contract => 2 | contrary => 2 | contrast => 2 | converse => 2 | convert => 2 | convict => 2 | convoy => 2 | course => 2 | covert => 2 | decoy => 2 | decrease => 2 | defect => 2 | defile => 2 | degenerate => 2 | delegate => 2 | deliberate => 2 | derby => 2 | desert => 2 | designate => 2 | desolate => 2 | diffuse => 2 | digest => 2 | discard => 2 | discharge => 2 | discipline => 2 | discount => 2 | discourse => 2 | document => 2 | duplicate => 2 | elaborate => 2 | entrance => 2 | escort => 2 | estimate => 2 | excess => 2 | excise => 2 | excuse => 2 | expatriate => 2 | exploit => 2 | export => 2 | expose => 2 | extract => 2 | ferment => 2 | finger => 2 | fob => 2 | forearm => 2 | forte => 2 | fragment => 2 | frequent => 2 | gallant => 2 | gill => 2 | graduate => 2 | grave => 2 | have => 2 | house => 2 | hydrate => 2 | impact => 2 | implant => 2 | implement => 2 | import => 2 | impress => 2 | imprint => 2 | incarnate => 2 | incense => 2 | incline => 2 | incorporate => 2 | increase => 2 | indent => 2 | initiate => 2 | inland => 2 | inlay => 2 | insert => 2 | insult => 2 | interchange => 2 | interdict => 2 | interlock => 2 | intimate => 2 | intrigue => 2 | invalid => 2 | 
inverse => 2 | invite => 2 | mare => 2 | minute => 2 | misconduct => 2 | miscount => 2 | mishit => 2 | mismatch => 2 | misprint => 2 | misuse => 2 | moderate => 2 | mouth => 2 | must => 2 | noodle => 2 | object => 2 | offset => 2 | orient => 2 | ornament => 2 | overall => 2 | overbid => 2 | overflow => 2 | overhang => 2 | overhaul => 2 | overhead => 2 | overlap => 2 | overlay => 2 | overnight => 2 | overprint => 2 | overthrow => 2 | overwork => 2 | pace => 2 | pasty => 2 | pate => 2 | patent => 2 | pedal => 2 | pension => 2 | perfect => 2 | perfume => 2 | permit => 2 | pervert => 2 | piano => 2 | pommel => 2 | pontificate => 2 | postulate => 2 | precipitate => 2 | predicate => 2 | prefix => 2 | presage => 2 | present => 2 | primer => 2 | process => 2 | produce => 2 | progress => 2 | project => 2 | prolapse => 2 | prospect => 2 | prostrate => 2 | protest => 2 | purport => 2 | quadruple => 2 | re-count => 2 | read => 2 | rebel => 2 | rebound => 2 | recall => 2 | recap => 2 | recoil => 2 | record => 2 | refill => 2 | refit => 2 | refund => 2 | refuse => 2 | regenerate => 2 | regiment => 2 | rehash => 2 | reincarnate => 2 | reject => 2 | rejoin => 2 | relay => 2 | remake => 2 | remount => 2 | replay => 2 | represent => 2 | reprint => 2 | rerun => 2 | research => 2 | resit => 2 | resume => 2 | retake => 2 | rethink => 2 | retread => 2 | reuse => 2 | rewrite => 2 | rose => 2 | row => 2 | scarify => 2 | second => 2 | segment => 2 | separate => 2 | slough => 2 | some => 2 | sou => 2 | sow => 2 | subcontract => 2 | subject => 2 | subordinate => 2 | substantive => 2 | supplement => 2 | surcharge => 2 | surmise => 2 | survey => 2 | suspect => 2 | syndicate => 2 | tarry => 2 | tear => 2 | that => 2 | there => 2 | torment => 2 | transfer => 2 | transplant => 2 | transport => 2 | triplicate => 2 | undercut => 2 | underestimate => 2 | underground => 2 | undertaking => 2 | unused => 2 | update => 2 | upgrade => 2 | uplift => 2 | use => 2 | used => 2 | valet => 2 | viola => 2 | 
wash => 2 | wind => 2 | woof => 2 | work => 2 | ye => 2

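mixivivo · 14 Feb 2022, 01:32

The OALD4 text really is a bottomless pit. I had thought no major problems were left, but two more have surfaced in a row. First, the phonetic transcription errors, or rather non-standard forms that do not match the print edition: even with the Kingsoft phonetic font installed, some of the characters rendered from the original "garbled" text are wrong. Second, duplicate entries: a check turned up the following 283 duplicate entries: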

accomplish | accomplished | according | accordingly | amino acid | ascorbic acid | addition | additional | additionally | admittance | advisable | advisability | advocacy | Afrikaans | amortize, amortise | amortization, -isation | analogous | analogously | -ance, -ence | appetizer, appetiser | appetizing, appetising | appetizingly, -isingly | BA | Bailey bridge | Belisha beacon | begot, begotten | bo’sn, bos’n | bound | canonize, canonise | catechize, catechise | certitude | chanty, chantey | characterization, -isation | chose, chosen | climate | climatic | climatically | climatology | coop | coop | co-operate, cooperate | co-operator | co-operation, cooperation | co-operative, cooperative | co-operative | co-operatively | co-opt, coopt | co-ordinate clause | co-ordinate, coordinate | co-ordination | co-ordinator | co-pilot, copilot | co-religionist, coreligionist | co-respondent, corespondent | co-signatory, cosignatory | cohere | comedy | common | commonly | common decency | criticize, criticise | crumb | cycle | cycle | daylight | dog days | dig | dig | disorganize, disorganise | dissolution | do’s and don’ts | don’t | down draught (US down draft) | downtrodden | dump | dump | dumper | economic | electron | eloquence | en passant | en route | eulogize, eulogise | eulogist | eulogistic | extra | extra | extra | extra- | femme fatale | foot | foot-and-mouth (disease) | football | foot-bridge | footfall | foot-fault | foothill | foothold | footlights | footloose | footman | footmark | footnote | footpath | footplate | footprint | foot-slog | footsore | footstep | footstool (also stool) | footway | footwear | footwork | foot | -footed | fertilizer, fertiliser | fib | flick | flick | fly | fly-blown | flycatcher | fly-fish | fly-fishing | fly-paper | fly-spray | flyweight | fly | fly-away | fly-by | fly-by-night | fly-half | fly-past | fly | granary | Greenwich Mean Time | handicraft | hers | Hon Sec | humanize, humanise | humanization, humanisation | hyp-, hypo- | 
instead | knavish | knavishly | lapis lazuli | leaf | leaf | leafage | leafless | leafy | leaf-mould | light | light-coloured | light bulb | light meter | light pen | light-year | LL B, LL D, LL M | lunar | man | man | man-at-arms | man-eater | man Friday | man-hour | man-hunt | man of letters, woman of letters | man-made | man-of-war | manservant | man-size (also man-sized) | manslaughter | mantrap | man | man | Manx | mar | Mar | matt, mat (US also matte) | matt, mat (US also matte) | mechanic | minimize, minimise | mollycoddle | niggle | niggling | oct(o)- | orth(o)- | oste(o)- | out of | organization, organisation | organizational, organisational | outhouse | pantingly | ped-, pedo- | per pro | philosophy | pictorial | pictorial | pictorially | politic | pre- | probable | probable | probably | procedure | procedural | psych(o)-, psycho- | quid pro quo | quit | quitter | rapist | recount | recover | recoverable | reform | reform | reformer | relay | relay | recusant | richly | richness | rusk | Scot | set square | shirt | shirting | shirt-front | shirt-sleeve | shirt-tail | shirtwaist | shit | sinner | spec | stipple | tap | tap | tap-root | tap-water | tap | tap | tooth | toothed | toothless | toothy (-ier, -iest) | toothily | toothache | toothbrush | toothpaste | tooth-powder | toothpick | tight | tight | tighten | tightly | tightness | tight-fisted | tight-lipped | timetable (also esp US schedule) | top | troglodyte | unlucky | unluckily | unlucky | unluckily | unmade | urn | vaporous | vice versa | viva voce | viva voce | vowel | weed | weed | weedy | weed-killer | Wesleyan | who’re | wife | wifely | woozy | worsted

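The non-standard transcriptions are being fixed by code plus manual replacement, and the redundant duplicate entries deleted one by one. The result is OX-8.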
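
A sketch of one way to find such duplicates, assuming every entry starts on a line beginning with "★" and that a duplicate means the whole entry text is repeated verbatim. That is my reading, not necessarily how the list above was produced. `split_entries` and `duplicate_entries` are hypothetical names and the sample is invented.

```python
from collections import Counter


def split_entries(text):
    """Split the text into entries, each starting at a line that begins with '★'."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith('★') and current:
            entries.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        entries.append('\n'.join(current))
    return entries


def duplicate_entries(text):
    """Return {headword line: count} for entries that occur more than once."""
    counts = Counter(split_entries(text))
    return {entry.splitlines()[0]: n for entry, n in counts.items() if n > 1}


# Invented sample with one entry repeated verbatim.
sample = (
    '★urn\n/PRON/ n vase\n'
    '★vowel\n/PRON/ n speech sound\n'
    '★urn\n/PRON/ n vase\n'
)
print(duplicate_entries(sample))
```

Entries that share a headword but differ in content are left alone by this check, so they would still need comparing by hand.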

shiruxue · 14 Feb 2022, 14:38

IPA & KK

IPA63 is the old version of IPA; IPA88 is the new one.

青蓝冰水青 · 15 Feb 2022, 07:22

Oxford Advanced Learner's English-Chinese Dictionary, 4th Edition: Text Corrections (updated 2022-02-20)