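Oxford Advanced Learner's English-Chinese Dictionary, 4th Edition: text corrections (updated 2022-02-20)

Updated 2022-02-11

OX-7 revises the text as follows:

1) Compared against the headwords of what is probably the "yru" mdx build of OALD4, and added nearly 200 headwords;

2) Compared against the standard headwords (the "standard" tag) of the EndNote version of OALD4, and corrected several dozen French and Spanish headwords that were inaccurate in the source text;

3) Compared against the OALD4 web document downloaded from the txt81 site, and restored more than 400 headwords that belong to the derived-word sections (mainly affixes beginning with "-" and phrases beginning with "the"; these were evidently deleted on purpose by the original uploader, as deliberate sabotage);

4) Replaced the overlooked traditional character “摺” with “折”;

5) Fixed a few other minor errors along the way, such as Π and π.

At this point the main problems I personally found in the original OALD4 txt have been fixed. Other flaws remain: collateral damage or omissions from the phonetic revision, a few stray traditional characters, the ° symbol used for temperatures and angles, non-standard French spellings such as "fete" and "cafe". These are fairly trivial, though, and can be fixed gradually later.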


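Please package it as an mdx. Errors are easier to spot in actual use, and later changes can simply bump the version number.

I have uploaded a copy of the OALD4 text to GitHub ( GitHub - mahavivo/OX4: Errata for OALD 4th Edition ) under a fairly low-key name (it infringes copyright, after all, so it is better not to advertise it). Since there will probably not be many changes from here on, and git makes version tracking convenient, I will do the routine revisions mainly on GitHub and post an update here once enough errata have accumulated.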


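What interests me is mainly the raw data. Once the txt is turned into an mdx, both have to be kept up to date, and display-format errors need attention as well, which is tedious. Anyone else who is interested is welcome to build an mdx; it is neither my interest nor the way I use this dictionary.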


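Is there a text version of the new-words supplement in the expanded edition? If not, the supplement is 99 pages in total, so perhaps it could be finished if everyone pitched in :grinning: Just an idea.

I did half a page of the supplement; having no experience, I was very slow.
OX4补编.txt (3.7 KB)
Along the way I found an error: the first transcription of "fossil" should be fɒsl, but the text version has fɔsl.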


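The transcription of fossil may already be wrong on the original CD-ROM, and every mdx version has the same error; the printed book is correct.

Personally I do not think adding the "new words supplement" is worth much: 1) it is not part of the 4th edition and was translated by someone else, so it is not the work edited and translated under 李北达 (Li Beida); 2) these so-called new words only add bulk, are not core vocabulary, and do little for language learners; 3) anyone genuinely interested in new words has OALD 8, 9 and 10, ODE and so on, and has no need to turn to OALD4. Beyond that there are technical difficulties: OCR of complex mixed Chinese-English text is troublesome and leaves piles of errors to proofread, and at present there are no scans clean enough for OCR, so forcing it would only be asking for trouble.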


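So the supplement was handled by different people from the original edition; in that case it really is not worth much. Many other dictionaries cover those supplement words anyway.

After reading up a little: writing the transcription of fossil as fɔsl is not strictly an error, because different editions of DJ notation use different characters; see DJ音标,DJ音标表,各版本DJ音标对照表_英语音标表图片版_巴士英语网

Also, here is the Kingsoft phonetic conversion mapping table that circulates online ( 转换音标字体的音标 ):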

5=ˈ
7=ˌ
9=ˌ
A=æ
B=ɑ
C=ɔ
E=ə
F=ʃ
I=ɪ
J=ʊ
N=ŋ
Q=ʌ
R=ɔ
T=ð
U=u
V=ʒ
W=θ
\\=ɜ
^=ɡ

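It looks flawed: at least for OALD4 it has ʊ and u swapped, and mapping both C and R to ɔ cannot be right either. Evidently the hacks floating around online should not be trusted blindly; they do real harm. I will think about how to repair this myself.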
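One way to audit a table like this before trusting it is to check whether two source codes produce the same output glyph. The sketch below is only an illustration and not part of any script in this thread: `OLD_TABLE` transcribes the table above (reading the `\\` row as a single backslash, which is an assumption), and `find_collisions` is a hypothetical helper name.

```python
from collections import defaultdict

# Transcribed from the table above; the "\\" row is assumed to mean one backslash.
OLD_TABLE = {
    "5": "ˈ", "7": "ˌ", "9": "ˌ", "A": "æ", "B": "ɑ", "C": "ɔ",
    "E": "ə", "F": "ʃ", "I": "ɪ", "J": "ʊ", "N": "ŋ", "Q": "ʌ",
    "R": "ɔ", "T": "ð", "U": "u", "V": "ʒ", "W": "θ", "\\": "ɜ",
    "^": "ɡ",
}


def find_collisions(table):
    """Return {glyph: [codes...]} for every glyph produced by more than one code."""
    by_glyph = defaultdict(list)
    for code, glyph in table.items():
        by_glyph[glyph].append(code)
    return {g: sorted(codes) for g, codes in by_glyph.items() if len(codes) > 1}


collisions = find_collisions(OLD_TABLE)
for glyph, codes in collisions.items():
    print(f"{glyph!r} is produced by {codes}")
```

On this table the check reports two shared glyphs: ˌ (from 7 and 9) and ɔ (from C and R). Whether the first is intentional is not something the check can decide; it only lists candidates for review. A shared glyph also means the converted text no longer records which code it came from.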

I had no idea there were so many versions of DJ notation :joy::joy:

To fix the Kingsoft phonetic font in OALD4, the script I originally used was:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

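"""
The original OALD4 document sets its phonetics in the "Kingsoft Phonetic Plain" font,
so they show up as garbage on any computer without that font; this script batch-replaces them.
For the Kingsoft PowerWord phonetic font code table see http://www.fmddlmyy.cn/text66.html

"""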

import re


def converter(match):
    phonetic_string = match.group()
    correct_symbol = phonetic_string.replace('5', 'ˈ')\
        .replace('7', 'ˌ').replace('9', 'ˌ')\
        .replace('A', 'æ').replace('B', 'ɑ')\
        .replace('C', 'ɔ').replace('E', 'ə')\
        .replace('F', 'ʃ').replace('I', 'ɪ')\
        .replace('J', 'ʊ').replace('N', 'ŋ')\
        .replace('Q', 'ʌ').replace('R', 'ɔ')\
        .replace('T', 'ð').replace('U', 'u')\
        .replace('V', 'ʒ').replace('W', 'θ')\
        .replace('Z', 'ɛ').replace('\\', 'ɜ')\
        .replace('^', 'ɡ').replace(':', 'ː')\
        .replace('[', 'ɜːr').replace('L', 'ər')\
        .replace('?@', 'US').replace('`', 'ˈ')

    return correct_symbol


def main():
    file_src = r'C:\Users\xxx\Desktop\oald.txt'
    file_dst = r'C:\Users\xxx\Desktop\oald-2.txt'

    with open(file_src, 'r', encoding='UTF-8') as f:
        text = f.read()

        p = re.compile('/.*?; .*?/ ')      # Safer: start with a bounded pattern such as “/ .{1,60}?; .{1,40}?/ ” and apply the changes in stages

        result = re.sub(p, converter, text)

    with open(file_dst, 'w', encoding='UTF-8') as fout:
        fout.write(result)


if __name__ == '__main__':
    main()

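In hindsight this mapping table is wrong for OALD4: it turns both "C" and "R" into "ɔ", it has the conversions of "J" and "U" swapped, and converting "[" to "ɜːr" and "L" to "ər" is non-standard as well, since those are not the characters KK notation uses; they should be "ɝ" and "ɚ" respectively. The corrected conversion function should be: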

def converter(match):
    phonetic_string = match.group()
    correct_symbol = phonetic_string.replace('5', 'ˈ')\
        .replace('7', 'ˌ').replace('9', 'ˌ')\
        .replace('A', 'æ').replace('B', 'ɑ')\
        .replace('C', 'ɒ').replace('E', 'ə')\
        .replace('F', 'ʃ').replace('I', 'ɪ')\
        .replace('J', 'u').replace('N', 'ŋ')\
        .replace('Q', 'ʌ').replace('R', 'ɔ')\
        .replace('T', 'ð').replace('U', 'ʊ')\
        .replace('V', 'ʒ').replace('W', 'θ')\
        .replace('Z', 'ɛ').replace('\\', 'ɜ')\
        .replace('^', 'ɡ').replace(':', 'ː')\
        .replace('[', 'ɝ').replace('L', 'ɚ')\
        .replace('?@', 'US').replace('`', 'ˈ')

    return correct_symbol

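However, this corrected script cannot be applied directly to repair the mistakes already present in OX-7, and simple find-and-replace does not work either, because mapping the phonetic characters back to their source codes is one-to-many.

The solution I came up with is to regenerate a correct phonetic dictionary (dict) and then replace each word's transcription in OX-7 by table lookup. The raw form of the phonetic dictionary looks roughly like this: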

★zero | /ˈzɪərəʊ; ˈzɪro/
★zest | /zest; zɛst/
★zestful | /-fʊl; -fəl/
★zestfully | /-fʊlɪ; -fəlɪ/
★zigzag | /ˈzɪgzæg; ˈzɪɡzæɡ/
★zillion | /ˈzɪlɪən; ˈzɪljən/
★zinc | /zɪŋk; zɪŋk/
★zing | /zɪŋ; zɪŋ/
★Zion | /ˈzaɪən; ˈzaɪən/
★Zionism | /ˈzaɪənɪzəm; ˈzaɪənˌɪzəm/
★Zionist | /ˈzaɪənɪst; ˈzaɪənɪst/
★zip | /zɪp; zɪp/
★Zip code | /ˈzɪp kəʊd; ˈzɪpˌkod/
★zircon | /ˈzɜːkɒn; ˈzɝˌkɑn/
★zither | /ˈzɪðə(r); ˈzɪðɚ/
★zodiac | /ˈzəʊdɪæk; ˈzodɪˌæk/
★zodiacal | /zəʊˈdaɪəkl; zoˈdaɪəkl/
★zombie | /ˈzɒmbɪ; ˈzɑmbɪ/
★zone | /zəʊn; zon/
★zonal | /ˈzəʊnl; ˈzonl/
★zonked | /zɒŋkt; zɑŋkt/
★zoo | /zuː; zu/
★zoology | /zəʊˈɒlədʒɪ; zoˈɑlədʒɪ/
★zoological | /ˌzəʊəˈlɒdʒɪkl; ˌzoəˈlɑdʒɪkl/
★zoologically | /-klɪ; -klɪ/
★zoologist | /zəʊˈɒlədʒɪst; zoˈɑlədʒɪst/
★zoom | /zuːm; zum/
★zoophyte | /ˈzəʊəfaɪt; ˈzoəˌfaɪt/
★zucchini | /zʊˈkiːnɪ; zuˈkinɪ/
★Zulu | /ˈzuːluː; ˈzulu/

The lookup-and-replace step uses the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


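import re

FILE_ERRATUM = r'C:\Users\xxx\Desktop\ph.txt'
FILE_SRC = r'C:\Users\xxx\Desktop\OX-7.txt'
FILE_DST = r'C:\Users\xxx\Desktop\OX-8.txt'

# A headword line ("★...") followed by a "/DJ; KK/ " transcription line.
ENTRY_RE = re.compile('(★.*?)\n(/.{1,60}?; .{1,40}?/ )')


def load_ph_dict(path):
    """Read 'headword | /DJ; KK/' rows into a dict keyed by headword."""
    ph_dict = {}
    with open(path, 'r', encoding='UTF-8') as fe:
        for row in fe:
            if '|' not in row:
                continue  # skip blank or malformed rows
            headword, phonetic = row.split('|', 1)
            # Strip the padding around '|'. A later row for the same
            # headword overwrites an earlier one.
            ph_dict[headword.strip()] = phonetic.strip()
    return ph_dict


def make_converter(ph_dict):
    def converter(match):
        headword = match.group(1)
        key = headword.strip()

        if key in ph_dict:
            correct_symbol = headword + '\n' + ph_dict[key] + ' '
            print(correct_symbol)
        else:
            correct_symbol = match.group()

        return correct_symbol

    return converter


def main():
    # Load the lookup table once instead of re-reading it for every match.
    ph_dict = load_ph_dict(FILE_ERRATUM)

    with open(FILE_SRC, 'r', encoding='UTF-8') as f:
        text = f.read()

    result = ENTRY_RE.sub(make_converter(ph_dict), text)

    with open(FILE_DST, 'w', encoding='UTF-8') as fo:
        fo.write(result)


if __name__ == '__main__':
    main()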

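The program itself is workable and fixed the vast majority of the non-standard transcriptions; whether it is efficient is a separate matter. The question now is how to assess the collateral damage such a bulk edit may have caused. With plain-text data that has never been formally normalized, I have always been wary of bulk edits by regex or script and use them as little as possible. But the data set is large, and fixing entries by hand one at a time is not realistic either, so some compromise seems unavoidable.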
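One possible way to estimate collateral damage is to diff the file before and after the batch run and review every changed line that does not look like a transcription line. Below is a minimal sketch using Python's standard `difflib`. The helper name `unexpected_changes` and the "line starts with /" test are assumptions about the file layout, not something described in this thread, and the second sample line is made up for the demonstration.

```python
import difflib


def unexpected_changes(before, after):
    """Return (old, new) pairs for changed lines that are not transcription lines.

    A line counts as a transcription line when it starts with "/";
    this is an assumed layout, adjust the test to the real file.
    """
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    suspicious = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_chunk, new_chunk = old_lines[i1:i2], new_lines[j1:j2]
        # Inserted, deleted, or unevenly replaced lines are always reported.
        if len(old_chunk) != len(new_chunk):
            suspicious.append(("\n".join(old_chunk), "\n".join(new_chunk)))
            continue
        for old, new in zip(old_chunk, new_chunk):
            if old != new and not old.startswith("/"):
                suspicious.append((old, new))
    return suspicious


# Made-up sample: the transcription line is expected to change, the note is not.
before = "★put\n/put; pʊt/ v\nsee the /u/ sound\n"
after = "★put\n/pʊt; put/ v\nsee the /ʊ/ sound\n"
result = unexpected_changes(before, after)
for old, new in result:
    print(f"review: {old!r} -> {new!r}")
```

This only narrows down what to read by hand; a wrong replacement inside a genuine transcription line still passes unnoticed.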

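Searching OX-7 with the regex “/.{1,60}?; .{1,40}?/” returns 29177 matches; “/.{1,60}?; .{1,40}?/ ” (with a trailing space) returns 27394; and “^/.{1,60}?; .{1,40}?/ ” (anchored so that the “/” starts the line) returns 26935. This suggests OALD4 contains roughly 28000 transcriptions. The lookup-and-replace script above uses the strictest pattern and therefore changes only 26935 of them; the remaining one to two thousand have to be checked and corrected by hand.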
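The three counts can be reproduced with a few lines of Python. This is a sketch on a made-up sample rather than on OX-7 itself, and the dictionary keys are names chosen for illustration. Note that in Python the `^` anchor matches at the start of every line only when `re.MULTILINE` is set; a text editor's search usually behaves that way by default.

```python
import re

PATTERNS = {
    "loose": re.compile(r"/.{1,60}?; .{1,40}?/"),
    "trailing_space": re.compile(r"/.{1,60}?; .{1,40}?/ "),
    # re.MULTILINE lets ^ match at the start of each line, not just the text.
    "line_start": re.compile(r"^/.{1,60}?; .{1,40}?/ ", re.MULTILINE),
}


def count_matches(text):
    """Count non-overlapping matches of each pattern in text."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}


# Made-up sample with one transcription of each kind.
sample = (
    "★zoo\n"
    "/zuː; zu/ n place where animals are kept\n"
    "★zip\n"
    "/zɪp; zɪp/\n"
    "note: also written /zɪp; zɪp/ in some entries\n"
)
counts = count_matches(sample)
print(counts)
```

The gap between the loose and the strict count is the set of transcriptions a strict batch run leaves untouched.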

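That does not mean the remaining one to two thousand transcriptions are all wrong. Only those whose DJ part contains “ɔ” or “u” (raw forms “C”, “J”), or whose KK part contains “ʊ”, “ɜːr” or “ər” (raw forms “U”, “[”, “L”), are faulty and need further manual correction.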
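The manual pass could be narrowed with a filter along these lines. It is a rough sketch under stated assumptions: a transcription counts as already handled when it starts a line and is followed by a space (an approximation of the strict pattern), the suspect characters are the ones listed above, and `flag_suspects` is a hypothetical helper name. It over-reports on purpose, since a flagged character may well be correct.

```python
import re

PHONETIC_RE = re.compile(r"/(.{1,60}?); (.{1,40}?)/")
DJ_SUSPECTS = ("ɔ", "u")            # suspect characters in the DJ half
KK_SUSPECTS = ("ʊ", "ɜːr", "ər")    # suspect sequences in the KK half


def flag_suspects(text):
    """List leftover transcriptions that contain a suspect character."""
    flagged = []
    for m in PHONETIC_RE.finditer(text):
        at_line_start = m.start() == 0 or text[m.start() - 1] == "\n"
        followed_by_space = text[m.end():m.end() + 1] == " "
        if at_line_start and followed_by_space:
            continue  # assumed to be covered by the strict batch pattern
        dj, kk = m.group(1), m.group(2)
        if any(s in dj for s in DJ_SUSPECTS) or any(s in kk for s in KK_SUSPECTS):
            flagged.append(m.group())
    return flagged


# Made-up layout; the transcriptions themselves are quoted from this thread.
sample = (
    "★put\n"
    "/put; pʊt/ v\n"
    "★zip\n"
    "/zɪp; zɪp/\n"
    "(also /wɔtʃ; wɑtʃ/)\n"
)
flagged = flag_suspects(sample)
print(flagged)
```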

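For the original text, here is one idea: first wrap the phonetic parts in html tags, render them with the Kingsoft Phonetic font, and then save the result from the browser. Quick, convenient, and 100% accurate.

What that approach saves is wrong; it is still "garbage", as you will see if you try it. Text in a private font encoding has to be fixed by substitution with a (correct) mapping table; a browser only takes care of how the text is rendered and does not change the underlying data. Besides, tagging the phonetic parts of the raw text works on the same principle as editing the transcriptions directly with regex or a script, and suffers the same collateral damage and omissions.

Moreover, some of the transcriptions that OALD4 renders with the Kingsoft phonetic font are themselves wrong, or rather non-standard: they do not match the printed OALD4. For "watch", the Kingsoft font renders /wɔtʃ; wɑtʃ/, while the printed book has /wɒtʃ; wɑtʃ/; for "put", Kingsoft gives /put; pʊt/, while the printed book has /pʊt; put/. I do not know how this came about. My principle in correcting the text is to follow the printed book as closely as possible.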

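The lookup-and-replace program here has a pitfall: an English word can have more than one pronunciation, but keys in a Python dict are unique, so every occurrence of such a word ends up with the same pronunciation, namely whichever one appears last in the table. Fortunately such words are relatively rare in English and can be corrected by hand afterwards.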
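A sketch of how the affected headwords could be listed before replacing anything: collect every distinct transcription per headword in a list instead of letting a plain dict overwrite it. `group_pronunciations` is a hypothetical helper, the row format follows the "headword | transcription" layout shown earlier, and the transcriptions in the sample are placeholders, not real data.

```python
from collections import defaultdict


def group_pronunciations(rows):
    """Map each headword to the distinct transcriptions seen for it, in order."""
    table = defaultdict(list)
    for row in rows:
        if "|" not in row:
            continue  # skip blank or malformed rows
        headword, phonetic = (part.strip() for part in row.split("|", 1))
        if phonetic not in table[headword]:
            table[headword].append(phonetic)
    return dict(table)


# Placeholder transcriptions, for illustration only.
rows = [
    "★zip | /PLACEHOLDER-A; PLACEHOLDER-A/",
    "★record | /PLACEHOLDER-B; PLACEHOLDER-B/",
    "★record | /PLACEHOLDER-C; PLACEHOLDER-C/",
    "★zip | /PLACEHOLDER-A; PLACEHOLDER-A/",
]
table = group_pronunciations(rows)
multi = {hw: ph for hw, ph in table.items() if len(ph) > 1}
for hw, ph in multi.items():
    print(f"{hw} => {len(ph)}")
```

Headwords that end up in `multi` could then be excluded from the automatic replacement and fixed by hand, in entry order.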

I ran a quick count, and OALD4 has more words with multiple pronunciations than I expected.

abstract => 3 | essay => 3 | for => 3 | outside => 3

absent => 2 | abuse => 2 | accent => 2 | advocate => 2 | affiliate => 2 | affix => 2 | agglomerate => 2 | aggregate => 2 | alloy => 2 | ally => 2 | alternate => 2 | animate => 2 | appropriate => 2 | approximate => 2 | articulate => 2 | aspirate => 2 | associate => 2 | attribute => 2 | bass => 2 | bayonet => 2 | belay => 2 | bow => 2 | buffet => 2 | bully => 2 | but => 2 | can => 2 | char => 2 | chink => 2 | cleanly => 2 | clerk => 2 | close => 2 | co-ordinate, coordinate => 2 | co-star => 2 | collect => 2 | combine => 2 | commune => 2 | compact => 2 | complement => 2 | complex => 2 | compound => 2 | compress => 2 | condition => 2 | conduct => 2 | confederate => 2 | conflict => 2 | conjure => 2 | conscript => 2 | conserve => 2 | console => 2 | consort => 2 | consummate => 2 | contact => 2 | content => 2 | contract => 2 | contrary => 2 | contrast => 2 | converse => 2 | convert => 2 | convict => 2 | convoy => 2 | course => 2 | covert => 2 | decoy => 2 | decrease => 2 | defect => 2 | defile => 2 | degenerate => 2 | delegate => 2 | deliberate => 2 | derby => 2 | desert => 2 | designate => 2 | desolate => 2 | diffuse => 2 | digest => 2 | discard => 2 | discharge => 2 | discipline => 2 | discount => 2 | discourse => 2 | document => 2 | duplicate => 2 | elaborate => 2 | entrance => 2 | escort => 2 | estimate => 2 | excess => 2 | excise => 2 | excuse => 2 | expatriate => 2 | exploit => 2 | export => 2 | expose => 2 | extract => 2 | ferment => 2 | finger => 2 | fob => 2 | forearm => 2 | forte => 2 | fragment => 2 | frequent => 2 | gallant => 2 | gill => 2 | graduate => 2 | grave => 2 | have => 2 | house => 2 | hydrate => 2 | impact => 2 | implant => 2 | implement => 2 | import => 2 | impress => 2 | imprint => 2 | incarnate => 2 | incense => 2 | incline => 2 | incorporate => 2 | increase => 2 | indent => 2 | initiate => 2 | inland => 2 | inlay => 2 | insert => 2 | insult => 2 | interchange => 2 | interdict => 2 | interlock => 2 | intimate => 2 | intrigue => 2 | invalid => 2 | 
inverse => 2 | invite => 2 | mare => 2 | minute => 2 | misconduct => 2 | miscount => 2 | mishit => 2 | mismatch => 2 | misprint => 2 | misuse => 2 | moderate => 2 | mouth => 2 | must => 2 | noodle => 2 | object => 2 | offset => 2 | orient => 2 | ornament => 2 | overall => 2 | overbid => 2 | overflow => 2 | overhang => 2 | overhaul => 2 | overhead => 2 | overlap => 2 | overlay => 2 | overnight => 2 | overprint => 2 | overthrow => 2 | overwork => 2 | pace => 2 | pasty => 2 | pate => 2 | patent => 2 | pedal => 2 | pension => 2 | perfect => 2 | perfume => 2 | permit => 2 | pervert => 2 | piano => 2 | pommel => 2 | pontificate => 2 | postulate => 2 | precipitate => 2 | predicate => 2 | prefix => 2 | presage => 2 | present => 2 | primer => 2 | process => 2 | produce => 2 | progress => 2 | project => 2 | prolapse => 2 | prospect => 2 | prostrate => 2 | protest => 2 | purport => 2 | quadruple => 2 | re-count => 2 | read => 2 | rebel => 2 | rebound => 2 | recall => 2 | recap => 2 | recoil => 2 | record => 2 | refill => 2 | refit => 2 | refund => 2 | refuse => 2 | regenerate => 2 | regiment => 2 | rehash => 2 | reincarnate => 2 | reject => 2 | rejoin => 2 | relay => 2 | remake => 2 | remount => 2 | replay => 2 | represent => 2 | reprint => 2 | rerun => 2 | research => 2 | resit => 2 | resume => 2 | retake => 2 | rethink => 2 | retread => 2 | reuse => 2 | rewrite => 2 | rose => 2 | row => 2 | scarify => 2 | second => 2 | segment => 2 | separate => 2 | slough => 2 | some => 2 | sou => 2 | sow => 2 | subcontract => 2 | subject => 2 | subordinate => 2 | substantive => 2 | supplement => 2 | surcharge => 2 | surmise => 2 | survey => 2 | suspect => 2 | syndicate => 2 | tarry => 2 | tear => 2 | that => 2 | there => 2 | torment => 2 | transfer => 2 | transplant => 2 | transport => 2 | triplicate => 2 | undercut => 2 | underestimate => 2 | underground => 2 | undertaking => 2 | unused => 2 | update => 2 | upgrade => 2 | uplift => 2 | use => 2 | used => 2 | valet => 2 | viola => 2 | 
wash => 2 | wind => 2 | woof => 2 | work => 2 | ye => 2

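The OALD4 text really is a bottomless pit. I had thought nothing major was left, but two problems surfaced in quick succession. The first is that the transcriptions are mis-transcribed, or rather non-standard and inconsistent with the printed book: even with the Kingsoft phonetic font installed, some of the phonetic characters the original "garbled" text renders are wrong. The second is duplicated entries; a check found the following 283 duplicate entries: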

accomplish | accomplished | according | accordingly | amino acid | ascorbic acid | addition | additional | additionally | admittance | advisable | advisability | advocacy | Afrikaans | amortize, amortise | amortization, -isation | analogous | analogously | -ance, -ence | appetizer, appetiser | appetizing, appetising | appetizingly, -isingly | BA | Bailey bridge | Belisha beacon | begot, begotten | bo’sn, bos’n | bound | canonize, canonise | catechize, catechise | certitude | chanty, chantey | characterization, -isation | chose, chosen | climate | climatic | climatically | climatology | coop | coop | co-operate, cooperate | co-operator | co-operation, cooperation | co-operative, cooperative | co-operative | co-operatively | co-opt, coopt | co-ordinate clause | co-ordinate, coordinate | co-ordination | co-ordinator | co-pilot, copilot | co-religionist, coreligionist | co-respondent, corespondent | co-signatory, cosignatory | cohere | comedy | common | commonly | common decency | criticize, criticise | crumb | cycle | cycle | daylight | dog days | dig | dig | disorganize, disorganise | dissolution | do’s and don’ts | don’t | down draught (US down draft) | downtrodden | dump | dump | dumper | economic | electron | eloquence | en passant | en route | eulogize, eulogise | eulogist | eulogistic | extra | extra | extra | extra- | femme fatale | foot | foot-and-mouth (disease) | football | foot-bridge | footfall | foot-fault | foothill | foothold | footlights | footloose | footman | footmark | footnote | footpath | footplate | footprint | foot-slog | footsore | footstep | footstool (also stool) | footway | footwear | footwork | foot | -footed | fertilizer, fertiliser | fib | flick | flick | fly | fly-blown | flycatcher | fly-fish | fly-fishing | fly-paper | fly-spray | flyweight | fly | fly-away | fly-by | fly-by-night | fly-half | fly-past | fly | granary | Greenwich Mean Time | handicraft | hers | Hon Sec | humanize, humanise | humanization, humanisation | hyp-, hypo- | 
instead | knavish | knavishly | lapis lazuli | leaf | leaf | leafage | leafless | leafy | leaf-mould | light | light-coloured | light bulb | light meter | light pen | light-year | LL B, LL D, LL M | lunar | man | man | man-at-arms | man-eater | man Friday | man-hour | man-hunt | man of letters, woman of letters | man-made | man-of-war | manservant | man-size (also man-sized) | manslaughter | mantrap | man | man | Manx | mar | Mar | matt, mat (US also matte) | matt, mat (US also matte) | mechanic | minimize, minimise | mollycoddle | niggle | niggling | oct(o)- | orth(o)- | oste(o)- | out of | organization, organisation | organizational, organisational | outhouse | pantingly | ped-, pedo- | per pro | philosophy | pictorial | pictorial | pictorially | politic | pre- | probable | probable | probably | procedure | procedural | psych(o)-, psycho- | quid pro quo | quit | quitter | rapist | recount | recover | recoverable | reform | reform | reformer | relay | relay | recusant | richly | richness | rusk | Scot | set square | shirt | shirting | shirt-front | shirt-sleeve | shirt-tail | shirtwaist | shit | sinner | spec | stipple | tap | tap | tap-root | tap-water | tap | tap | tooth | toothed | toothless | toothy (-ier, -iest) | toothily | toothache | toothbrush | toothpaste | tooth-powder | toothpick | tight | tight | tighten | tightly | tightness | tight-fisted | tight-lipped | timetable (also esp US schedule) | top | troglodyte | unlucky | unluckily | unlucky | unluckily | unmade | urn | vaporous | vice versa | viva voce | viva voce | vowel | weed | weed | weedy | weed-killer | Wesleyan | who’re | wife | wifely | woozy | worsted
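How the duplicates above were found is not stated. One possible approach, sketched here purely as an assumption, is to split the text into entries at each "★" headword line and report entries whose full text repeats verbatim. The helper names and the sample bodies are made up for illustration.

```python
from collections import Counter


def split_entries(text, marker="★"):
    """Split text into entries, each starting at a line that begins with marker."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith(marker) and current:
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


def verbatim_duplicates(text, marker="★"):
    """Return {headword line: count} for entries whose whole text occurs more than once."""
    counts = Counter(e.strip() for e in split_entries(text, marker) if e.strip())
    return {e.splitlines()[0]: n for e, n in counts.items() if n > 1}


# Made-up sample: "tap" has two different entries, "rusk" is repeated verbatim.
sample = "★tap\nbody A\n★tap\nbody B\n★rusk\nbody C\n★rusk\nbody C\n"
dupes = verbatim_duplicates(sample)
print(dupes)
```

Comparing whole entries rather than headwords alone avoids reporting genuine homographs, but it misses duplicates that differ by even one character, so the result is a lower bound.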

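The non-standard transcriptions were fixed with code plus manual replacement, and the redundant duplicate entries were deleted one by one. The result is OX-8.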


IPA & KK

IPA63 is the old version of IPA; IPA88 is the new version.


:hugs: :revolving_hearts: :heartbeat: :sparkling_heart: