Notes on the production of the Wenlin Electronic Edition of the Unabridged HDC Index (by Richard Cook)
关于制作文林电子版《未删节 HDC 索引》的说明(作者:理查德-库克)
This 2016 “Unabridged Hànyǔ Dà Cídiǎn Index — Wénlín Electronic Edition” is the direct result of some 12 years of work, remapping, structuring, extending, and correcting the data used for production of the 2003 print edition of the HDC Index (Mair et al., 2003). After the first print publication of my digitization of Shuōwén (2003; also available in electronic form since the 2011 release of Wenlin software 4.0), the next phases of that project began, to segment, translate, proof and extend the prior work. Nearly twelve years ago to the day as I write this, in March of 2004, Prof. Mair approached me with the offer of a CD of data from his 2003 HDC Index project. Recognizing the value of such data for reading, segmenting and understanding classical texts, I immediately said “Yes!” The CD had been sent to Mair by his associate editor Fang Shizeng (方世增). Mair had left it with John DeFrancis in the offices of the ABC dictionary project at the Univ. of Hawaiʻi, where they did not know what to do with it, since the data was in a non-standard form unintelligible to their computer systems. And so Mair instructed them to send the CD to me in California. At the same time, he sent me a print copy of his HDC Alphabetical Index, and had his colleagues in China send me a copy of the 22-volume HDC itself (the print text upon which the HDC Alphabetical Index had been based). With these tools in hand (not all at the same time), I began the long and “impossible” (as Mair’s colleagues in China had called it) task of converting the original data to Unicode encoding. The job would indeed have been completely impossible had it not been for the experience in such matters which we at Wenlin Institute had gained in related work over the prior decade.
2016 年出版的《未删节 Hànyǔ Dà Cídiǎn 索引–文林电子版》是约 12 年工作的直接成果,对 2003 年印刷版《HDC 索引》(Mair et al.)在我的《说文》数字化版本(2003 年;自 2011 年发布文林软件 4.0 以来也有电子版)首次印刷出版后,该项目的下一阶段工作就开始了,对之前的工作进行分段、翻译、校对和扩展。就在我写这篇文章的将近 12 年前,即 2004 年 3 月,Mair 教授找到我,向我提供了他 2003 年 HDC 索引项目的数据光盘。我认识到这些数据对于阅读、分割和理解经典文本的价值,当即表示 "好的!"这张光盘是他的副主编方世增寄给 Mair 的。Mair 把它交给了夏威夷大学 ABC 词典项目办公室的 John DeFrancis,他们不知道该如何处理它,因为数据是非标准格式,他们的计算机系统无法理解。于是,迈尔指示他们把光盘寄给我在加利福尼亚的同事。同时,他给我寄来了他的《HDC 按字母顺序排列的索引》的印刷本,并让他在中国的同事给我寄来了 22 卷本的《HDC》本身(《HDC 按字母顺序排列的索引》就是以印刷本为基础的)。有了这些工具(并非同时),我开始了将原始数据转换为 Unicode 编码的漫长而 “不可能”(麦尔在中国的同事称之为 “不可能”)的任务。如果不是因为我们文林研究所在过去十年的相关工作中积累了丰富的经验,这项工作确实是完全不可能完成的。
The original HDC Index data had some 350,000 entries, encoded in a private (“hacked”) GB-2312 extension, re-using unused GB-2312 code points in order to represent HDC characters not available in GB-2312. That is, the researchers in China, working on this project over many years, and having started at a time when GB 18030 (which extends GB-2312 and maps algorithmically to Unicode) had not yet been invented, had made an inventory of which characters were needed to digitize the polysyllabic entries of HDC. A subset of those characters was available in GB-2312 (the normal simplified 2-byte character encoding), and GB-2312 itself contained a number of other characters which were not used in HDC polysyllables. (For example, HDC has no modern simple-form characters in polysyllabic entries at all: polysyllabic entries are only given in “HDC Full” form, as mentioned above.) Thus, a number of code points in GB-2312 which remained as yet unused in the HDC Index digitization could be re-used in order to represent relatively rare or unencoded HDC characters. This kind of “extended” GB-2312 character set, to which was then added a number of private-use code points not normally used in GB-2312, is non-standard in the extreme, and was completely undocumented, except insofar as the items in the printed HDC Alphabetical Index mapped to the entries in the original 22-volume HDC set. That is, one could look at an entry in the printed Index, match it to a line in the electronic file, and, taking the HDC volume off the shelf, check both against the original HDC page, and so arrive at some sense of what the non-standard electronic characters should be. Once we had created statistical tables of character and code point usage in the original data files, the task of identifying the re-used code points, and remapping them to the formal Unicode code points, could begin.
The task was not infinite, or else we would not have the result before us now; but it was very long, and the result was refined over multiple programming and proofing stages (which still continue today), using a variety of tools and methods, some developed specifically for this project.
最初的 HDC 索引数据有大约 350,000 个条目,这些条目是用私人(“黑客”)GB-2312 扩展编码的,重新使用了未使用的 GB-2312 码位,以表示 GB-2312 中没有的 HDC 字符。也就是说,中国的研究人员经过多年的努力,在 GB 18030(它扩展了 GB-2312,并通过算法映射到统一码)尚未发明时就开始了这一项目,并清点了将 HDC 的多音节词条数字化所需的字符。这些字符的一个子集在 GB-2312(普通的简化 2 字节字符编码)中可用,而 GB-2312 本身还包含许多其他字符,但这些字符在 HDC 多音节词中没有使用。(例如,HDC 的多音节词条中根本没有现代简体字符:如上所述,多音节词条只以 "HDC Full "形式提供)。因此,GB-2312 中一些在《HDC 索引》数字化过程中尚未使用的码位可以重新使用,以表示相对罕见或未编码的 HDC 字符。这种 "扩展的 "GB-2312 字符集,再加上一些 GB-2312 中通常不使用的私人使用的码位,是极端非标准的,而且完全没有记录,只有印刷版《HDC 字典索引》中的条目与最初的 22 卷 HDC 字符集中的条目相对应。也就是说,我们可以查看印刷版索引中的条目,将其与电子文档中的一行进行匹配,然后从书架上取下 HDC 卷,将两者与原始 HDC 页进行核对,从而得出一些非标准电子字符的含义。 在创建了原始数据文件中字符和码位使用情况的统计表之后,就可以开始识别重复使用的码位,并将它们重新映射到正式的 Unicode 码位。这项工作并不是无限的,否则就不会有现在的结果:但这项工作非常漫长,需要经过多个编程和校对阶段(至今仍在继续),使用各种工具和方法,其中一些是专门为该项目开发的。
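The two steps described above (tabulating code point usage, then remapping re-used code points) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program actually used for the project: the names REMAP, iter_units, usage_table and remap are hypothetical, and the single remap entry is an invented placeholder rather than a value from the real data.

```python
from collections import Counter

# Hypothetical override table: two-byte codes that the private encoding
# re-used, mapped to the character actually intended. The single entry is an
# invented placeholder (the code for simplified 汉 standing in for a rare
# character), not a value taken from the real HDC Index data.
REMAP = {
    tuple("汉".encode("gb2312")): "\u4dae",
}

def iter_units(data: bytes):
    """Yield single bytes as ints and two-byte codes as (lead, trail) tuples."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or i + 1 >= len(data):
            yield b
            i += 1
        else:
            yield (b, data[i + 1])
            i += 2

def usage_table(data: bytes) -> Counter:
    """Count how often each two-byte code occurs in the raw file."""
    return Counter(u for u in iter_units(data) if isinstance(u, tuple))

def decode_unit(unit) -> str:
    """Decode one unit, letting the override table win over standard GB-2312."""
    if isinstance(unit, int):
        return chr(unit) if unit < 0x80 else "\ufffd"
    if unit in REMAP:
        return REMAP[unit]
    try:
        return bytes(unit).decode("gb2312")
    except UnicodeDecodeError:
        return "\ufffd"  # still unidentified: leave a visible marker

def remap(data: bytes) -> str:
    return "".join(decode_unit(u) for u in iter_units(data))
```

In a sketch like this, the usage table is what points a proofer at suspicious codes (for instance, a simplified-only character turning up in entries that should contain only Full forms), and each code identified against the printed page becomes one more entry in the override table.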
In tandem with the HDC remapping project, we at Wenlin Institute had been developing CDL font technologies, and promoting this technology in international standards arenas, for management of Unihan data. One result was Appendix F of the Unicode Standard, documenting the CJK Strokes block which we helped to encode. CDL is a multi-faceted tool, giving us the ability, for example, to create the precise character forms needed, and the most detailed indexes imaginable, of radical/stroke information, for the complete Unicode CJKV character set. At present, the CDL database contains more than 100,000 descriptions and more than one million strokes. The CDL database, refined and extended in conjunction with our Unihan and Shuōwén projects, became instrumental to the HDC Index digitization project. Using CDL data, we were able to digitize the complete HDC radical/stroke chart, including all monosyllabic, Simple and Full entries, and extend that chart to ensure that it included all characters and variants attested in HDC polysyllables. CDL data enabled us to re-sort our Unicode version of the Alphabetical HDC Index data back into the order of the original 22-volume HDC text, for the final proofing stages. We had during this time also gained access to digital images of all pages of the 22-volume HDC text, and had linked those images into our database look-up system, to speed up the proofing process. It was no longer necessary to hoist the large volumes on and off the shelves in the checking process. Rather, checks could be made with a click of the mouse. And we had worked with colleagues in the Unicode Consortium to produce (among other things) complete mapping and pīnyīn data for 《汉语大字典》 Hànyǔ Dà Zìdiǎn (HDZ), the comprehensive Chinese character dictionary in eight volumes (Wuhan, 1986), which had served as an important reference work for the original HDC (2001) compilers. Our HDZ pīnyīn data (monosyllable readings) was key to our proofing all of the HDC pīnyīn data.
在开展 HDC 重映射项目的同时,我们文林研究所一直在开发 CDL 字体技术,并在国际标准领域推广这项技术,以管理 Unihan 数据。其中一项成果是统一码标准的附录 F,其中记录了我们帮助编码的中日韩笔画块。CDL 是一个多方面的工具,使我们有能力为完整的 Unicode CJKV 字符集创建所需的精确字符形式和所能想象到的最详细的部首/笔画信息索引。目前,CDL 数据库包含 10 万多个描述和 100 多万个笔画。CDL 数据库与我们的 Unihan 和 Shuōwén 项目共同完善和扩展,成为 HDC 索引数字化项目的重要工具。利用 CDL 数据,我们能够将完整的 HDC 部首/笔画表数字化,包括所有单音节、简单和完整条目,并扩展该表以确保其包括 HDC 多音节中证实的所有字符和变体。CDL 数据使我们能够按照 22 卷 HDC 原始文本的顺序,对 Unicode 版本的字母 HDC 索引数据进行重新排序,以便进行最后的校对工作。在此期间,我们还获得了 22 卷 HDC 文本所有页面的数字图像,并将这些图像链接到我们的数据库查询系统中,以加快校对过程。在核对过程中,不再需要将大卷图书搬上搬下书架。相反,只需点击一下鼠标就可以进行核对。 我们还与统一码联盟的同事合作,为《汉语大字典》Hànyǔ Dà Zìdiǎn (HDZ)(八卷本的综合性汉字字典,武汉,1986 年)提供了完整的映射和 pīnyīn 数据(除其他外),该字典是《汉语大字典》(2001 年)最初编纂者的重要参考书。我们的 HDZ pīnyīn 数据(单音节读音)是我们校对所有 HDC pīnyīn 数据的关键。
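Re-sorting alphabetical data into radical/stroke order amounts to sorting on a key derived from a character table. The sketch below assumes a hypothetical RS_TABLE that gives each character a radical number and a residual stroke count; the four entries use ordinary Kangxi-style values for illustration only, and neither the table nor the exact ordering rule is taken from the CDL database or from the HDC chart itself.

```python
# Hypothetical radical/stroke table: character -> (radical number, residual strokes).
# Illustrative Kangxi-style values only; the real project drew on CDL data.
RS_TABLE = {
    "一": (1, 0),
    "丁": (1, 1),
    "人": (9, 0),
    "他": (9, 3),
}

UNKNOWN = (10**6, 0)  # characters missing from the table sort last

def char_key(ch: str):
    radical, residual = RS_TABLE.get(ch, UNKNOWN)
    # The code point is a final tie-breaker so that the order is deterministic.
    return (radical, residual, ord(ch))

def entry_key(entry: str):
    """Sort key for a whole entry: compare character by character."""
    return tuple(char_key(ch) for ch in entry)

def resort(entries):
    return sorted(entries, key=entry_key)
```

For example, resort(["他人", "丁", "一丁", "人"]) places "一丁" first, then "丁", "人" and "他人". Whether polysyllables under a head character should be ordered in exactly this way is an assumption of the sketch.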
The HDC Index data is multi-layered, and each layer adds its own complications to the proofing process. As mentioned in the Index’s original front matter, it was a significant challenge for the editors and inputters simply to decide on the “correct” pīnyīn reading for some entries. Repeat that process hundreds of thousands of times using radical-stroke charts and other paper indexes, with numerous print sources and many contributors, and some amount of variation and noise is bound to occur. HDC includes an abundance of vocabulary drawn up (by slender rope) from the deep well of Chinese history. Some words have been written in various ways, and pronunciations have changed dramatically over time. Many words are not in common use today, and even for those words which may seem to be familiar, there are perils in applying modern understanding to the interpretation of ancient materials. Nevertheless, the editors and inputters managed to work together to digitize all of the different character forms (~15,000 types) occurring in polysyllables, and arrive at a pīnyīn reading for every single syllable (~800,000) of every single entry (~350,000). The resulting raw text of the Index contained some 10 million characters. The main computer program used to reformat and proof the raw data has some 3300 lines of code: this includes some 800 lines of errata, but excludes several thousand lines of external Hànzì mapping and pīnyīn spell-checking tables (not to mention all of the Wenlin C source code). The resulting Wenlin electronic database contains some 58 million bytes (excluding the tree index, which contains more than 112 million bytes). For an idea of what numbers this big mean: if 58 million bytes were to tick by like seconds, that would take more than 22 months.
HDC 索引的数据是多层次的,每一层都给校对过程增加了复杂性。正如索引最初的封面所提到的,对于编辑和输入人员来说,仅仅为某些条目确定 "正确的 "pīnyīn 读法就是一项巨大的挑战。使用部首笔画表和其他纸质索引,在众多印刷来源和众多撰稿人的情况下,重复这一过程成百上千次,必然会出现一些变化和噪音。HDC 收录了大量从中国历史深井中(用细绳)提取的词汇。有些词的写法各不相同,发音也随着时间的推移发生了巨大变化。许多词今天已不常用,即使是那些看似熟悉的词,在解释古代材料时应用现代理解也有风险。尽管如此,编辑和输入人员还是共同努力,将多音节词中出现的所有不同字形(约 15,000 种)数字化,并为每一个词条(约 350,000 个)的每一个音节(约 800,000 个)确定了 pīnyīn 读法。索引的原始文本包含约 1,000 万个字符。用于重新格式化和校对原始数据的主要计算机程序有大约 3300 行代码:其中包括大约 800 行勘误,但不包括几千行外部 Hànzì 映射和 pīnyīn 拼写检查表(更不用说所有的文林 C 源代码)。由此产生的文林电子数据库包含约 5,800 万字节(不包括树索引,它包含超过 1.12 亿字节)。为了了解这么大的数字意味着什么:如果 5,800 万字节像秒钟一样流逝,那将需要超过 22 个月的时间。
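The closing comparison is easy to check. The short calculation below assumes a mean Gregorian month of 365.2425 / 12 days, which is a choice made for this sketch and is not stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 365.2425 / 12  # mean Gregorian month; an assumption of this sketch

def seconds_to_months(seconds: float) -> float:
    """Convert a number of seconds to mean calendar months."""
    return seconds / SECONDS_PER_DAY / DAYS_PER_MONTH

database_months = seconds_to_months(58_000_000)     # the 58 million byte database
tree_index_months = seconds_to_months(112_000_000)  # the 112 million byte tree index
```

Under that assumption the database figure comes to just over 22 months, in line with the text, and the tree index to about three and a half years.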
The correctness of a given pīnyīn reading in that vast ocean of text is demonstrable in terms of HDZ and its sources, and in terms of what HDC itself has to say about the readings of its characters. In some cases, HDZ and HDC disagree, with each other or internally. HDZ only presents partial information, and sometimes contradicts itself. HDC may present information unavailable in HDZ, possibly reflecting different focus, differing interpretation, or revision. HDC monosyllable entries usually list all of the various HDC pronunciations of a given character attested in polysyllables. But sometimes a reading used in a given polysyllable is noted in HDC only under that polysyllabic entry (possibly in relation to a variant writing). And sometimes a reading of a character appearing in the HDC Index is unattested in HDZ, and unsubstantiated anywhere in HDC itself. HDC assigns a number to each pīnyīn reading of a given character, and sometimes these numbers were confused with tone numbers during input. With the combination of many such things considered, there are many ways for things to go wrong: it sometimes becomes hard to know what “right” might mean, and the proofing of the remapped HDC Index data starts to reach the maximum level of complexity. One is forced to consider the intentions of the editors and typists: were they faithfully representing the available textual information, interpreting it to the limited extent required for digitization, or revising it, intentionally or accidentally? One is forced sometimes even to try to guess the dialect of the typist, in an attempt to identify and correct obvious errors and inconsistencies, using statistical and other evidence unavailable to previous editors and proofers. It is indeed a difficult task even to try to weigh all the evidence in any given problematic case. One must apply corrections to the data only where evidence and confidence abound. 
Where the evidence was unclear, or confidence lacking, we have registered our doubt with a question mark in the Zìdiǎn entry, and left the text of the HDC Index untouched. The father of Chinese lexicography 許慎 Xǔ Shèn (d. ~147) expressed a similar “hands-off” editorial philosophy at the end of his Shuōwén Postface (translated by me, 2003:47): “hearing doubts, I convey doubts”. We have been very fortunate to hear only statistically minor doubts, thanks to the rigorous editorial standards evident in the original works. But when dealing with such enormous printed works, even tiny and normally insignificant irregularities may accumulate, and will need to be addressed in one way or another, if computer code is to run reliably. In our work to give the printed HDC Index a digital incarnation, to try to open a new avenue to the gateway to “the courtyard of the sacred arts”, we have been forced to confront our own limitations, and to work with limited resources at the limits of human endurance. Language is imprecise, writing barely captures speech, and histories are told from many perspectives, imperfectly transmitted. Having easier access to more information sometimes leads to clear answers, but if it only leads to the formulation of new questions then that too is a kind of success. The wide span of language and literature reevaluated in HDC itself was revisited in the HDC Index, and reworked in our digitization. Each generation hopes to benefit from prior generations of study, and in this spirit we hope that our current findings provide a useful foundation for the building of far better tools in the future, for easier access and improved understanding of Chinese textual traditions.
在浩如烟海的文本中,特定 pīnyīn 读法的正确性可以从 HDZ 及其来源,以及 HDC 本身对其字符读法的论述中得到证明。在某些情况下,HDZ 和 HDC 之间或内部存在分歧。HDZ 只提供部分信息,有时甚至自相矛盾。HDC 可能提供 HDZ 无法提供的信息,可能反映了不同的侧重点、不同的解释或修订。HDC 单音节词条通常会列出某个字符在多音节中的所有 HDC 读音。但有时,某个多音节词中使用的读音在 HDC 中只出现在该多音节词条目下(可能与变体书写有关)。有时,HDC 索引中出现的某个字符的读音在 HDZ 中没有证明,在 HDC 本身的任何地方也没有得到证实。HDC 为给定字符的每个 pīnyīn 读音分配了一个编号,有时在输入过程中这些编号会与声调编号相混淆。考虑到许多这样的事情,出错的方式有很多:有时很难知道 "正确 "可能意味着什么,重新映射的 HDC 索引数据的校对开始达到最大的复杂程度。我们不得不考虑编辑和打字员的意图:他们是忠实地反映了现有的文本信息,在数字化所需的有限范围内解释了这些信息,还是有意无意地修改了这些信息?有时,我们甚至不得不尝试猜测打字员的方言,试图利用以前的编辑和校对人员无法获得的统计和其他证据,找出并纠正明显的错误和不一致之处。 在任何一个有问题的案例中,即使试图权衡所有证据也确实是一项艰巨的任务。只有在证据和信心充足的情况下,我们才必须对数据进行修正。在证据不明确或缺乏信心的情况下,我们在 Zìdiǎn 条目中用问号表示我们的怀疑,而对《HDC 索引》的正文不作任何改动。中国词学之父许慎(卒于约 147 年)在《说文解字序》(由我翻译,2003:47)的结尾表达了类似的 "放手 "编辑理念:“听疑则传疑”。我们非常幸运地只听到了统计意义上的小疑问,这要归功于原著中显而易见的严格编辑标准。但是,在处理如此巨大的印刷作品时,即使是微小的、通常无关紧要的不规范之处也可能积累起来,如果计算机代码要可靠地运行,就需要以某种方式加以解决。在我们为印刷版《HDC 索引》提供数字版本,试图为通往 "神圣艺术殿堂 "的大门开辟一条新途径的过程中,我们不得不面对自身的局限性,不得不在人类耐力的极限下利用有限的资源开展工作。语言是不精确的,文字只能勉强捕捉言语,而历史则是从多个角度讲述的,其传播方式并不完美。更容易获得更多的信息有时会带来明确的答案,但如果它只是导致提出新的问题,那也是一种成功。HDC 本身重新评估了语言和文学的广泛跨度,HDC 索引重新审视了这一跨度,我们的数字化工作也对这一跨度进行了再加工。 每一代人都希望从上一代人的研究中获益,本着这种精神,我们希望我们目前的研究成果能为将来建立更好的工具提供有益的基础,使人们更容易获取和更好地理解中国的文字传统。
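One of the checks described above, telling a tone number apart from an HDC reading number, can be sketched as follows. This is a hedged illustration, not the project's actual spell-checker: the READINGS table and the check_reading function are hypothetical, readings are written in tone-number form to keep the code short, and the listed readings and their order are illustrative rather than copied from HDC.

```python
# Hypothetical table of attested readings per character, in numbered order.
# Illustrative values only, not copied from HDC or HDZ.
READINGS = {
    "好": ["hao3", "hao4"],
    "行": ["xing2", "hang2", "hang4", "heng2"],
}

def check_reading(char: str, typed: str):
    """Classify a typed reading and return (status, proposed text).

    Nothing is corrected silently: an unresolved case keeps the typed
    reading and gains a question mark, in the spirit of conveying doubt.
    """
    readings = READINGS.get(char)
    if readings is None:
        return ("unknown", typed + "?")
    if typed in readings:
        return ("ok", typed)
    base, digit = typed[:-1], typed[-1]
    if digit.isdigit():
        index = int(digit)
        # Was the digit perhaps a reading number rather than a tone number?
        if 1 <= index <= len(readings) and readings[index - 1][:-1] == base:
            return ("index-confusion", readings[index - 1])
    return ("doubt", typed + "?")
```

Here check_reading("好", "hao2") reports a likely reading-number confusion and proposes "hao4", while check_reading("行", "xing4") finds no supporting evidence and returns the typed reading marked with a question mark. A proposal of this kind would still need a human proofer to weigh it against the printed page.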