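Agreed. I think the new dictionary format should be based on JSON; with dictionaries, quality matters more than quantity anyway.
Making EPUB the primary target while also covering dictionary needs, similar to Kindle's KFX format, is the best choice. In practice you could just add an entries directory inside the EPUB's directory tree and put the entries there; nothing else needs to change.
I'm only using EPUB as an example here. What should really happen is to borrow EPUB's architecture and design the document AST from scratch.
The dictionary boom is over. Building yet another HTML-based dictionary app is a big step back into the old world. From a product standpoint there is no need for it, since the existing MDX already covers that.
I think the biggest problem with HTML-based MDX dictionaries is that HTML is too flexible and free-form. Every dictionary maker uses different classes and a different HTML structure, so anything like parsing senses has to be adapted dictionary by dictionary, with no unified approach.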
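To make the "entries directory inside an EPUB-like container" idea concrete, here is a minimal sketch. The layout (one JSON file per headword under entries/) and the function name build_container are assumptions made up for illustration; nobody in the thread specified them, and the result is not a valid EPUB (it has no META-INF/container.xml or package document).

```python
import io
import json
import zipfile


def build_container(entries: dict) -> bytes:
    """Pack entries into a ZIP laid out like an EPUB with an extra entries/ directory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # EPUB expects an uncompressed "mimetype" file as the first archive member.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for headword, body in entries.items():
            # Hypothetical naming scheme: using the headword as a file name is
            # naive (think of "/" or NFC/NFD differences) and only for illustration.
            zf.writestr(f"entries/{headword}.json",
                        json.dumps(body, ensure_ascii=False),
                        compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


data = build_container({"take": {"pos": "v.", "defn": "to remove something from a place"}})
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

A real design would still need an index, since listing a ZIP directory does not give you sorted or fuzzy lookup.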
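MDict's most outrageous problems, the ones that must be solved:
- Text is not normalized. This affects not just headwords; file and folder names have it even worse. Any file that passes through macOS comes out NFD-normalized, and packing those in creates a huge number of pitfalls. In MDict, café (precomposed, NFC) and café (decomposed, NFD) are different entries, so a lookup cannot find the café in an mdx v3 file.
- The dictionary file, resource files, accompanying CSS and JS, cover image and so on have no clear file-naming convention, let alone a precedence order.
- The @@@LINK= logic is a complete mess. Should the target share history or be displayed again as a duplicate? Does a link containing # mean the whole string is a headword, or a headword plus an id to jump to? Going by the original author's description there is exactly one target entry, but in many dictionaries there is more than one.
- A pitfall planted by later developers: the explicit instruction "use the <a href="sound…… form, only …… is supported" gets treated as decoration, so these cases need extra adaptation. The dictionary app's listener may also conflict with the ones inside the dictionary.
- A big pitfall planted by later developers: early mdx was not open source and the file format was never officially published, so compatibility built on reverse engineering is unreliable. For example, in mdx v2 the fixed \r\n terminator after an entry's content is supposed to be simply cut off, but some files only have a \n terminator, so the cut can take the closing > of the HTML with it. Some mdx files even invent things out of thin air, such as Big5 encoding.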
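The NFC/NFD point above is easy to reproduce. The sketch below uses Python's standard unicodedata module; normalize_key and the one-entry index are hypothetical names for illustration and say nothing about how MDict itself stores keys.

```python
import unicodedata


def normalize_key(headword: str) -> str:
    """Return the NFC form so precomposed and decomposed spellings collide."""
    return unicodedata.normalize("NFC", headword)


nfc = "caf\u00e9"    # precomposed é (U+00E9)
nfd = "cafe\u0301"   # e followed by a combining acute accent (U+0301)

# An index built from normalized keys finds either spelling.
index = {normalize_key(nfc): "entry body"}

print(nfc == nfd)                    # False: different code point sequences
print(normalize_key(nfd) in index)   # True: same key after normalization
```

The same normalization would have to be applied to resource file names at packing time and again at lookup time, otherwise the mismatch just moves elsewhere.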
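Only after that should new features and content structure be considered.
For example, mdx only supports plain-English sorting and case-insensitive matching, which is unusable for other languages. mdx v3 has something new here, but I fear it has even more pitfalls. There is no "secondary index" (from the user's point of view). @@@LINK may have helped a little, but it also made dictionary content even messier...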
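As a rough illustration of what a sort key beyond "plain English, ignore case" could look like, here is a sketch that strips combining marks and case-folds. This is an assumption about one possible approach, not what mdx v3 does, and it is not real collation: language-specific ordering would need a proper collation library.

```python
import unicodedata


def sort_key(headword: str) -> str:
    """Crude accent- and case-insensitive key: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", headword)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "Étude", "apple", "étage", "Banana"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['apple', 'Banana', 'étage', 'Étude', 'zebra']
```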
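I don't entirely agree with this. In general an AI is better at "speaking" JSON than XML, but feeding content to an AI is probably a different story. Once a JSON structure gets complex you quickly run into }}]}]. In XML or HTML, marking a few words inside a run of text as bold, italic, or a link is seamless and simple; if you insist on JSON you have to wrap another layer of }] plus a pile of extra punctuation. XML tags are closed, and the tags themselves carry strong semantic information. Take this Longman example: the phrasal verb take up alone, inside the entry for take, is this long. When you reach </PhrVbEntry> you know this phrasal verb is done, and what follows is either the next <PhrVbEntry> or the </Entry> that ends the whole entry. JSON has none of that, and it doesn't even support comments.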
<PhrVbEntry id="u2fc098491a42200a.6e2b450a.1150446158e.28f2">
<Head>
<PHRVBHWD>take up</PHRVBHWD>
<POS>phr v</POS>
</Head>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28f7">
<LEXUNIT>take sth ↔ up</LEXUNIT>
<DEF>to become interested in a new activity and to spend time doing it</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28fb">Roger took painting up for a while, but soon lost interest.</EXAMPLE>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28fc">
<LEXUNIT>take sth up</LEXUNIT>
<DEF>to start a new job or have a new responsibility</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28ff">Peter will take up the management of the finance department.</EXAMPLE>
<ColloExa>
<COLLO>take up a post/a position/duties etc</COLLO>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2902">The headteacher takes up her duties in August.</EXAMPLE>
</ColloExa>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2903">
<LEXUNIT>take sth ↔ up</LEXUNIT>
<DEF>if you take up a suggestion, problem, complaint etc, you start to do something about it</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2907">Now the papers have taken up the story.</EXAMPLE>
<GramExa>
<PROPFORMPREP>with</PROPFORMPREP>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290a">The hospital manager has promised to <COLLOINEXA>take the matter up</COLLOINEXA> with the member of staff involved.</EXAMPLE>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290c">I am still very angry and will be taking it up with the authorities.</EXAMPLE>
</GramExa>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.290d">
<LEXUNIT>take up sth</LEXUNIT>
<DEF>to fill a particular amount of time or space</DEF>
<ExpandableInformation>
<GramExa>
<PROPFORM>be taken up with sth</PROPFORM>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2912">The little <COLLOINEXA>time</COLLOINEXA> I had outside of school was <COLLOINEXA>taken up</COLLOINEXA> with work.</EXAMPLE>
</GramExa>
<ColloExa>
<COLLO>take up space/room</COLLO>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2917">old books that were taking up space in the office</EXAMPLE>
</ColloExa>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2918">
<LEXUNIT>take sth ↔ up</LEXUNIT>
<DEF>to accept a suggestion, offer, or idea</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291b">Rob <COLLOINEXA>took up the invitation</COLLOINEXA> to visit.</EXAMPLE>
<ColloExa>
<COLLO>take up the challenge/gauntlet</COLLO>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291f">Rick took up the challenge and cycled the 250-mile route alone.</EXAMPLE>
</ColloExa>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2920">
<LEXUNIT>take up sth</LEXUNIT>
<DEF>to move to the exact place where you should be, so that you are ready to do something</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2923">The runners are <COLLOINEXA>taking up</COLLOINEXA> their <COLLOINEXA>positions</COLLOINEXA> on the starting line.</EXAMPLE>
</ExpandableInformation>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2926">
<LEXUNIT>take sth ↔ up</LEXUNIT>
<DEF>to make a piece of clothing shorter</DEF>
<OPP>let down</OPP>
</Sense>
<Sense id="u2fc098491a42200a.6e2b450a.1150446158e.292b">
<LEXUNIT>take sth ↔ up</LEXUNIT>
<DEF>to continue a story or activity that you or someone else had begun, after a short break</DEF>
<ExpandableInformation>
<EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.292f">I’ll take up the story where you left off.</EXAMPLE>
</ExpandableInformation>
</Sense>
</PhrVbEntry>
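LLM function-calling input and output are all JSON. XML is too complex; even you yourself may not be able to extract correct JSON from it. XDXF has already proven the failure of XML, so there is no need to try again.
Newbie question: what format do Japanese EPWING dictionaries use?
These two structures are two extremes.
If you're just calling an API, with no complex and variable nesting, a fixed structure, unordered keys, and no rich text, there is really no need for XML. Take the demo upthread: two keys sent to the backend, a single [] coming back. Nothing is simpler than JSON for that.
JSON suits regular data, and its structure is strictly hierarchical, but for complex documents meant for humans to read it gets very hard; XML and HTML were born for this. JSON keys are unordered, so you start with a [] and then inside the for loop it's an if followed by several else-ifs before descending a level. Once rich text shows up, JSON is a disaster: do you really have to wrap yet another layer of [{}, {}……] and then if again inside the for? Some of this dictionary's fixed structures have no fixed level or position: they may be inside a sense, inside a particular usage within a sense, outside the senses...
On a web page these are trivial to do with CSS: I waved, but he didn’t take any notice (=pretended not to notice). British English. In JSON you have to nest again, and this is just one example sentence.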
<EXAMPLE id="u2f……">I waved, but he <COLLOINEXA>didn’t take any notice</COLLOINEXA><GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>
Turning it into JSON isn't impossible, but the cost... That said, there are ready-made libraries for exactly this.
{
"type": "EXAMPLE",
"content": [
{
"type": "text",
"text": "I waved, but he "
},
{
"type": "COLLOINEXA",
"text": "didn’t take any notice"
},
{
"type": "GLOSS",
"text": "pretended not to notice"
},
{
"type": "text",
"text": "."
},
{
"type": "GEO",
"text": "BrE"
}
]
}
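The XML-to-node-list conversion shown above can be done mechanically. Here is a minimal sketch with Python's standard xml.etree.ElementTree; element_to_nodes is a hypothetical helper, the id value is a placeholder, and the sketch only handles one level of inline children, which is all this example sentence needs.

```python
import json
import xml.etree.ElementTree as ET


def element_to_nodes(elem):
    """Flatten one mixed-content element into {"type", "content": [...]} form."""
    nodes = []
    if elem.text:
        nodes.append({"type": "text", "text": elem.text})
    for child in elem:
        # Inline child: keep its tag as the node type and join any nested text.
        nodes.append({"type": child.tag, "text": "".join(child.itertext())})
        if child.tail:
            # Text between this child and the next one belongs to the parent.
            nodes.append({"type": "text", "text": child.tail})
    return {"type": elem.tag, "content": nodes}


xml_src = (
    '<EXAMPLE id="x">I waved, but he <COLLOINEXA>didn’t take any notice</COLLOINEXA>'
    "<GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>"
)
doc = element_to_nodes(ET.fromstring(xml_src))
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Note that attributes such as id are dropped here; keeping them means yet another key per node, which is part of the cost being argued about.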
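The content is similar to XML or HTML but is stored in binary form. Probably because of hardware limits at the time, it uses "control codes" in place of things like <b> and </b>. There are also bitmap glyphs and jumps to absolute addresses, which all feel very dated.
I wrote a bit of CSS for this and it went smoothly. Strongly structured documents have to rely on XML or HTML; JSON only makes things more complex. The tags themselves carry strong semantic information: as soon as you see a </…> you know which section has ended, which is much clearer than a sea of div and span, and far better than a pile of brackets.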
<sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.2641">
<sign-post>action</sign-post>
<grammar-label>T</grammar-label>
<sense-definition>used with a noun instead of using a verb to describe an action. For example, if you take a walk, you walk somewhere</sense-definition>
<expandable-information>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2646">Would you like to take a look?</example-sentence>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2647">Mike’s just taking a shower.</example-sentence>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2648">Sara took a deep breath.</example-sentence>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2649">I waved, but he <collocation-highlight>didn’t take any notice</collocation-highlight><context-gloss>pretended not to notice</context-gloss>.<GEO>BrE</GEO></example-sentence>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.264d">Please <collocation-highlight>take a seat</collocation-highlight><context-gloss>sit down</context-gloss>.</example-sentence>
<ColloExa>
<collocation-pattern>take a picture/photograph/photo</collocation-pattern>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2652">Would you mind taking a photo of us together?</example-sentence>
</ColloExa>
</expandable-information>
</sense-data>
<sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.266a">
<sign-post>remove</sign-post>
<grammar-label>T</grammar-label>
<sense-definition>to remove something from a place</sense-definition>
<Thesref>
<Crossrefto refid="u2fc098491a42200a.262cc60a.117ee7c0f66.-1683">
<REFHWD>steal</REFHWD>
<REFHOMNUM>1</REFHOMNUM>
</Crossrefto>
</Thesref>
<expandable-information>
<grammar-example>
<pattern-form>take sth off/from etc sth</pattern-form>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2670">Take your feet off the seats.</example-sentence>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2671">Someone’s taken a pen from my desk.</example-sentence>
</grammar-example>
<example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2672">Police say money and jewellery were taken in the raid.</example-sentence>
<Crossref>
<Crossrefto refid="u2fc098491a42200a.6e2b450a.1150446158e.27e5">
<REFHWD>take </REFHWD>
</Crossrefto>
</Crossref>
</expandable-information>
</sense-data>
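If the content is standardized, the CSS becomes simple, ordinary users get more room for customization, and extracting information by section is easy too: examples, collocations, the examples under a particular collocation, phrasal verbs, the examples of one phrasal verb of this word... all straightforward.
The entry pages on Merriam-Webster's website are generated from a JSON data source. In terms of entry structure, Merriam-Webster is probably among the more complex English dictionaries, so I don't think JSON has any serious problem. Rich text can follow the custom markup tokens Merriam-Webster uses, such as {it}{/it}.
Android 17 will officially support app-level function calling, with JSON for both input and output. I expect every dictionary app will need to implement it, and by then MDX's problems will be magnified even further.
It can't be JSON. Under the hood, most structurally complex dictionaries are XML or something like it. Any XML can be converted to JSON, but how to convert has to be agreed on. This one did get converted to JSON, but it is a pain to process: heterogeneous arrays muddle the logic that JSON is supposed to keep clear and make parsing harder, and extensibility is poor. Converting to nested arrays of objects would be clearer but adds more levels. And when text is a string with {wi} thrown in, isn't that still just an XML lookalike{\/wi}? How much more complicated does a simple need like a link on some text become here? (The JSON and XML below both come from https://dictionaryapi.com/products/json, reformatted and then wrapped in one extra layer.)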
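As a sketch of that kind of section-based extraction, here is a version using Python's standard xml.etree.ElementTree on a shortened copy of the sense above. The ids are placeholders, and plain_text plus the choice to skip context-gloss when flattening are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<sense-data id="s1">
  <sign-post>action</sign-post>
  <sense-definition>used with a noun instead of using a verb to describe an action</sense-definition>
  <expandable-information>
    <example-sentence id="e1">Would you like to take a look?</example-sentence>
    <example-sentence id="e2">Please <collocation-highlight>take a seat</collocation-highlight><context-gloss>sit down</context-gloss>.</example-sentence>
    <ColloExa>
      <collocation-pattern>take a picture/photograph/photo</collocation-pattern>
      <example-sentence id="e3">Would you mind taking a photo of us together?</example-sentence>
    </ColloExa>
  </expandable-information>
</sense-data>
"""


def plain_text(elem, skip=("context-gloss",)):
    """Join an element's text, dropping inline children whose tag is in skip."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(plain_text(child, skip))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)

# Every example sentence in the sense, in document order.
all_examples = [plain_text(e) for e in root.iter("example-sentence")]

# Examples grouped under the collocation pattern they illustrate.
by_collocation = {
    c.findtext("collocation-pattern"): [plain_text(e) for e in c.findall("example-sentence")]
    for c in root.iter("ColloExa")
}

print(all_examples)
print(by_collocation)
```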
[
[
"sense",
{
"sn": "1 a",
"sgram": "T\/I",
"dt": [
[
"text",
"{bc}to shut down and restart (a computer or program) "
],
[
"vis",
[
{
"t": "\u2026 the annoyance of having to {wi}reboot{\/wi} the computer to switch operating systems \u2026",
"aq": {
"auth": "Robert Weston"
}
},
{
"t": "If anything ever happens to the original drive, you can {wi}reboot{\/wi} using the cloned drive and be up and running in minutes.",
"aq": {
"auth": "Dan Frakes"
}
}
]
]
]
}
],
[
"sense",
{
"sn": "b",
"sgram": "I",
"dt": [
[
"text",
"{bc}to start up again after closing or shutting down {bc}to boot up again "
],
[
"vis",
[
{
"t": "waiting for a computer\/program to {wi}reboot{\/wi}"
}
]
]
]
}
]
]
<sseq>
<sense>
<sn>1 a</sn>
<sgram>T/I</sgram>
<dt>{bc}to shut down and restart (a computer or program)
<vis>
<vi>
<t>… the annoyance of having to {wi}reboot{/wi} the computer to switch operating systems … </t>
<aq>
<auth>Robert Weston</auth>
</aq>
</vi>
<vi>
<t>If anything ever happens to the original drive, you can {wi}reboot{/wi} using the cloned drive and be up and running in minutes. </t>
<aq>
<auth>Dan Frakes</auth>
</aq>
</vi>
</vis>
</dt>
</sense>
<sense>
<sn>b</sn>
<sgram>I</sgram>
<dt>{bc}to start up again after closing or shutting down {bc}to boot up again
<vis>
<vi>
<t>waiting for a computer/program to {wi}reboot{/wi}</t>
</vi>
</vis>
</dt>
</sense>
</sseq>
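Something as structurally complex as a dictionary should have been XML in the first place. Once converted to JSON, parsing it is no different from hand-writing an XML parser unless you give up some structure or information, and it isn't clear either, this one least of all. Cryptic names like sn, dt, vis; compromises like {bc} and {wi}; heterogeneous arrays that are just nasty to deal with; and when it finally uses an object, it wraps yet another layer. Nobody wants to untangle a mess like this.
Taken from 有d词d; I tidied up part of it. The structure is clear, but the content...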
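For reference, this is roughly what consuming those [tag, payload] pairs looks like. The sketch below walks a trimmed copy of the sample from this thread; walk_senses is a hypothetical helper that only knows the "sense", "text" and "vis" tags seen here and silently ignores everything else.

```python
import json

RAW = r'''
[
  ["sense", {"sn": "1 a", "sgram": "T\/I", "dt": [
      ["text", "{bc}to shut down and restart (a computer or program) "],
      ["vis", [{"t": "waiting for a computer\/program to {wi}reboot{\/wi}"}]]
  ]}]
]
'''


def walk_senses(items):
    """Yield one flat dict per ["sense", {...}] pair, dispatching on the string tags."""
    for tag, payload in items:
        if tag != "sense":
            continue  # other element types are skipped in this sketch
        definition, examples = [], []
        for kind, value in payload.get("dt", []):
            if kind == "text":
                definition.append(value)
            elif kind == "vis":
                examples.extend(v["t"] for v in value)
        yield {
            "sn": payload.get("sn"),
            "definition": "".join(definition),
            "examples": examples,
        }


senses = list(walk_senses(json.loads(RAW)))
print(senses)
```

Even after this pass, the definition and example strings still carry {bc} and {wi} tokens that need a second parser.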
{
"phonuk": "teɪk",
"phonus": "teɪk",
"defn": [
[
"v.",
"携带,拿走;带去,引领;使达到,提升;拿,取;移走,拿开;偷走,误拿;取材于,收集;攻占,控制;选中,买下;订阅(报纸等);吃,服用;减去;记录,摘录;照相,摄影;量取,测定;就(座);以…...为例;接受,收取;接纳,接待(顾客、患者等);遭受,经受;忍受,容忍;(以某种方式)对待,处理;理解,考虑;误以为;赢得(比赛、竞赛等);产生(感情),持有(看法);采取(措施),采用(方法);做,拥有;采用(形式),就任(职位);花费,占用(时间); 需要,要求;使用;穿(特定尺码的鞋或衣物);容纳;授课;学习,选修(课程);参加(考试或测验);走(路线),乘坐(交通工具);跨过,跳过;踢,掷;举行投票,进行民意调查;成功,奏效;(语法)需带有(某种结构)"
],
[
"n.",
"(一次拍摄的)镜头,场景;收入量;看法,态度;(印刷)一次排版量"
]
],
"form": [
{
"fn": "复数",
"en": "takes"
},
{
"fn": "第三人称单数",
"en": "takes"
},
{
"fn": "现在分词",
"en": "taking"
},
{
"fn": "过去式",
"en": "took"
},
{
"fn": "过去分词",
"en": "taken"
}
],
"phr": [
{
"en": "take care of oneself",
"zh": "照顾自己;颐养"
},
{
"en": "take part",
"zh": "参与, 参加"
},
{
"en": "take part in",
"zh": "参加,参与"
},
{
"en": "take on",
"zh": "承担;呈现;具有;流行;接纳;雇用;穿上"
},
{
"en": "take up",
"zh": "拿起;开始从事"
},
{
"en": "take effect",
"zh": "生效;起作用"
},
{
"en": "take off",
"zh": "起飞;脱下;离开"
},
{
"en": "take a look",
"zh": "看一下"
},
{
"en": "take out",
"zh": "v. 取出;去掉;出发;抵充"
},
{
"en": "take into",
"zh": "考虑到;说服"
},
{
"en": "take in",
"zh": "接受;理解;拘留;欺骗;让…进入;改短"
},
{
"en": "take seriously",
"zh": "重视;认真对待…"
},
{
"en": "take away",
"zh": "带走,拿走,取走"
},
{
"en": "take a look at",
"zh": "[口]看一看;检查"
},
{
"en": "take over",
"zh": "接管;接收"
},
{
"en": "take for granted",
"zh": "认为…理所当然"
},
{
"en": "take the lead",
"zh": "v. 带头;为首"
},
{
"en": "take charge of",
"zh": "接管,负责"
},
{
"en": "take good care",
"zh": "好好照顾;珍重"
}
],
"syn": [
{
"pos": "vt.",
"ws": [
"carry",
"adopt",
"have",
"eat",
"assume"
],
"zh": "拿,取;采取;吃;接受"
},
{
"pos": "vi.",
"ws": [
"pick up",
"get access to"
],
"zh": "拿;获得"
}
],
"der": {
"root": "take",
"rels": [
{
"pos": "adj.",
"words": [
{
"word": "taking",
"tran": "可爱的;迷人的;会传染的"
}
]
},
{
"pos": "n.",
"words": [
{
"word": "taking",
"tran": "取得;捕获;营业收入"
},
{
"word": "taker",
"tran": "接受者;接受打赌的人;捕获者"
}
]
},
{
"pos": "v.",
"words": [
{
"word": "taking",
"tran": "拿;捕捉;夺取(take的ing形式)"
}
]
}
]
},
"sen": [
{
"en": "I'll take my coat upstairs. Shall I take yours, Roberta?",
"zh": "我将把我的外套拿到楼上去。要我把你的拿上去吗,罗伯塔?"
},
{
"en": "She can't take criticism.",
"zh": "她受不了批评。"
},
{
"en": "We take the 'Express'.",
"zh": "我们订阅的是《快报》。"
}
]
}
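The main problem with JSON is that the presentation and the data structure don't match and have to be handled separately, which greatly raises the demands on dictionary makers and the time required. Adoption will therefore be a big problem, the same problem Yomitan has.
I advocate JSON on the premise that it is the most universal and natural data-interchange format now and for the foreseeable future. XML's structure is more complex and more free-form than JSON's; reliably extracting structured JSON from XML is often very hard, and in the end all of that complexity lands on the software developers.
True, JSON will significantly raise the bar for dictionary makers and lengthen production time. But the commonly used Yomitan dictionaries number no more than ten anyway, and covering everyday needs is enough. There is no need for a new format to walk shoulder to shoulder with MDX into a bygone era.
Using JSON will certainly cost some expressiveness, but so will XML. You can never please the people who use HTML, and user habits are hard to change. Better to give up on them early, step out of the existing market, and reach new target users.
XML is the most universal and natural document interchange format now and for the foreseeable future, no question.
Word, dictionaries, web pages: all XML or HTML. Dictionaries need XML all the more. Anything that a fixed JSON schema can express easily will not be a top-quality dictionary, and may not even be a usable one.
If you insist on JSON, Merriam-Webster is a good example.
They insisted on JSON and the result is total chaos, hard even for a human to read. When it comes to text they gave up entirely and stuffed a homemade Markdown-like syntax into the strings. After n levels of loops you finally reach the "text", and once you have the string you still have to parse it as if it were XML. Isn't that taking the long way around?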
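To show what "handled separately" means in practice, here is a minimal sketch of a display layer for a JSON entry shaped like the one above. render_entry and the class names are made up for illustration; a real app would also need styling and the remaining keys.

```python
import html


def render_entry(entry: dict) -> str:
    """Turn a JSON entry (defn / phr keys as in the sample above) into an HTML fragment."""
    esc = html.escape
    parts = ['<div class="entry">']
    for pos, gloss in entry.get("defn", []):
        parts.append(
            f'<p class="defn"><span class="pos">{esc(pos)}</span> {esc(gloss)}</p>'
        )
    if entry.get("phr"):
        parts.append('<ul class="phr">')
        for phrase in entry["phr"]:
            parts.append(f'<li><b>{esc(phrase["en"])}</b> {esc(phrase["zh"])}</li>')
        parts.append("</ul>")
    parts.append("</div>")
    return "\n".join(parts)


entry = {
    "defn": [["v.", "携带,拿走"], ["n.", "镜头,场景"]],
    "phr": [{"en": "take up", "zh": "拿起;开始从事"}],
}
out = render_entry(entry)
print(out)
```

With HTML-based MDX the dictionary maker ships this markup directly; with JSON either the maker or the app has to write and maintain a renderer like this for every schema.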
"{bc}any of several domesticated {dx_def}see {dxt|domesticate:1||2}
{\/dx_def} or wild {d_link|gallinaceous|gallinaceous}
birds {dx}compare {dxt|guinea fowl||}, {dxt|jungle fowl||}{\/dx}"
"Middle English {it}foul{\/it}, from Old English {it}fugel{\/it};
akin to Old High German {it}fogal{\/it} bird, and probably to Old
English {it}fl\u0113ogan{\/it} to fly
{ma}{mat|fly|}{\/ma}"
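This is just a new "XML" designed in JSON clothing.
For intricate documents JSON has no answer, and if there is one it is a bottomless pit. It will inevitably evolve into a deformed product piled up from endless patches.
HTML dictionary formats are already obsolete. Only a move to JSON can activate the whole ecosystem of AI and learning apps, so that a dictionary is no longer an isolated app but a platform that all kinds of tools can call freely.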
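The strings quoted above do need their own tokenizer once the JSON layer is peeled off. Here is a minimal sketch based only on the token shapes visible in this thread ({name}, {/name}, and {name|arg|arg}); the regex and the tokenize function are assumptions for illustration, not Merriam-Webster's reference parser.

```python
import re

# {name}, {/name}, or {name|arg1|arg2|...}; arguments may be empty.
TOKEN = re.compile(r"\{(/?)(\w+)((?:\|[^{}|]*)*)\}")


def tokenize(s: str):
    """Split a marked-up string into (kind, value, args) tuples."""
    out, pos = [], 0
    for m in TOKEN.finditer(s):
        if m.start() > pos:
            out.append(("text", s[pos:m.start()], []))
        closing, name, args = m.groups()
        if closing:
            out.append(("close", name, []))
        elif args:
            # Drop the leading "|" and keep empty fields in place.
            out.append(("inline", name, args[1:].split("|")))
        else:
            out.append(("open", name, []))
        pos = m.end()
    if pos < len(s):
        out.append(("text", s[pos:], []))
    return out


tokens = tokenize("{bc}birds {dx}compare {dxt|guinea fowl||}{/dx} {it}foul{/it}")
for tok in tokens:
    print(tok)
```

Standalone markers such as {bc} come out as "open" here with no matching close, so a real consumer would still need a table of which names are paired and which are not.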
