Brew new tea over a new fire: design a new dictionary format instead of sticking with mdx

Agreed. I think the new dictionary format should be based on json; with dictionaries, quality matters more than quantity anyway.

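Make epub the primary target and accommodate dictionary needs, much like kindle's kfx format; that is the best option. In practice you could just add an entries directory inside the epub's directory tree and put the entries there; nothing else needs to change.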

I'm only using epub as an example here; what you'd actually do is borrow epub's architecture and design your own document ast from scratch.

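The dictionary boom has already passed. Building yet another html-based dictionary app is a big stride back into the old days. From a product standpoint there's no need to build one; the existing mdx already does the job.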

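I think the biggest problem with html-based mdx dictionaries is that html is too flexible and free-form. Every dictionary maker uses different classes and html structures, so anything like parsing out individual senses has to be adapted dictionary by dictionary; there's no unified approach.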

The most outrageous things about mdict, the ones that have to be fixed:

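  1. Text isn't normalized. That goes for headwords, and it's even worse for file (and folder) names: anything that passes through macOS comes out NFD-normalized, and packing those in causes endless trouble
    MDict: café (NFC, one code point for é) and café (NFD, e plus a combining accent) are different entries, so the café in an mdx v3 file can't be found
  2. There's no clear convention for the names of the dictionary file, the resource files, the accompanying css and js, the cover image and so on, let alone for their priority
  3. The @@@LINK= logic is a complete mess. Should it share history or show the entry again? Does a link containing # name a headword, or a headword plus an id to jump to? Going by the original author's notes there should be exactly one target entry, but in many dictionaries there's more than one
  4. A trap laid by later developers: the explicit instruction "use the form <a href="sound……, only …… is supported" gets treated as decoration, so these cases need extra adaptation. The dictionary app's own listener can also clash with the one inside the dictionary
  5. A huge trap laid by later developers: early mdx wasn't open source and the file format was never officially published, so compatibility built on reverse engineering isn't reliable. For example, in mdx v2 you can simply cut off the fixed \r\n terminator after an entry's content, but some files only have a \n terminator, and cutting blindly can take the > off the html. Some mdx files even invent things out of thin air, such as big5 encoding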
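
For point 1, a minimal sketch of what normalizing headwords could look like, using Python's standard `unicodedata` module. The `normalize_key` and `build_index` names and the `(headword, body)` pair layout are made up for illustration; nothing here reflects how any existing MDict reader works.

```python
import unicodedata

def normalize_key(headword: str) -> str:
    """Return the NFC form so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", headword)

def build_index(entries):
    """Map normalized headwords to entry bodies.

    `entries` is a hypothetical iterable of (headword, body) pairs.
    """
    index = {}
    for headword, body in entries:
        index.setdefault(normalize_key(headword), []).append(body)
    return index

nfc = "caf\u00e9"    # é as a single code point
nfd = "cafe\u0301"   # e followed by a combining acute accent
index = build_index([(nfd, "a small restaurant")])

print(nfc == nfd)                      # False: the raw strings differ
print(index.get(normalize_key(nfc)))   # ['a small restaurant']
```

Normalizing both at build time and at lookup time is the point; doing only one of the two leaves the same mismatch in place.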

After that, features and content structure to think about

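For example, mdx only supports plain English sorting and case-insensitivity, which is useless for other languages. mdx v3 does have something new, but I'm afraid that has even more pitfalls. There's no "secondary index" (from the user's point of view); @@@LINK may have helped a little, but it has also made dictionary content messier...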

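I don't entirely agree with this. In general an AI is better at "speaking" JSON than XML, but feeding it to an AI is probably a different story. Once a JSON structure gets complex you easily run into }}]}]. In xml or html, bold, italics and links on a few words inside a passage of text are coherent and simple to express; insist on JSON and you have to wrap another layer of }] plus a handful of extra punctuation. XML tags are closed, and the tags themselves carry strong semantic information. Take this Longman example: the take up phrasal verb alone, inside the take entry, is this long. When you read </PhrVbEntry> you know this phrasal verb has ended, and what follows is either the next <PhrVbEntry> or </Entry>, the end of the whole entry. JSON has none of this, and it doesn't support comments either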

  <PhrVbEntry id="u2fc098491a42200a.6e2b450a.1150446158e.28f2">
    <Head>
      <PHRVBHWD>take up</PHRVBHWD>
      <POS>phr v</POS>
    </Head>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28f7">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to become interested in a new activity and to spend time doing it</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28fb">Roger took painting up for a while, but soon lost interest.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28fc">
      <LEXUNIT>take sth up</LEXUNIT>
      <DEF>to start a new job or have a new responsibility</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28ff">Peter will take up the management of the finance department.</EXAMPLE>
        <ColloExa>
          <COLLO>take up a post/a position/duties etc</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2902">The headteacher takes up her duties in August.</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2903">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>if you take up a suggestion, problem, complaint etc, you start to do something about it</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2907">Now the papers have taken up the story.</EXAMPLE>
        <GramExa>
          <PROPFORMPREP>with</PROPFORMPREP>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290a">The hospital manager has promised to <COLLOINEXA>take the matter up</COLLOINEXA> with the member of staff involved.</EXAMPLE>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290c">I am still very angry and will be taking it up with the authorities.</EXAMPLE>
        </GramExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.290d">
      <LEXUNIT>take up sth</LEXUNIT>
      <DEF>to fill a particular amount of time or space</DEF>
      <ExpandableInformation>
        <GramExa>
          <PROPFORM>be taken up with sth</PROPFORM>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2912">The little <COLLOINEXA>time</COLLOINEXA> I had outside of school was <COLLOINEXA>taken up</COLLOINEXA> with work.</EXAMPLE>
        </GramExa>
        <ColloExa>
          <COLLO>take up space/room</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2917">old books that were taking up space in the office</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2918">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to accept a suggestion, offer, or idea</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291b">Rob <COLLOINEXA>took up the invitation</COLLOINEXA> to visit.</EXAMPLE>
        <ColloExa>
          <COLLO>take up the challenge/gauntlet</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291f">Rick took up the challenge and cycled the 250-mile route alone.</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2920">
      <LEXUNIT>take up sth</LEXUNIT>
      <DEF>to move to the exact place where you should be, so that you are ready to do something</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2923">The runners are <COLLOINEXA>taking up</COLLOINEXA> their <COLLOINEXA>positions</COLLOINEXA> on the starting line.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2926">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to make a piece of clothing shorter</DEF>
      <OPP>let down</OPP>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.292b">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to continue a story or activity that you or someone else had begun, after a short break</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.292f">I’ll take up the story where you left off.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
  </PhrVbEntry>

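An llm's function-calling input and output are all json. xml is too complex; even you can't necessarily extract the correct json from it yourself. xdxf has already proven xml's failure, so there's no need to try again.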

Newbie question: what format are Japanese EPWING dictionaries in?

These two structures are at opposite extremes

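If you're just calling an API, with no complex and variable nesting, a fixed structure, unordered keys and no rich text, there's really no need for XML. Take the demo upthread: two keys go to the backend and a single [] comes back. Nothing is simpler than JSON for that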

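JSON suits regular data whose structure is strictly hierarchical, but it struggles with complex documents meant for people to read; XML and HTML were born for exactly that. JSON keys are unordered, so you start with a [], then inside a for loop an if followed by several elseifs, then down to the next level. Once you hit rich text, JSON is a disaster: do you really have to wrap another [{}, {}……] layer and put yet another if inside a for? Some of this dictionary's fixed structures have no fixed level or position: they may be inside a sense, inside one particular usage within a sense, outside the sense……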

On a web page this is trivial to achieve with css: I waved, but he didn’t take any notice (=pretended not to notice). British English. In JSON it takes nesting all over again, and this is only a single example sentence

<EXAMPLE id="u2f……">I waved, but he <COLLOINEXA>didn’t take any notice</COLLOINEXA><GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>

Turning it into JSON isn't impossible, but the cost... That said, there are ready-made libraries for exactly this

{
    "type": "EXAMPLE",
    "content": [
        {
            "type": "text",
            "text": "I waved, but he "
        },
        {
            "type": "COLLOINEXA",
            "text": "didn’t take any notice"
        },
        {
            "type": "GLOSS",
            "text": "pretended not to notice"
        },
        {
            "type": "text",
            "text": "."
        },
        {
            "type": "GEO",
            "text": "BrE"
        }
    ]
}
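
To make the cost concrete, here is a rough sketch of that conversion with Python's standard `xml.etree.ElementTree`. It flattens one level of mixed content into the same `type`/`text` node shape as the JSON above. The `to_nodes` helper is hypothetical, the `id` attribute is dropped, and deeper nesting would need recursion that this sketch leaves out.

```python
import json
import xml.etree.ElementTree as ET

def to_nodes(element):
    """Flatten one level of mixed content into an ordered list of typed nodes."""
    nodes = []
    if element.text:
        nodes.append({"type": "text", "text": element.text})
    for child in element:
        # Collapse whatever is inside the child into a single string.
        nodes.append({"type": child.tag, "text": "".join(child.itertext())})
        # Text that follows a child lives in its .tail, not in the parent.
        if child.tail:
            nodes.append({"type": "text", "text": child.tail})
    return {"type": element.tag, "content": nodes}

xml_source = (
    "<EXAMPLE>I waved, but he <COLLOINEXA>didn’t take any notice</COLLOINEXA>"
    "<GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>"
)
example = to_nodes(ET.fromstring(xml_source))
print(json.dumps(example, ensure_ascii=False, indent=2))
```

The `.text` / `.tail` bookkeeping is exactly the part that plain key-value JSON has no native slot for, which is why every text run ends up as its own object.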

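The content resembles XML or HTML but is stored in binary form. Probably constrained by the hardware of the time, it uses "control codes" in place of things like <b></b>. There are also bitmap glyphs and jumps to absolute addresses, which feel very dated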

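I wrote a little css for this and it went smoothly. Strongly structured documents need XML or HTML; JSON only makes things more complex. The tags themselves carry strong semantic information: when you see a </> you know which section has ended. That's much clearer than a sea of div and span, and far better than a pile of brackets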

    <sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.2641">
      <sign-post>action</sign-post>
      <grammar-label>T</grammar-label>
      <sense-definition>used with a noun instead of using a verb to describe an action. For example, if you take a walk, you walk somewhere</sense-definition>
      <expandable-information>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2646">Would you like to take a look?</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2647">Mike’s just taking a shower.</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2648">Sara took a deep breath.</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2649">I waved, but he <collocation-highlight>didn’t take any notice</collocation-highlight><context-gloss>pretended not to notice</context-gloss>.<GEO>BrE</GEO></example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.264d">Please <collocation-highlight>take a seat</collocation-highlight><context-gloss>sit down</context-gloss>.</example-sentence>
        <ColloExa>
          <collocation-pattern>take a picture/photograph/photo</collocation-pattern>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2652">Would you mind taking a photo of us together?</example-sentence>
        </ColloExa>
      </expandable-information>
    </sense-data>
    <sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.266a">
      <sign-post>remove</sign-post>
      <grammar-label>T</grammar-label>
      <sense-definition>to remove something from a place</sense-definition>
      <Thesref>
        <Crossrefto refid="u2fc098491a42200a.262cc60a.117ee7c0f66.-1683">
          <REFHWD>steal</REFHWD>
          <REFHOMNUM>1</REFHOMNUM>
        </Crossrefto>
      </Thesref>
      <expandable-information>
        <grammar-example>
          <pattern-form>take sth off/from etc sth</pattern-form>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2670">Take your feet off the seats.</example-sentence>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2671">Someone’s taken a pen from my desk.</example-sentence>
        </grammar-example>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2672">Police say money and jewellery were taken in the raid.</example-sentence>
        <Crossref>
          <Crossrefto refid="u2fc098491a42200a.6e2b450a.1150446158e.27e5">
            <REFHWD>take </REFHWD>
          </Crossrefto>
        </Crossref>
      </expandable-information>
    </sense-data>

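If the content is standardized, the CSS is simple, ordinary users get more room to customize, and extracting information by section is easy too: examples, collocations, the examples under a particular collocation, phrasal verbs, the examples for one of a word's phrasal verbs... all straightforward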
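
A small sketch of that kind of per-section extraction, using Python's standard `xml.etree.ElementTree` on a trimmed copy of one `<Sense>` from the take up entry earlier in the thread (ids removed to keep it short). The `text_of` helper is just an illustrative name.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<PhrVbEntry>
  <Head><PHRVBHWD>take up</PHRVBHWD><POS>phr v</POS></Head>
  <Sense>
    <LEXUNIT>take up sth</LEXUNIT>
    <DEF>to fill a particular amount of time or space</DEF>
    <ExpandableInformation>
      <GramExa>
        <PROPFORM>be taken up with sth</PROPFORM>
        <EXAMPLE>The little <COLLOINEXA>time</COLLOINEXA> I had outside of school was <COLLOINEXA>taken up</COLLOINEXA> with work.</EXAMPLE>
      </GramExa>
      <ColloExa>
        <COLLO>take up space/room</COLLO>
        <EXAMPLE>old books that were taking up space in the office</EXAMPLE>
      </ColloExa>
    </ExpandableInformation>
  </Sense>
</PhrVbEntry>
"""

def text_of(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())

root = ET.fromstring(ENTRY.strip())

# Every example sentence, wherever it sits in the tree.
all_examples = [text_of(e) for e in root.iter("EXAMPLE")]

# Only the examples that belong to a particular collocation.
collocations = {
    text_of(block.find("COLLO")): [text_of(e) for e in block.findall("EXAMPLE")]
    for block in root.iter("ColloExa")
}

definitions = [text_of(d) for d in root.iter("DEF")]
print(all_examples, collocations, definitions, sep="\n")
```

Each query is one line because the tag names say what the content is; the same code would have to be rewritten per dictionary if the markup were anonymous div and span.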

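The entry pages on the Merriam-Webster website are generated from a json data source. In terms of entry structure, Merriam-Webster is probably among the more complex English dictionaries, and I don't see json having any serious problem. For rich text you can follow the custom markup tokens the Merriam-Webster site uses, such as {it}{/it}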

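android 17 will officially support app-level function calling, with json for both input and output. I expect every dictionary app will need to implement this, and by then mdx's problems will be magnified even further.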

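It can't be JSON. Most structurally complex dictionaries are XML or something like it underneath. XML can always be converted to JSON, but how to convert has to be agreed on. This one has been converted to JSON, yet it's a pain to process: heterogeneous arrays muddle logic that JSON should make clear, make parsing more work and extend poorly. Converting to nested arrays of objects would be clearer but adds more levels. And when text is a string with {wi} in it, isn't that just another XML lookalike{\/wi}? How much more complicated does a simple need like a link on a piece of text become here? (The json and xml below both come from https://dictionaryapi.com/products/json, formatted and then wrapped in one more layer)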

[
  [
    "sense",
    {
      "sn": "1 a",
      "sgram": "T\/I",
      "dt": [
        [
          "text",
          "{bc}to shut down and restart (a computer or program) "
        ],
        [
          "vis",
          [
            {
              "t": "\u2026 the annoyance of having to {wi}reboot{\/wi} the computer to switch operating systems \u2026",
              "aq": {
                "auth": "Robert Weston"
              }
            },
            {
              "t": "If anything ever happens to the original drive, you can {wi}reboot{\/wi} using the cloned drive and be up and running in minutes.",
              "aq": {
                "auth": "Dan Frakes"
              }
            }
          ]
        ]
      ]
    }
  ],
  [
    "sense",
    {
      "sn": "b",
      "sgram": "I",
      "dt": [
        [
          "text",
          "{bc}to start up again after closing or shutting down {bc}to boot up again "
        ],
        [
          "vis",
          [
            {
              "t": "waiting for a computer\/program to {wi}reboot{\/wi}"
            }
          ]
        ]
      ]
    }
  ]
]
<sseq>
  <sense>
    <sn>1 a</sn>
    <sgram>T/I</sgram>
    <dt>{bc}to shut down and restart (a computer or program)
      <vis>
        <vi>
          <t>… the annoyance of having to {wi}reboot{/wi} the computer to switch operating systems … </t>
          <aq>
            <auth>Robert Weston</auth>
          </aq>
        </vi>
        <vi>
          <t>If anything ever happens to the original drive, you can {wi}reboot{/wi} using the cloned drive and be up and running in minutes. </t>
          <aq>
            <auth>Dan Frakes</auth>
          </aq>
        </vi>
      </vis>
    </dt>
  </sense>
  <sense>
    <sn>b</sn>
    <sgram>I</sgram>
    <dt>{bc}to start up again after closing or shutting down {bc}to boot up again
      <vis>
        <vi>
          <t>waiting for a computer/program to {wi}reboot{/wi}</t>
        </vi>
      </vis>
    </dt>
  </sense>
</sseq>

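Something as structurally complex as a dictionary should have been XML in the first place. Once it's converted to JSON, parsing it is no different from hand-writing an XML parser, unless you throw away some structure or information, and it isn't clear either; this one even less so. Cryptic names like sn, dt and vis, compromises like {bc} and {wi}, and then the heterogeneous arrays, which are just nasty; when it finally does use objects it wraps them in yet another layer. Nobody wants to untangle a mess like this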
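
For a sense of what walking those heterogeneous arrays takes, here is a rough Python sketch over a cut-down copy of the JSON above (the first sense keeps only its definition text). `iter_senses` is a made-up helper, not part of any official client, and it only knows the `sense`, `text` and `vis` kinds that appear in this thread.

```python
import json

SSEQ = json.loads("""
[
  ["sense", {
    "sn": "1 a",
    "dt": [
      ["text", "{bc}to shut down and restart (a computer or program) "]
    ]
  }],
  ["sense", {
    "sn": "b",
    "dt": [
      ["text", "{bc}to start up again after closing or shutting down {bc}to boot up again "],
      ["vis", [{"t": "waiting for a computer/program to {wi}reboot{/wi}"}]]
    ]
  }]
]
""")

def iter_senses(sseq):
    """Yield (sense number, definition, examples) for each ["sense", {...}] pair."""
    for kind, payload in sseq:
        if kind != "sense":
            continue  # other kinds are skipped in this sketch
        definition, examples = [], []
        # Each "dt" item is itself a [kind, value] pair whose value type varies.
        for dt_kind, dt_value in payload.get("dt", []):
            if dt_kind == "text":
                definition.append(dt_value)
            elif dt_kind == "vis":
                examples.extend(item["t"] for item in dt_value)
        yield payload.get("sn"), "".join(definition).strip(), examples

senses = list(iter_senses(SSEQ))
for number, definition, examples in senses:
    print(number, definition, examples)
```

Every level needs its own dispatch on the first array element, and the strings that come out still carry {bc} and {wi} markup that has to be parsed separately.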

Pulled from Youdao Dictionary, with part of it tidied up. The structure is clear, but the content...

{
  "phonuk": "teɪk",
  "phonus": "teɪk",
  "defn": [
    [
      "v.",
      "携带,拿走;带去,引领;使达到,提升;拿,取;移走,拿开;偷走,误拿;取材于,收集;攻占,控制;选中,买下;订阅(报纸等);吃,服用;减去;记录,摘录;照相,摄影;量取,测定;就(座);以…...为例;接受,收取;接纳,接待(顾客、患者等);遭受,经受;忍受,容忍;(以某种方式)对待,处理;理解,考虑;误以为;赢得(比赛、竞赛等);产生(感情),持有(看法);采取(措施),采用(方法);做,拥有;采用(形式),就任(职位);花费,占用(时间); 需要,要求;使用;穿(特定尺码的鞋或衣物);容纳;授课;学习,选修(课程);参加(考试或测验);走(路线),乘坐(交通工具);跨过,跳过;踢,掷;举行投票,进行民意调查;成功,奏效;(语法)需带有(某种结构)"
    ],
    [
      "n.",
      "(一次拍摄的)镜头,场景;收入量;看法,态度;(印刷)一次排版量"
    ]
  ],
  "form": [
    {
      "fn": "复数",
      "en": "takes"
    },
    {
      "fn": "第三人称单数",
      "en": "takes"
    },
    {
      "fn": "现在分词",
      "en": "taking"
    },
    {
      "fn": "过去式",
      "en": "took"
    },
    {
      "fn": "过去分词",
      "en": "taken"
    }
  ],
  "phr": [
    {
      "en": "take care of oneself",
      "zh": "照顾自己;颐养"
    },
    {
      "en": "take part",
      "zh": "参与, 参加"
    },
    {
      "en": "take part in",
      "zh": "参加,参与"
    },
    {
      "en": "take on",
      "zh": "承担;呈现;具有;流行;接纳;雇用;穿上"
    },
    {
      "en": "take up",
      "zh": "拿起;开始从事"
    },
    {
      "en": "take effect",
      "zh": "生效;起作用"
    },
    {
      "en": "take off",
      "zh": "起飞;脱下;离开"
    },
    {
      "en": "take a look",
      "zh": "看一下"
    },
    {
      "en": "take out",
      "zh": "v. 取出;去掉;出发;抵充"
    },
    {
      "en": "take into",
      "zh": "考虑到;说服"
    },
    {
      "en": "take in",
      "zh": "接受;理解;拘留;欺骗;让…进入;改短"
    },
    {
      "en": "take seriously",
      "zh": "重视;认真对待…"
    },
    {
      "en": "take away",
      "zh": "带走,拿走,取走"
    },
    {
      "en": "take a look at",
      "zh": "[口]看一看;检查"
    },
    {
      "en": "take over",
      "zh": "接管;接收"
    },
    {
      "en": "take for granted",
      "zh": "认为…理所当然"
    },
    {
      "en": "take the lead",
      "zh": "v. 带头;为首"
    },
    {
      "en": "take charge of",
      "zh": "接管,负责"
    },
    {
      "en": "take good care",
      "zh": "好好照顾;珍重"
    }
  ],
  "syn": [
    {
      "pos": "vt.",
      "ws": [
        "carry",
        "adopt",
        "have",
        "eat",
        "assume"
      ],
      "zh": "拿,取;采取;吃;接受"
    },
    {
      "pos": "vi.",
      "ws": [
        "pick up",
        "get access to"
      ],
      "zh": "拿;获得"
    }
  ],
  "der": {
    "root": "take",
    "rels": [
      {
        "pos": "adj.",
        "words": [
          {
            "word": "taking",
            "tran": "可爱的;迷人的;会传染的"
          }
        ]
      },
      {
        "pos": "n.",
        "words": [
          {
            "word": "taking",
            "tran": "取得;捕获;营业收入"
          },
          {
            "word": "taker",
            "tran": "接受者;接受打赌的人;捕获者"
          }
        ]
      },
      {
        "pos": "v.",
        "words": [
          {
            "word": "taking",
            "tran": "拿;捕捉;夺取(take的ing形式)"
          }
        ]
      }
    ]
  },
  "sen": [
    {
      "en": "I'll take my coat upstairs. Shall I take yours, Roberta?",
      "zh": "我将把我的外套拿到楼上去。要我把你的拿上去吗,罗伯塔?"
    },
    {
      "en": "She can't take criticism.",
      "zh": "她受不了批评。"
    },
    {
      "en": "We take the 'Express'.",
      "zh": "我们订阅的是《快报》。"
    }
  ]
}

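The main problem with json is that the display layer and the data structure don't match, so display has to be handled separately. That greatly raises the demands on dictionary makers and the time required, so adoption will be a big problem, the same problem yomitan has.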
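
A rough illustration of that separate display step, in Python: somebody has to write a renderer like the one below for every JSON schema. The data is a small fragment of the take entry above; `render_entry` and the class names are invented for this sketch.

```python
from html import escape

# Fragment of the entry above; only two keys are kept for brevity.
entry = {
    "defn": [["n.", "(一次拍摄的)镜头,场景;收入量;看法,态度;(印刷)一次排版量"]],
    "phr": [
        {"en": "take part", "zh": "参与, 参加"},
        {"en": "take up", "zh": "拿起;开始从事"},
    ],
}

def render_entry(headword, entry):
    """Build an HTML fragment from the data; class names are hypothetical."""
    parts = [f'<h1 class="hw">{escape(headword)}</h1>']
    for pos, gloss in entry.get("defn", []):
        parts.append(
            f'<p class="defn"><span class="pos">{escape(pos)}</span> {escape(gloss)}</p>'
        )
    phrases = entry.get("phr", [])
    if phrases:
        items = "".join(
            f'<li><b>{escape(p["en"])}</b> {escape(p["zh"])}</li>' for p in phrases
        )
        parts.append(f'<ul class="phr">{items}</ul>')
    return "\n".join(parts)

html_fragment = render_entry("take", entry)
print(html_fragment)
```

With an html-based mdx the maker ships this markup directly; with json the same work moves into a template that either the maker or the app has to own.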

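The premise behind my case for JSON is that it is the most universal and natural data-interchange format today and for the foreseeable future. XML's structure is more complex and more free-form than JSON's, reliably extracting structured JSON from XML is often very hard, and in the end all of that complexity lands on the software developers.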

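True, using JSON significantly raises the bar for dictionary makers and lengthens production time. But the dictionaries commonly used with Yomitan number no more than ten anyway, and covering everyday needs is enough. There's no need at all for a new format to march into the past shoulder to shoulder with MDX.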

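Using JSON will certainly cost expressiveness, but so will XML. You'll never win over the people who use HTML, and user habits are hard to change, so it's better to give up on them early, step out of the existing market and reach new target users.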

XML is the most universal and natural format for document data interchange today and for the foreseeable future, no question about it

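Word, dictionaries and web pages are all XML or HTML, and dictionaries need XML all the more. If something can be expressed easily in fixed JSON, it definitely won't be a quality dictionary, and probably not even a usable one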

If you insist on JSON, Merriam-Webster is a good example

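They insisted on JSON and the result is utter chaos that even a human can barely read. When it gets to the text they gave up entirely: the strings embed their own homegrown md-style markup, and after n levels of loops you finally reach the "text", yet once you have the string you still have to parse it the way you would xml. Isn't that taking the long way round?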

"{bc}any of several domesticated {dx_def}see {dxt|domesticate:1||2}
                {\/dx_def} or wild {d_link|gallinaceous|gallinaceous}
                birds {dx}compare {dxt|guinea fowl||}, {dxt|jungle fowl||}{\/dx}"
"Middle English {it}foul{\/it}, from Old English {it}fugel{\/it};
    akin to Old High German {it}fogal{\/it} bird, and probably to Old
    English {it}fl\u0113ogan{\/it} to fly
    {ma}{mat|fly|}{\/ma}"

All you can say is that this is a new "XML" dressed up as JSON
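
To show how XML-like that inline markup is, here is a toy Python parser for the simple paired tokens such as {it}…{/it}. It is only a sketch: `parse_inline` is a made-up name, treating {bc} as a standalone token is an assumption based on the samples in this thread, and pipe-style tokens like {dxt|…} are left in the text untouched. It expects the string after JSON decoding, so {\/it} has already become {/it}.

```python
import re

TOKEN = re.compile(r"\{(/?)([a-z_]+)\}")
VOID = {"bc"}  # assumed standalone tokens with no closing partner

def parse_inline(markup):
    """Turn '{it}foul{/it}'-style markup into a nested list of (tag, children)."""
    root = ("root", [])
    stack = [root]
    pos = 0
    for match in TOKEN.finditer(markup):
        if match.start() > pos:
            stack[-1][1].append(markup[pos:match.start()])
        closing, name = match.group(1), match.group(2)
        if closing:
            # Pop only when the closer matches the open tag; others are ignored.
            if len(stack) > 1 and stack[-1][0] == name:
                stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if name not in VOID:
                stack.append(node)
        pos = match.end()
    if pos < len(markup):
        stack[-1][1].append(markup[pos:])
    return root[1]

tree = parse_inline("Middle English {it}foul{/it}, from Old English {it}fugel{/it}")
print(tree)
```

A tag stack, open and close tokens, and void elements: structurally this is an XML parser with curly braces.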

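For intricate documents JSON has no solution, and where it has one it's a bottomless pit; in the end it will inevitably turn into a deformed product built up from countless patches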

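HTML dictionary formats are already obsolete. Only by moving to JSON can you bring the whole AI and learning-app ecosystem to life, so that a dictionary is no longer an isolated app but a platform that all kinds of tools can call freely.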