且将新火试新茶 · Define a new dictionary format instead of continuing to use mdx

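zwyyy456 · February 27, 2026, 02:30

Agreed. I think the new dictionary format should be based on JSON; with dictionaries, quality matters more than quantity anyway.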

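last_idol · February 27, 2026, 02:32

Building primarily on EPUB while accommodating dictionary needs, similar to Kindle's KFX format, is the best option. In practice you could just add an entries directory to the EPUB directory tree and put the entries there; nothing else needs to change.

I'm only using EPUB as an example here. What should actually happen is to adopt EPUB's architecture and design the document AST from scratch.

The dictionary boom has already passed. Building yet another HTML-based dictionary app is a big stride back into the old days. From a product standpoint there is no need for it, since the existing mdx already covers that.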

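zwyyy456 · February 27, 2026, 02:38

I think the biggest problem with HTML-based mdx dictionaries is that HTML is too flexible. Every dictionary maker uses different classes and a different HTML structure, so tasks like parsing senses have to be adapted dictionary by dictionary, with no unified approach.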

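u3842 · February 27, 2026, 03:43

The most outrageous problems in MDict, the ones that must be solved:

Text is not normalized. This affects not just headwords but, even more seriously, file and folder names: anything that passes through macOS comes out NFD-normalized, and packing those into a dictionary creates a huge number of pitfalls.
In MDict, café and café (one in NFC, one in NFD) are different headwords, so a lookup can miss the café stored in an mdx v3 file.
There is no clear convention for the names of the resource files that accompany a dictionary file (CSS, JS, cover image, and so on), let alone for their precedence.
The @@@LINK= logic is a complete mess. Should linked entries share history or be displayed as duplicates? Does a link containing # name an entry, or an entry plus an id to jump to? The original author's notes imply there is exactly one target entry, but many dictionaries have more than one.
A pitfall left by later developers: treating the explicit instruction "use the <a href="sound…… form; only …… is supported" as decoration, so these cases need extra adaptation. The dictionary app's listener may also conflict with the one inside the dictionary.
A major pitfall left by later developers: early mdx was not open source and the file format was never officially published, so compatibility built on reverse engineering is unreliable. For example, in mdx v2 the entry content ends with a fixed \r\n terminator that can simply be cut off, but some files have only a \n terminator, so the cut may remove the closing > of the HTML. Some mdx files even invent things out of thin air, such as Big5 encoding.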
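
The normalization pitfall above can be reproduced in a few lines. This is a minimal sketch, not something MDict does: the function name `headword_key` and the policy of NFC plus case folding are assumptions chosen for illustration.

```python
import unicodedata


def headword_key(text: str) -> str:
    """Build a lookup key: NFC-normalize, then case-fold (hypothetical policy)."""
    return unicodedata.normalize("NFC", text).casefold()


nfc = "caf\u00e9"    # 'é' as a single precomposed code point
nfd = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings look identical on screen but compare unequal.
print(nfc == nfd)
# After normalizing both sides, the lookup keys match.
print(headword_key(nfc) == headword_key(nfd))
# The raw strings even differ in length.
print(len(nfc), len(nfd))
```

Applying the same key function when building the index and when looking up a word would make both spellings resolve to one entry; the same idea applies to resource file names.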

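Only after that should new features and content structure be considered.

For example, mdx only supports plain English sorting and case-insensitive matching, which is unusable for other languages. mdx v3 does add something new, but that likely has even more pitfalls. There is no "auxiliary index" (from the user's point of view). @@@LINK may have helped a little, but it also made dictionary content messier……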

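u3842 · February 27, 2026, 04:05

I don't entirely agree with this. In general an AI is better at "speaking" JSON than XML, but feeding it to an AI is probably a different story. Once a JSON structure gets complex you easily run into }}]}]. In XML or HTML, bold, italics, and links on a few words inside a run of text are expressed coherently and simply; if you insist on JSON you need another layer of }] plus a pile of extra punctuation. XML tags are closed, and the tags themselves carry strong semantic information. Take this Longman example: the phrasal verb take up alone, inside the entry take, is this long. When you read </PhrVbEntry> you know this phrasal verb has ended, and what follows is either the next <PhrVbEntry> or an </Entry> that ends the whole entry. JSON has none of this, and it doesn't support comments either.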

  <PhrVbEntry id="u2fc098491a42200a.6e2b450a.1150446158e.28f2">
    <Head>
      <PHRVBHWD>take up</PHRVBHWD>
      <POS>phr v</POS>
    </Head>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28f7">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to become interested in a new activity and to spend time doing it</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28fb">Roger took painting up for a while, but soon lost interest.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.28fc">
      <LEXUNIT>take sth up</LEXUNIT>
      <DEF>to start a new job or have a new responsibility</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.28ff">Peter will take up the management of the finance department.</EXAMPLE>
        <ColloExa>
          <COLLO>take up a post/a position/duties etc</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2902">The headteacher takes up her duties in August.</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2903">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>if you take up a suggestion, problem, complaint etc, you start to do something about it</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2907">Now the papers have taken up the story.</EXAMPLE>
        <GramExa>
          <PROPFORMPREP>with</PROPFORMPREP>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290a">The hospital manager has promised to <COLLOINEXA>take the matter up</COLLOINEXA> with the member of staff involved.</EXAMPLE>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.290c">I am still very angry and will be taking it up with the authorities.</EXAMPLE>
        </GramExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.290d">
      <LEXUNIT>take up sth</LEXUNIT>
      <DEF>to fill a particular amount of time or space</DEF>
      <ExpandableInformation>
        <GramExa>
          <PROPFORM>be taken up with sth</PROPFORM>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2912">The little <COLLOINEXA>time</COLLOINEXA> I had outside of school was <COLLOINEXA>taken up</COLLOINEXA> with work.</EXAMPLE>
        </GramExa>
        <ColloExa>
          <COLLO>take up space/room</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2917">old books that were taking up space in the office</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2918">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to accept a suggestion, offer, or idea</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291b">Rob <COLLOINEXA>took up the invitation</COLLOINEXA> to visit.</EXAMPLE>
        <ColloExa>
          <COLLO>take up the challenge/gauntlet</COLLO>
          <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.291f">Rick took up the challenge and cycled the 250-mile route alone.</EXAMPLE>
        </ColloExa>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2920">
      <LEXUNIT>take up sth</LEXUNIT>
      <DEF>to move to the exact place where you should be, so that you are ready to do something</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.2923">The runners are <COLLOINEXA>taking up</COLLOINEXA> their <COLLOINEXA>positions</COLLOINEXA> on the starting line.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.2926">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to make a piece of clothing shorter</DEF>
      <OPP>let down</OPP>
    </Sense>
    <Sense id="u2fc098491a42200a.6e2b450a.1150446158e.292b">
      <LEXUNIT>take sth ↔ up</LEXUNIT>
      <DEF>to continue a story or activity that you or someone else had begun, after a short break</DEF>
      <ExpandableInformation>
        <EXAMPLE id="u2fc098491a42200a.6e2b450a.1150446158e.292f">I’ll take up the story where you left off.</EXAMPLE>
      </ExpandableInformation>
    </Sense>
  </PhrVbEntry>

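last_idol · February 27, 2026, 04:22

LLM function-calling input and output are all JSON. XML is too complex; you yourself can't necessarily extract the correct JSON from it. XDXF has already proven the failure of XML, so there's no need to try again.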

YYDWHY · February 27, 2026, 06:24

Asking because I don't know: what format do Japanese EPWING dictionaries use?

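u3842 · February 27, 2026, 09:00

These two structures are two extremes.

If you are just calling an API, with no complex or variable nesting, a fixed structure, unordered keys, and no rich text, there is really no need for XML. Take the demo above: two keys sent to the backend and a single [] coming back. Nothing is simpler than JSON for that.

JSON suits regular data whose structure is strictly hierarchical, but for complex documents meant to be read by people it gets very hard, and that is exactly what XML and HTML were made for. JSON keys are unordered, so you start with a [], then inside a for loop an if followed by several else-ifs before descending a level. Once rich text shows up, JSON is a disaster: do you really have to wrap another layer of [{}, {}……] and put another if inside the for? Some of the fixed structures in this dictionary have no fixed level or position: they may be inside a sense, inside a usage within a sense, outside the sense……

Rendering this on a web page with CSS, as in: I waved, but he didn’t take any notice (=pretended not to notice). British English, is extremely simple. In JSON you need nesting again, and this is only a single example sentence.

<EXAMPLE id="u2f……">I waved, but he <COLLOINEXA>didn’t take any notice</COLLOINEXA><GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>

Converting it to JSON is not impossible, but the cost…… Still, there are ready-made tools that do exactly this.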

{
    "type": "EXAMPLE",
    "content": [
        {
            "type": "text",
            "text": "I waved, but he "
        },
        {
            "type": "COLLOINEXA",
            "text": "didn’t take any notice"
        },
        {
            "type": "GLOSS",
            "text": "pretended not to notice"
        },
        {
            "type": "text",
            "text": "."
        },
        {
            "type": "GEO",
            "text": "BrE"
        }
    ]
}
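
As a rough illustration of that conversion, the sketch below flattens one level of XML mixed content into a list of typed nodes using Python's standard `xml.etree.ElementTree`. The function name `to_nodes` is hypothetical, the elided `id` attribute is left out, and deeper nesting (a tag inside a tag) is deliberately not handled.

```python
import json
import xml.etree.ElementTree as ET


def to_nodes(elem):
    """Flatten one level of mixed content into {"type", "content": [...]}."""
    nodes = []
    if elem.text:
        # Text that appears before the first child element.
        nodes.append({"type": "text", "text": elem.text})
    for child in elem:
        nodes.append({"type": child.tag, "text": child.text or ""})
        if child.tail:
            # Text that follows this child, up to the next sibling.
            nodes.append({"type": "text", "text": child.tail})
    return {"type": elem.tag, "content": nodes}


xml = (
    "<EXAMPLE>I waved, but he "
    "<COLLOINEXA>didn\u2019t take any notice</COLLOINEXA>"
    "<GLOSS>pretended not to notice</GLOSS>.<GEO>BrE</GEO></EXAMPLE>"
)
doc = to_nodes(ET.fromstring(xml))
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

The consumer of such JSON still has to branch on `type` for every node, which is the cost the post is pointing at.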

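u3842 · February 27, 2026, 09:10

The content resembles XML or HTML but is stored in binary form. Probably because of hardware limits at the time, it uses "control codes" in place of things like <b></b>. There are also bitmap-font characters and jumps to absolute addresses, which feel very dated.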

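u3842 · February 27, 2026, 14:37

I wrote a bit of CSS for this and it went smoothly. Strongly structured documents have to rely on XML or HTML; JSON only makes things more complicated. The tags themselves carry strong semantic information: when you see a closing </> tag you know which section has ended. That is much clearer than a page full of div and span, and far better than a pile of brackets.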

    <sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.2641">
      <sign-post>action</sign-post>
      <grammar-label>T</grammar-label>
      <sense-definition>used with a noun instead of using a verb to describe an action. For example, if you take a walk, you walk somewhere</sense-definition>
      <expandable-information>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2646">Would you like to take a look?</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2647">Mike’s just taking a shower.</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2648">Sara took a deep breath.</example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2649">I waved, but he <collocation-highlight>didn’t take any notice</collocation-highlight><context-gloss>pretended not to notice</context-gloss>.<GEO>BrE</GEO></example-sentence>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.264d">Please <collocation-highlight>take a seat</collocation-highlight><context-gloss>sit down</context-gloss>.</example-sentence>
        <ColloExa>
          <collocation-pattern>take a picture/photograph/photo</collocation-pattern>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2652">Would you mind taking a photo of us together?</example-sentence>
        </ColloExa>
      </expandable-information>
    </sense-data>
    <sense-data id="u2fc098491a42200a.6e2b450a.1150446158e.266a">
      <sign-post>remove</sign-post>
      <grammar-label>T</grammar-label>
      <sense-definition>to remove something from a place</sense-definition>
      <Thesref>
        <Crossrefto refid="u2fc098491a42200a.262cc60a.117ee7c0f66.-1683">
          <REFHWD>steal</REFHWD>
          <REFHOMNUM>1</REFHOMNUM>
        </Crossrefto>
      </Thesref>
      <expandable-information>
        <grammar-example>
          <pattern-form>take sth off/from etc sth</pattern-form>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2670">Take your feet off the seats.</example-sentence>
          <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2671">Someone’s taken a pen from my desk.</example-sentence>
        </grammar-example>
        <example-sentence id="u2fc098491a42200a.6e2b450a.1150446158e.2672">Police say money and jewellery were taken in the raid.</example-sentence>
        <Crossref>
          <Crossrefto refid="u2fc098491a42200a.6e2b450a.1150446158e.27e5">
            <REFHWD>take </REFHWD>
          </Crossrefto>
        </Crossref>
      </expandable-information>
    </sense-data>

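If the content is standardized, the CSS becomes very simple and ordinary users get more room for customization. Extracting information by section is also simple: examples, collocations, the examples under a particular collocation, phrasal verbs, the examples of one phrasal verb of a word…… all of these are straightforward.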
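
To make the extraction claim concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` on an abridged copy of the second sense above (ids and cross-references removed). The helper name `text_of` and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<sense-data>
  <sense-definition>to remove something from a place</sense-definition>
  <expandable-information>
    <grammar-example>
      <pattern-form>take sth off/from etc sth</pattern-form>
      <example-sentence>Take your feet off the seats.</example-sentence>
      <example-sentence>Someone’s taken a pen from my desk.</example-sentence>
    </grammar-example>
    <example-sentence>Police say money and jewellery were taken in the raid.</example-sentence>
  </expandable-information>
</sense-data>
""".strip()

root = ET.fromstring(SAMPLE)


def text_of(elem):
    """Concatenate all text inside an element, including nested inline tags."""
    return "".join(elem.itertext()).strip()


# Every example sentence in the sense, in document order.
all_examples = [text_of(e) for e in root.iter("example-sentence")]

# Only the examples that sit directly under a given grammar pattern.
by_pattern = {
    text_of(g.find("pattern-form")): [text_of(e) for e in g.findall("example-sentence")]
    for g in root.iter("grammar-example")
}

print(all_examples)
print(by_pattern)
```

Because the tag names state what each block is, the two queries differ only in where the search starts.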

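last_idol · February 28, 2026, 03:07

Entry pages on the Merriam-Webster website are generated from a JSON data source. In terms of entry structure, Merriam-Webster should be among the more complex English dictionaries, and I don't see JSON posing any serious problem. Rich text can follow the custom markup tokens used on the Merriam-Webster site, such as {it}{/it}.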
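
A small sketch of what handling such tokens might involve, assuming only the paired markers {it} and {wi} that appear in this thread. The function name `parse_inline` is hypothetical; standalone tokens such as {bc} and piped tokens such as {dxt|...} are left as plain text here, and the input is assumed to be the already JSON-decoded string (so {/it}, not {\/it}).

```python
import re

# Assumption: only these paired markers are recognized.
PAIRED = ("it", "wi")
TOKEN = re.compile(r"\{(/?)(" + "|".join(PAIRED) + r")\}")


def parse_inline(s):
    """Split a string into (active_markers, text) runs."""
    runs, stack, pos = [], [], 0
    for m in TOKEN.finditer(s):
        if m.start() > pos:
            runs.append((tuple(stack), s[pos:m.start()]))
        closing, name = m.groups()
        if closing:
            # A closing marker must match the most recently opened one.
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected closing marker for {name!r}")
            stack.pop()
        else:
            stack.append(name)
        pos = m.end()
    if stack:
        raise ValueError(f"unclosed marker {stack[-1]!r}")
    if pos < len(s):
        runs.append((tuple(stack), s[pos:]))
    return runs


runs = parse_inline("Middle English {it}foul{/it}, from Old English {it}fugel{/it}")
for styles, text in runs:
    print(styles, repr(text))
```

Each run can then be rendered however the client likes, for example as italic text when "it" is among the active markers.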

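Android 17 will officially support app-level function calling, with JSON for both input and output. I expect every dictionary app will need to implement this, and by then mdx's problems will be magnified.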

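u3842 · February 28, 2026, 08:10

It can't be JSON. Most dictionaries with complex structure are XML or something like it underneath. Any XML can be converted to JSON, but how to convert it has to be agreed on. This one has been converted to JSON, yet it is painful to process: heterogeneous arrays muddle logic that JSON should keep clear, make parsing harder, and extend poorly. Converting to nested arrays of objects would be clearer but adds more levels. And when text is a string with {wi} in it, isn't that still an XML look-alike{\/wi}? How much more complicated does a simple need like a link on a piece of text become here? (The JSON and XML below both come from https://dictionaryapi.com/products/json, formatted and then wrapped in one extra layer.)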

[
  [
    "sense",
    {
      "sn": "1 a",
      "sgram": "T\/I",
      "dt": [
        [
          "text",
          "{bc}to shut down and restart (a computer or program) "
        ],
        [
          "vis",
          [
            {
              "t": "\u2026 the annoyance of having to {wi}reboot{\/wi} the computer to switch operating systems \u2026",
              "aq": {
                "auth": "Robert Weston"
              }
            },
            {
              "t": "If anything ever happens to the original drive, you can {wi}reboot{\/wi} using the cloned drive and be up and running in minutes.",
              "aq": {
                "auth": "Dan Frakes"
              }
            }
          ]
        ]
      ]
    }
  ],
  [
    "sense",
    {
      "sn": "b",
      "sgram": "I",
      "dt": [
        [
          "text",
          "{bc}to start up again after closing or shutting down {bc}to boot up again "
        ],
        [
          "vis",
          [
            {
              "t": "waiting for a computer\/program to {wi}reboot{\/wi}"
            }
          ]
        ]
      ]
    }
  ]
]

<sseq>
  <sense>
    <sn>1 a</sn>
    <sgram>T/I</sgram>
    <dt>{bc}to shut down and restart (a computer or program)
      <vis>
        <vi>
          <t>… the annoyance of having to {wi}reboot{/wi} the computer to switch operating systems … </t>
          <aq>
            <auth>Robert Weston</auth>
          </aq>
        </vi>
        <vi>
          <t>If anything ever happens to the original drive, you can {wi}reboot{/wi} using the cloned drive and be up and running in minutes. </t>
          <aq>
            <auth>Dan Frakes</auth>
          </aq>
        </vi>
      </vis>
    </dt>
  </sense>
  <sense>
    <sn>b</sn>
    <sgram>I</sgram>
    <dt>{bc}to start up again after closing or shutting down {bc}to boot up again
      <vis>
        <vi>
          <t>waiting for a computer/program to {wi}reboot{/wi}</t>
        </vi>
      </vis>
    </dt>
  </sense>
</sseq>
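
For comparison, this is roughly what walking the heterogeneous [tag, payload] arrays from the JSON sample looks like in Python. It is a sketch over an abridged copy of that sample (one quotation kept per sense); the function name `iter_examples` is hypothetical and only the "sense", "dt", and "vis" shapes shown in the thread are handled.

```python
import json

# Abridged from the JSON sample in this post.
SSEQ = json.loads(r'''
[
  ["sense", {"sn": "1 a", "sgram": "T\/I", "dt": [
      ["text", "{bc}to shut down and restart (a computer or program) "],
      ["vis", [{"t": "If anything ever happens to the original drive, you can {wi}reboot{\/wi} using the cloned drive and be up and running in minutes.",
                "aq": {"auth": "Dan Frakes"}}]]
  ]}],
  ["sense", {"sn": "b", "sgram": "I", "dt": [
      ["text", "{bc}to start up again after closing or shutting down {bc}to boot up again "],
      ["vis", [{"t": "waiting for a computer\/program to {wi}reboot{\/wi}"}]]
  ]}]
]
''')


def iter_examples(sseq):
    """Yield (sense number, example text) from [tag, payload] pairs."""
    for tag, payload in sseq:
        if tag != "sense":
            continue  # other element kinds are skipped in this sketch
        for kind, value in payload.get("dt", []):
            # Each dt item is itself a [kind, value] pair; the type of
            # value depends on kind (a string for "text", a list for "vis").
            if kind == "vis":
                for item in value:
                    yield payload.get("sn", ""), item["t"]


examples = list(iter_examples(SSEQ))
for sn, text in examples:
    print(sn, "|", text)
```

Every level needs a branch on the first array element before the second can be interpreted, and the returned strings still contain {wi} markers that need a second parsing pass.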

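u3842 · February 28, 2026, 08:29

Something as structurally complex as a dictionary should have used XML in the first place. Once converted to JSON, parsing it is no different from hand-writing an XML parser, unless you throw away some structure or information, and it isn't clear either, this one even less so. Cryptic names like sn, dt, vis; compromises like {bc} and {wi}; heterogeneous arrays that seem to exist only to annoy people; and when it finally uses an object it wraps it in yet another layer. Nobody wants to untangle a mess like this.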

u3842 · February 28, 2026, 08:37

Pulled from 有d词d and partly cleaned up. The structure is clear, but the content……

{
  "phonuk": "teɪk",
  "phonus": "teɪk",
  "defn": [
    [
      "v.",
      "携带，拿走；带去，引领；使达到，提升；拿，取；移走，拿开；偷走，误拿；取材于，收集；攻占，控制；选中，买下；订阅（报纸等）；吃，服用；减去；记录，摘录；照相，摄影；量取，测定；就（座）；以…...为例；接受，收取；接纳，接待（顾客、患者等）；遭受，经受；忍受，容忍；（以某种方式）对待，处理；理解，考虑；误以为；赢得（比赛、竞赛等）；产生（感情），持有（看法）；采取（措施），采用（方法）；做，拥有；采用（形式），就任（职位）；花费，占用（时间）； 需要，要求；使用；穿（特定尺码的鞋或衣物）；容纳；授课；学习，选修（课程）；参加（考试或测验）；走（路线），乘坐（交通工具）；跨过，跳过；踢，掷；举行投票，进行民意调查；成功，奏效；（语法）需带有（某种结构）"
    ],
    [
      "n.",
      "（一次拍摄的）镜头，场景；收入量；看法，态度；（印刷）一次排版量"
    ]
  ],
  "form": [
    {
      "fn": "复数",
      "en": "takes"
    },
    {
      "fn": "第三人称单数",
      "en": "takes"
    },
    {
      "fn": "现在分词",
      "en": "taking"
    },
    {
      "fn": "过去式",
      "en": "took"
    },
    {
      "fn": "过去分词",
      "en": "taken"
    }
  ],
  "phr": [
    {
      "en": "take care of oneself",
      "zh": "照顾自己；颐养"
    },
    {
      "en": "take part",
      "zh": "参与， 参加"
    },
    {
      "en": "take part in",
      "zh": "参加，参与"
    },
    {
      "en": "take on",
      "zh": "承担；呈现；具有；流行；接纳；雇用；穿上"
    },
    {
      "en": "take up",
      "zh": "拿起；开始从事"
    },
    {
      "en": "take effect",
      "zh": "生效；起作用"
    },
    {
      "en": "take off",
      "zh": "起飞；脱下；离开"
    },
    {
      "en": "take a look",
      "zh": "看一下"
    },
    {
      "en": "take out",
      "zh": "v. 取出；去掉；出发；抵充"
    },
    {
      "en": "take into",
      "zh": "考虑到；说服"
    },
    {
      "en": "take in",
      "zh": "接受；理解；拘留；欺骗；让…进入；改短"
    },
    {
      "en": "take seriously",
      "zh": "重视；认真对待…"
    },
    {
      "en": "take away",
      "zh": "带走，拿走，取走"
    },
    {
      "en": "take a look at",
      "zh": "[口]看一看；检查"
    },
    {
      "en": "take over",
      "zh": "接管；接收"
    },
    {
      "en": "take for granted",
      "zh": "认为…理所当然"
    },
    {
      "en": "take the lead",
      "zh": "v. 带头；为首"
    },
    {
      "en": "take charge of",
      "zh": "接管，负责"
    },
    {
      "en": "take good care",
      "zh": "好好照顾；珍重"
    }
  ],
  "syn": [
    {
      "pos": "vt.",
      "ws": [
        "carry",
        "adopt",
        "have",
        "eat",
        "assume"
      ],
      "zh": "拿，取；采取；吃；接受"
    },
    {
      "pos": "vi.",
      "ws": [
        "pick up",
        "get access to"
      ],
      "zh": "拿；获得"
    }
  ],
  "der": {
    "root": "take",
    "rels": [
      {
        "pos": "adj.",
        "words": [
          {
            "word": "taking",
            "tran": "可爱的；迷人的；会传染的"
          }
        ]
      },
      {
        "pos": "n.",
        "words": [
          {
            "word": "taking",
            "tran": "取得；捕获；营业收入"
          },
          {
            "word": "taker",
            "tran": "接受者；接受打赌的人；捕获者"
          }
        ]
      },
      {
        "pos": "v.",
        "words": [
          {
            "word": "taking",
            "tran": "拿；捕捉；夺取（take的ing形式）"
          }
        ]
      }
    ]
  },
  "sen": [
    {
      "en": "I'll take my coat upstairs. Shall I take yours, Roberta?",
      "zh": "我将把我的外套拿到楼上去。要我把你的拿上去吗，罗伯塔？"
    },
    {
      "en": "She can't take criticism.",
      "zh": "她受不了批评。"
    },
    {
      "en": "We take the 'Express'.",
      "zh": "我们订阅的是《快报》。"
    }
  ]
}

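wynick27 · February 28, 2026, 09:13

The main problem with JSON is that the display layer and the data structure don't match and must be handled separately, which greatly raises the demands on dictionary makers and the time they need. So adoption will be a big problem, the same problem Yomitan has.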

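last_idol · February 28, 2026, 10:17

I advocate JSON on the premise that it is the most universal and natural data-interchange format now and for the foreseeable future. XML structure is more complex and freer than JSON, and reliably extracting structured JSON from XML is often very hard; in the end all of that complexity lands on the software developers.

True, JSON significantly raises the bar for dictionary makers and lengthens production time. But the commonly used Yomitan dictionaries number no more than ten anyway, and covering everyday needs is enough. There is no need for a new format to march shoulder to shoulder with MDX into the past.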

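last_idol · February 28, 2026, 10:31

Using JSON will certainly cost expressiveness, but so will XML. You can never please the people who use HTML. User habits are hard to change, so it's better to give up on them early, leave the existing market, and reach new target users.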

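u3842 · February 28, 2026, 12:52

XML is the most universal and natural document data-interchange format now and for the foreseeable future, beyond question.

Word documents, dictionaries, and web pages are all XML or HTML, and dictionaries need XML all the more. Anything that a fixed JSON schema can express easily will certainly not be a high-quality dictionary, and may not even be a usable one.

If you insist on JSON, Merriam-Webster is a good example.

Insisting on JSON made a complete mess that even humans struggle to read. When it reaches the text it gives up entirely: strings wrapped in a homegrown Markdown-like syntax. After n levels of loops you finally get the "text", and once you have the string you still have to parse it the way you would XML. Isn't that taking the long way around?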

"{bc}any of several domesticated {dx_def}see {dxt|domesticate:1||2}
                {\/dx_def} or wild {d_link|gallinaceous|gallinaceous}
                birds {dx}compare {dxt|guinea fowl||}, {dxt|jungle fowl||}{\/dx}"
"Middle English {it}foul{\/it}, from Old English {it}fugel{\/it};
    akin to Old High German {it}fogal{\/it} bird, and probably to Old
    English {it}fl\u0113ogan{\/it} to fly
    {ma}{mat|fly|}{\/ma}"

This can only be described as designing a new "XML" dressed up as JSON.

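u3842 · February 28, 2026, 13:00

For intricate documents JSON has no solution, and if there is one it's a bottomless pit. It will inevitably evolve into a deformed product built from countless patches.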

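last_idol · February 28, 2026, 13:04

HTML dictionary formats are already obsolete. Only a move to JSON can activate the whole AI and learning-app ecosystem, so that a dictionary is no longer an isolated app but a platform that all kinds of tools can call freely.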
