En.wiktionary.org 20260327 update

It is generated on demand, so it is not a periodic snapshot that goes stale.

There are still a small number of duplicates (older versions of the same article).

I have a way to drop them as part of data cleansing.


How about parsing data from the ZIM files? They are updated regularly.

I think the ZIM files dropped all quotations, and they may lack images and audio links. I do remember converting Wiktionary from a ZIM file, and that is what I found.
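
If anyone wants to verify that, here is a minimal sketch for spot-checking an entry, assuming the python-libzim bindings; the file name, entry path, and the "citation-whole" class are my assumptions, not something from this thread:

```python
# Spot-check one entry of a Wiktionary ZIM for quotation markup.
from libzim.reader import Archive

zim = Archive("wiktionary_en_all_maxi.zim")  # hypothetical file name

# Older ZIMs prefix article paths with "A/"; newer ones may not.
path = "A/dictionary" if zim.has_entry_by_path("A/dictionary") else "dictionary"
html = bytes(zim.get_entry_by_path(path).get_item().content).decode("utf-8")

# en.wiktionary wraps quotations in elements with class "citation-whole";
# verify that class name against the live site before relying on it.
print("quotations present:", "citation-whole" in html)
```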


I used (title, id, url) as the primary key and kept only the row with the latest date_modified, as a stringent dedupe criterion. The result is a whopping 14% reduction (i.e., duplication) rate by volume.
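
For reference, a minimal sketch of that dedupe step, assuming the dump has been flattened into rows with title, id, url, date_modified, and an html body; the column names and input file are illustrative, not the actual pipeline:

```python
# Keep only the newest revision per (title, id, url); report duplication by volume.
import pandas as pd

rows = pd.read_json("zhwiki_snapshot.ndjson", lines=True)  # hypothetical input file
rows["date_modified"] = pd.to_datetime(rows["date_modified"])

deduped = (
    rows.sort_values("date_modified", ascending=False)
        .drop_duplicates(subset=["title", "id", "url"], keep="first")
)

# Duplication rate by volume (bytes of article HTML), matching the ~14% figure.
total = rows["html"].str.len().sum()
kept = deduped["html"].str.len().sum()
print(f"duplicate/stale share by volume: {1 - kept / total:.1%}")
```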

This means the snapshot chunking algorithm is really faulty: it retrieves data regardless of whether it is the current version. A full 14% of its data by volume is duplicate or stale.

Correction: zh.wikipedia has approximately 14% stale duplicate content.

en.wikipedia so far has 0% stale duplicate content. They fixed this bug as of the 23rd; perhaps the zh.wikipedia snapshot downloaded on Oct 11th predates the fix.


Thanks for sharing! How do I open the example sentences?

Could someone clean up the empty list numbers in it?

Look up an entry and scroll down; you will find the example sentences there (note: not every entry has them).

I thought the HTML dump was no longer available after 24 March 2025:

What is the data source from which you created the mdd/mdx of the English and Chinese Wikipedia?

That service was a convenience storage server built around the commercial site, serving the snapshots and live data, possibly for profit (it is a commercial domain).

Wikimedia Enterprise - APIs for AI, Search & Knowledge Graphs

So you paid for that service?

I didn’t, but they have paying customers. The data service is capped for free users.


The AI replacement for human-written Wikipedia is here:


I guess a free account is already good enough for our goals.

You get 15 snapshot requests, and failed requests still count against the cap, which is likely with large downloads, so 15 is very low. Plus, the limit is IP-based.
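
One way to avoid spending that quota on failed transfers is to resume an interrupted download rather than re-requesting it. A sketch under the assumptions that the server honors HTTP Range requests and that the endpoint URL below is right; both should be checked against the Wikimedia Enterprise docs:

```python
# Resume a partially downloaded snapshot instead of spending a fresh request.
import os

import requests

URL = "https://api.enterprise.wikimedia.com/v2/snapshots/zhwiki_namespace_0/download"  # assumed
OUT = "zhwiki.tar.gz"
TOKEN = os.environ["WME_ACCESS_TOKEN"]  # token acquisition not shown

def resume_download(url: str, path: str) -> None:
    done = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {
        "Authorization": f"Bearer {TOKEN}",
        "Range": f"bytes={done}-",  # continue from what we already have
    }
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        mode = "ab" if r.status_code == 206 else "wb"  # 206 = server honored Range
        with open(path, mode) as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

resume_download(URL, OUT)
```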


Fixed the problem of the extra list numbers.

Also removed the redundant "Category:" prefix.
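
For anyone doing similar cleanup, a tiny sketch of both fixes; the regexes are my guesses at the markup, not the actual build script:

```python
# Strip a redundant "Category:" title prefix and drop empty list items,
# which render as bare extra numbers inside an ordered list.
import re

def clean_entry(title: str, html: str) -> tuple[str, str]:
    title = re.sub(r"^Category:", "", title)
    html = re.sub(r"<li>\s*</li>", "", html)
    return title, html

print(clean_entry("Category:Fruits", "<ol><li></li><li>apple</li></ol>"))
# -> ('Fruits', '<ol><li>apple</li></ol>')
```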


many thanks

Updated the CSS (small mdd).


New data / new skin:

No longer uses MDictPC as the baseline.


Could the example sentences be collapsed by default?

That should come in the next update.
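
For anyone who wants to try it sooner, one possible approach is to wrap each example block in a <details> element (collapsed by default) during MDX generation. A sketch assuming en.wiktionary's "h-usage-example" class and a renderer that supports <details>; verify both before relying on this:

```python
# Wrap each usage-example block in <details> so it starts collapsed.
import re

def collapse_examples(html: str) -> str:
    # Naive regex: assumes the example block contains no nested <div>.
    return re.sub(
        r'(<div class="h-usage-example">.*?</div>)',
        r"<details><summary>例句</summary>\1</details>",
        html,
        flags=re.S,
    )
```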
