How to design a personal-use dictionary format with a high compression ratio

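amob · December 26, 2024 12:25

Our site ranks very high in Google and Bing search; for dictionary-related keywords it is usually the first or second result. On Baidu, by contrast, it ranks fairly low.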

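last_idol · December 26, 2024 12:27

kiwix doesn't have many users either. From what I see in their chat room, the main users are from India, Africa, and South America, and hardly anyone is online most of the time.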

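spoony · December 26, 2024 12:27

Brotli itself relies on a built-in dictionary that came out of training on a corpus. I would guess it also allows a user-supplied dictionary. If so, it is probably about on par with zstd. But training a custom dictionary just for this is probably quite a hassle.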

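last_idol · December 26, 2024 12:28

Training one yourself is easy. zstd ships with a training tool: it first reads through the text files you want to compress and then generates a custom compression dictionary. (The problem is still that no dictionary software is compatible with it.)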
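The effect of such a compression dictionary on small, independently compressed entries can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the `zdict` preset-dictionary parameter of Python's standard-library `zlib` as a stand-in for zstd, the sample entries are made up, and `build_naive_dict` is a hypothetical helper that simply concatenates samples, whereas zstd's trainer picks frequent substrings automatically.

```python
import zlib


def make_entry(word, definition):
    """Build a made-up dictionary entry; real entries share a lot of markup."""
    html = (
        f'<div class="entry"><span class="hw">{word}</span>'
        f'<span class="pos">noun</span>'
        f'<span class="def">{definition}</span></div>'
    )
    return html.encode("utf-8")


# Hypothetical sample data, not taken from any real dictionary.
ENTRIES = [
    make_entry("apple", "a round fruit"),
    make_entry("banana", "a long yellow fruit"),
    make_entry("cherry", "a small red fruit"),
    make_entry("grape", "a small round fruit growing in bunches"),
]


def build_naive_dict(samples):
    """Stand-in for a trained dictionary: concatenate a few sample entries."""
    return b"".join(samples)


def compress_entry(data, zdict=None):
    """Compress one entry on its own, optionally with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_entry(blob, zdict=None):
    """Inverse of compress_entry; the same dictionary must be supplied."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    zdict = build_naive_dict(ENTRIES[:3])  # "train" on the first three entries
    target = ENTRIES[3]                    # compress an entry not in the dictionary
    print("raw bytes:      ", len(target))
    print("without dict:   ", len(compress_entry(target)))
    print("with dict:      ", len(compress_entry(target, zdict)))
```

The shared markup lives in the dictionary once, so each entry only pays for what is unique to it. With zstd the same idea applies, except that the dictionary comes from its training tool rather than from hand-picked samples.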

hua · December 26, 2024 12:36

DictTango's own format probably uses ZSTD.

spoony · December 26, 2024 12:36

Brotli can indeed utilize user-defined dictionaries to enhance its compression capabilities. This feature allows Brotli to achieve better compression ratios, especially when the dictionary is tailored to the specific content being compressed. Shared compression dictionaries can significantly improve the efficiency of Brotli by including common patterns or terms that frequently appear in the data, which helps in reducing the overall size of the compressed output.
Regarding the comparison between Brotli and Zstandard (Zstd) in terms of compression ratios for specific datasets like Wikipedia, Brotli generally achieves higher compression ratios than Zstd, particularly for text-based content. For example, Brotli is known to compress HTML files around 20% smaller and JavaScript files about 15% smaller compared to Gzip, and it often outperforms Zstd in terms of compression density for similar datasets.
However, while Brotli may provide better compression ratios, Zstd is typically faster in bo

hua · December 26, 2024 12:36

You can give it a try. Brotli did not do well in my tests. Trying it in practice is the best way to find out.

last_idol · December 26, 2024 12:39

That comparison was made with ZSTD not using a custom compression dictionary. ZSTD's ceiling is higher.

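spoony · December 26, 2024 12:42

If its ceiling is higher, then it looks like the better choice. Are there any references on this?

Actually, it is no big deal for every dictionary to ship with its own compression dictionary. That way it could also be built into general-purpose software.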
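One way a file could carry its own compression dictionary is sketched below. Everything here is an assumption made for illustration: the `DZ01` signature, the header layout, the JSON index, and the `pack` and `lookup` helpers are all hypothetical, and standard-library `zlib` with a preset dictionary again stands in for zstd.

```python
import io
import json
import struct
import zlib

MAGIC = b"DZ01"  # hypothetical 4-byte file signature


def _deflate(data, zdict):
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def _inflate(blob, zdict):
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


def pack(entries, zdict):
    """Pack {headword: definition bytes} into one blob.

    Layout (hypothetical): magic, dict length, index length,
    compression dictionary, JSON index, compressed entries.
    """
    body = io.BytesIO()
    index = {}
    for word, text in entries.items():
        blob = _deflate(text, zdict)
        index[word] = [body.tell(), len(blob)]  # offset and length inside body
        body.write(blob)
    index_bytes = json.dumps(index).encode("utf-8")
    header = MAGIC + struct.pack("<II", len(zdict), len(index_bytes))
    return header + zdict + index_bytes + body.getvalue()


def lookup(packed, word):
    """Decompress a single entry without touching the others."""
    if packed[:4] != MAGIC:
        raise ValueError("not a packed dictionary")
    dict_len, index_len = struct.unpack_from("<II", packed, 4)
    pos = 4 + 8  # skip magic and the two 32-bit lengths
    zdict = packed[pos:pos + dict_len]
    pos += dict_len
    index = json.loads(packed[pos:pos + index_len].decode("utf-8"))
    pos += index_len
    offset, length = index[word]
    return _inflate(packed[pos + offset:pos + offset + length], zdict)


if __name__ == "__main__":
    # Made-up entries and a hand-written dictionary of shared markup.
    shared = b'<div class="entry"><span class="def"></span></div>'
    entries = {
        "apple": b'<div class="entry"><span class="def">a round fruit</span></div>',
        "grape": b'<div class="entry"><span class="def">a small fruit</span></div>',
    }
    packed = pack(entries, shared)
    print(lookup(packed, "grape"))
```

Because the reader gets the compression dictionary from the file itself, a general-purpose viewer would not need to know anything about a particular dictionary in advance. Each entry is compressed on its own, so a lookup only decompresses the entry it needs.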

last_idol · December 26, 2024 12:50

Test it yourself. I have already run a lot of tests.

spoony · December 26, 2024 12:58

I asked ChatGPT, and zstd's ceiling is indeed higher than brotli's. For a case like wikipedia, zstd plus a trained compression dictionary is clearly the way to go.

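meandmyhomies · December 26, 2024 13:26

Take a look at this: it specifically compares more than 210 algorithms for compressing text. There is no need to invent a new format.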

https://mattmahoney.net/dc/text.html

zheshijie · December 26, 2024 13:39

This is awesome. It will take several days to read through.