maybe deduplication will save you 2 MB, is it worth the trouble?
that first paragraph says everything.
tbh I’ve shared in many places, and people weren’t such jerks about it. In the Latin community (Argentina, for example) they were ok with it.
this forum was literally dead for Japanese content for a long time, and I was posting Japanese dictionaries that I didn’t even have any obligation to post, because I could literally keep them private, but instead opted to post them because I somehow wanted to contribute;
and this is what I get? lmao
it’s not hard to understand, but @hua acts like a child or smth. I was expecting a more serious resolution from a forum admin.
go ahead and do whatever the hell you want with my account. I don’t need this forum, I was posting because there are a few people who want dictionaries and not “bytes”
Only anonymization is allowed. Maybe I should not close this topic. I want to watch your show.
this is called bias.
unilateral moderation, go ahead if you’ve got the guts and delete my account or something like that.
you don’t have the guts to do it?
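Are there any Japanese members on this forum?
Why would a discussion about improving a dictionary be seen as rude?
How should one phrase things so that it is a discussion of the dictionary without hurting the author's dignity?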
He’s Portuguese if memory serves me.
Who's to blame? Blame the Chinese for not using honorifics.
Japanese people use honorifics.
have you said thank you once?
lmao
I wrote a rough Python script to process the unpacked txt:
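    with open('旺文社漢字典 第四版.mdx.txt', 'r', encoding='utf-8') as f:
        d = {}   # definition HTML -> first headword seen with that definition
        a = []   # @@@LINK records for headwords whose definition is a duplicate
        i = 0    # lines read so far, for the progress display
        for raw in f:
            line = raw.strip()
            if not line.startswith('<'):
                # Not markup, so this line is a headword.
                hw = line
            elif line.startswith('<html>'):
                if line not in d:
                    # First time this definition appears: keep it.
                    d[line] = hw
                elif hw != d[line]:
                    # Same definition under another headword: redirect instead.
                    at = f'{hw}\n@@@LINK={d[line]}\n</>\n'
                    if at not in a:
                        a.append(at)
            i += 1
            if i % 3000 == 0:
                print(i, end='\r')
    with open('output.txt', 'w', encoding='utf-8') as f:
        # Unique entries first, then the redirect links.
        f.writelines(f'{v}\n{k}\n</>\n' for k, v in d.items())
        print(len(d), 'entries ...')
        f.writelines(a)
        print(len(a), '@@@LINKs ...')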
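This gives 52656 entries and 10444 redirect links (@@@LINK).
Original txt file: 340 MB; after processing ➔ 85 MB mdx.txt.zip (8.0 MB)
Original mdx file: 23.2 MB; after processing ➔ 10.1 MB output.mdx (10.1 MB)
Could someone check whether there is anything wrong with the way this script handles the data?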
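One way to sanity-check the approach is to run the same logic on a tiny in-memory sample. This is only a minimal sketch, assuming the usual unpacked layout (a headword line, a single `<html>` definition line, then `</>`). The name `dedupe` and the sample entries are made up for illustration. Unlike the script above, it skips blank lines and tracks links in a set so the duplicate check stays fast:

```python
def dedupe(lines):
    """Collapse identical definitions; later headwords become redirects.

    Hypothetical helper for illustration, not part of any MDX tool.
    Returns (first_hw, links): first_hw maps definition HTML to the first
    headword that used it, links is a list of (headword, target) pairs.
    """
    first_hw = {}
    links = []
    seen = set()   # same pairs as `links`, for O(1) duplicate checks
    hw = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue               # assumption: blank lines carry no data
        if not line.startswith('<'):
            hw = line              # plain text line: a headword
        elif line.startswith('<html>'):
            target = first_hw.setdefault(line, hw)
            if target != hw and (hw, target) not in seen:
                seen.add((hw, target))
                links.append((hw, target))
    return first_hw, links


# Made-up sample: two headwords share definition A, and one pair repeats.
sample = [
    '亜\n', '<html>A</html>\n', '</>\n',
    '亞\n', '<html>A</html>\n', '</>\n',
    '哀\n', '<html>B</html>\n', '</>\n',
    '亞\n', '<html>A</html>\n', '</>\n',
]
entries, links = dedupe(sample)
print(len(entries), 'entries,', len(links), 'links')
```

Each pair in `links` would be written out as `headword`, `@@@LINK=target`, `</>`, the same record format the script already produces.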
give feedback = don’t like someone’s work and be a jerk about it
Looks fine to me.
As you wish.
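What a character. When someone publishes a dictionary file with duplicated content and others point out the problem and give feedback, the publisher should be grateful, because it helps improve quality and benefits everyone. But the OP behaves as if this kind of helpful suggestion were an offense, and defends himself by glossing over mistakes and answering beside the point, treating the people on this forum like fools. The same drama already played out in another thread I took part in.
When you talk with people on a forum, having others kindly and amicably point out obvious mistakes and problems is not offensive, rude, or disrespectful in any culture. Many Chinese speakers are stuck behind the wall, but not all of them are bumpkins; quite a few read large international communities such as twitter, reddit, and facebook every day, and they understand the etiquette of discussion perfectly well.
It's not a problem with the forum, nor a matter of cultural difference; it's just his own personal problem. In that case, he shouldn't be interacting with people on a public forum.
Actually it is a cultural difference, and there is also a matter of social etiquette. If I were the one describing this issue, I would reply like this: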
Hi OP,
Really appreciate you sharing this!
While looking through the .mdx file, I noticed what appears to be some duplicate content in a few places. Is this an issue with the raw data, or did something go wrong during processing?
Thanks again for your work!
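Opening and closing with thanks is a common template in open-source communities when asking an author a question; many people have probably seen it on Github & Reddit. At first I couldn't understand why their replies had to be so humble either. Later I realized it is not only a matter of social etiquette: some authors are used to seeing things from the position of a benefactor. The logic you describe, that the author should be grateful when others point out problems, does not hold for them; on the contrary, that kind of behavior goes against their expectations. You have to squarely take the position of the one receiving charity before you have any chance of communicating with them normally.
I have to say, a lot of issues on github are written this way... hey, you actually have a point.
I had already closed my browser, but I still wanted to come back and add one more thing: English-speaking communities really do love putting a "Thanks for your work!" at the beginning and at the end.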