Oubunsha oukogo 旺文社古語辞典第十版

anon38366553 · 2024 年6 月 18 日 04:03

does anyone have that dictionary?

anon38366553 · 2025 年5 月 1 日 03:09

finally found it.

anon38366553 · 2025 年5 月 1 日 16:50

data extracted from the scanned pdf:

now i think the hard part will be the regex

mixivivo · 2025 年5 月 2 日 09:01

what is the best OCR software for Japanese? or which software did you use to extract data from the 旺文社古語辞典? thanks.

anon38366553 · 2025 年5 月 2 日 15:02

it was already OCRed.

mixivivo · 2025 年5 月 2 日 16:22

Yeah, I see. “Producer: ABBYY finereader 12”. It seems there are many errors in the extracted text.

anon38366553 · 2025 年5 月 2 日 16:23

there aren’t errors.
I just need to normalize the utf8 data
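
For readers unfamiliar with the term, here is a minimal sketch of Unicode normalization using Python's standard `unicodedata` module. It assumes NFKC is the form meant here, which the thread does not specify; `normalize_text` is a hypothetical helper name. NFKC folds full-width letters and digits and half-width kana into their standard forms. It does not turn one distinct kanji into another.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization (assumed form; the thread does not name one).

    Folds compatibility variants such as full-width ASCII and half-width
    kana into their standard forms and composes voiced-sound marks.
    """
    return unicodedata.normalize("NFKC", text)

# Full-width letters/digits and half-width kana are folded.
print(normalize_text("ＡＢＢＹＹ　１２"))
print(normalize_text("ｶﾞ"))
# Distinct kanji are left untouched by normalization.
print(normalize_text("古琪記"))
```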

mixivivo · 2025 年5 月 2 日 16:38

討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟
橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻

anon38366553 · 2025 年5 月 2 日 16:48

selecting with the mouse will (obviously) give imprecise data, that’s why i wrote a python script to extract the data from the pdf

mixivivo · 2025 年5 月 2 日 16:56

討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻

和你说的这些无关，对比一下字符就知道了，因为图像底本不佳，也可能因为finereader性能有限，识别结果有不少文字错误。这种错误率的词典基本是没法用的，没必要在它上面浪费时间，建议寻找更好的图像底本。

anon38366553 · 2025 年5 月 2 日 19:24

with a scanned dictionary, the definition of a word can continue onto the next page, that’s normal.

if you pay attention to what I said before, the right way to get things in the correct order is to use regex for formatting.
as for the 「吾妻 at the end, it’s because this text editor doesn’t show the rest of the line. ofc there’s more content in there, or do you think a single page would have only that little text? don’t play dumb.
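
A minimal sketch of the kind of regex clean-up mentioned above, applied to the two-line excerpt quoted earlier in the thread. The rule used here (remove a line break unless the previous character is sentence-ending punctuation or another line break) is an assumption for illustration, not the poster's actual code. `join_wrapped_lines` is a hypothetical helper.

```python
import re

# Match a single line break that does not follow sentence-ending
# punctuation and is not part of a blank-line paragraph separator.
_WRAP = re.compile(r"(?<![。」！？!?\n])\n(?!\n)")

def join_wrapped_lines(text: str) -> str:
    """Rejoin lines that were split mid-sentence by the page layout."""
    return _WRAP.sub("", text)

excerpt = (
    "討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟\n"
    "橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻"
)
print(join_wrapped_lines(excerpt))
```

A real dictionary would likely need headword-aware rules on top of this, since entries do not always end with 。.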

putting the text side by side just shows that the extraction was successful. so you pasted a piece, didn’t notice that it was fully correct, and judged that there wasn’t “more”? even tho it’s obvious from the picture that the line had more content? …

translation of your Chinese text:
It has nothing to do with what you said. Comparing the characters shows that the recognition result contains quite a few wrong characters, because the source scan is poor and possibly because FineReader's accuracy is limited. A dictionary with this error rate is basically unusable; there is no point in wasting time on it, and I'd suggest looking for a better source scan.

wasting time on it

分からん野郎だな。 (you just don’t get it, do you.)
stop inferring too much about the whole thing by just looking at a single picture.

anon38366553 · 2025 年5 月 2 日 19:32

that’s the wrong page tho, smart ass.

the code will count the cover of the pdf as page 1.
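
A small sketch of the page-offset issue described here: if the extraction counts the cover as page 1, PDF page indices run ahead of the numbers printed in the book. The function name and the `front_matter_pages` parameter are hypothetical, and the real offset for this PDF is not stated in the thread.

```python
def printed_page(pdf_page: int, front_matter_pages: int = 1) -> int:
    """Map a 1-based PDF page index to the page number printed in the book.

    front_matter_pages is the number of unnumbered pages (cover etc.)
    that precede printed page 1; the default of 1 is only an assumption.
    """
    if pdf_page <= front_matter_pages:
        raise ValueError("page belongs to the front matter and has no printed number")
    return pdf_page - front_matter_pages

print(printed_page(10))
```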

anon38366553 · 2025 年5 月 2 日 19:44

also, where did you see that?
that has nothing to do with it.
the pdf already had text.

proof:

mixivivo · 2025 年5 月 2 日 20:54

So you can’t see the clear difference in the text in the following two places?

古事記 ↔ 古琪記

峠妃橘溝 ↔ 峠)で、妃の弟橘媛

That’s fine, you like wasting your time, go ahead and waste it, but I don’t have that much time to spare.
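
A character-level comparison like the one above can be reproduced with Python's standard `difflib`. This is a minimal sketch; `char_diff` is a hypothetical helper, and the reference string has to come from a trusted reading of the page.

```python
import difflib

def char_diff(ocr: str, reference: str):
    """Return (tag, ocr_part, reference_part) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, ocr, reference, autojunk=False)
    return [
        (tag, ocr[i1:i2], reference[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(char_diff("古琪記", "古事記"))  # [('replace', '琪', '事')]
```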

anon38366553 · 2025 年5 月 2 日 22:43

lmfao, are you trolling?
you can’t identify the character?

anon38366553 · 2025 年5 月 2 日 22:50

i’ll assume you’re doing it on purpose (aka trolling).
there’s no way that someone can’t recognize this.

can you explain how you saw a “琪” in there???

also, i mentioned “text normalization”.
but i won’t explain. people who know mecab and similar tools know that text can easily be repaired with them.

do us all a favor and stay out.
邪魔するな (don’t get in the way)

anon38366553 · 2025 年5 月 2 日 22:57

i’m also working on an AI that can correct japanese phrases:

exported the model to onnx, and will keep working on it (this was already planned).
what you call “waste of time” , we call work.
stay in bed doing nothing all day, but make sure not to stop people from working on what they want.
i have no empathy for lazy people.

hua · 2025 年5 月 3 日 02:05

Check your own reply in #3, isn’t it 古琪記 in line 130 of your result?

I am waiting for your reply, and I will take some admin action regarding your words.

anon38366553 · 2025 年5 月 3 日 23:58

he was obviously trolling.
as someone said before “let those who do the work decide how they do it”

people get discouraged from sharing dictionaries if they’re treated like that.

hua · 2025 年5 月 4 日 01:14

You are obviously not answering my question. That’s fine, come back in two weeks then.

From what I understand, what @mixivivo said (i.e. that there are errors in the OCRed text, such as in 古事記) is correct. This is reflected in your contradictory responses in #3 (line 130) and #16. I do not think @mixivivo is trolling here. Or do you mean that MeCab can repair an error like 古事記 ↔ 古琪記? Anybody is welcome to point out if I am wrong, including you, once your silence ban ends in two weeks.

CC @mixivivo
