Oubunsha oukogo 旺文社 古語辞典 第十版

Does anyone have that dictionary?


finally found

data extracted from the scanned pdf:

now i think the hard part will be the regex


what is the best OCR software for Japanese? or which software did you use to extract data from the 旺文社 古語辞典? thanks.

it was already OCRed.

Yeah, I see. “Producer: ABBYY finereader 12”. It seems there are many errors in the extracted text.

there aren’t any errors.
I just need to normalize the UTF-8 data
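For readers wondering what "normalizing" could look like here, this is a minimal sketch using only Python's standard library. It assumes the goal is Unicode compatibility normalization (NFKC) plus removal of zero-width characters; the function name `normalize_text` and the sample strings are illustrative and not taken from the actual extraction script.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then drop zero-width characters."""
    normalized = unicodedata.normalize("NFKC", text)
    # Zero-width space and BOM sometimes survive PDF text extraction.
    return normalized.replace("\u200b", "").replace("\ufeff", "")


# Full-width parentheses become ASCII ones; half-width kana become full-width.
print(normalize_text("碓日（うすい）の坂"))
print(normalize_text("ｶﾅ"))
```

Normalization of this kind only unifies variant code points for the same character. It leaves a string such as 古琪記 unchanged, because 琪 is an ordinary, distinct kanji.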

討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟
橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻

selecting text with the mouse will (obviously) give imprecise data; that’s why I wrote a Python script to extract the data from the PDF

11
討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻
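The difference between the mouse-selected text and the script output above is that the hard line break is gone. A minimal sketch of that joining step, assuming the extracted page arrives as a list of lines (the helper name `join_wrapped_lines` is hypothetical, and this is not the actual extraction code):

```python
def join_wrapped_lines(lines):
    """Concatenate hard-wrapped lines, dropping surrounding whitespace.

    Japanese text has no spaces between words, so the pieces are joined
    without a separator.
    """
    return "".join(line.strip() for line in lines)


wrapped = [
    "討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、妃の弟",
    "橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻",
]
print(join_wrapped_lines(wrapped))
```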

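(translated from Chinese) That has nothing to do with what you're describing. Compare the characters and you'll see: because the source scan is poor, and possibly because FineReader's accuracy is limited, the recognized text contains quite a few wrong characters. A dictionary with this error rate is basically unusable, so there's no point wasting time on it; I'd suggest finding a better source scan.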


in a scanned dictionary, the definition of a word can continue onto the next page; that’s normal.

if you had paid attention to what I said before: the right way to get things in the correct order is to use regex for formatting.
as for the 「吾妻 at the end, that’s because this text editor doesn’t show the rest of the line. of course there’s more content there, or do you think a single page would have only that little text? don’t play dumb.
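As an illustration of the kind of regex work involved, here is a small sketch that pulls kanji(reading) pairs out of the extracted line. The pattern is an assumption based only on the sample text shown in this thread (a run of kanji followed by hiragana in ASCII parentheses); the real formatting rules for the whole dictionary would need more than this.

```python
import re

# A run of CJK ideographs followed by a hiragana reading in ASCII parentheses.
READING = re.compile(r"([\u4e00-\u9fff]+)\(([\u3041-\u309f]+)\)")


def extract_readings(text):
    """Return (kanji, reading) pairs found in the text."""
    return READING.findall(text)


line = (
    "討の帰途、碓日(うすい)の坂(「古事記」では足柄(あしがら)峠)で、"
    "妃の弟橘媛(おとたちばなひめ)が自分のために投身したのを悲しみ、「吾妻"
)
for kanji, reading in extract_readings(line):
    print(kanji, reading)
```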

putting the text side by side just shows that the extraction was successful. so you pasted a piece, didn’t notice that it was fully correct, and decided there wasn’t “more”? even though it’s obvious from the picture that the line had more content? …

translation of your chinese text:
It has nothing to do with what you said, comparing the characters shows that there are quite a few text errors in the recognition results because of the poor image substrate, and possibly because of the limited performance of the finereader. A dictionary with this error rate is basically unusable, there is no point in wasting time on it, and it is advisable to look for better image substrates.

wasting time on it

you really don't get it, do you.
stop inferring too much about the whole thing by just looking at a single picture.

that’s the wrong page tho, smart ass.

the code will count the cover of the pdf as page 1.

also, where did you see that?
that has nothing to do with it.
the pdf already had text.

proof:

So you can’t see the clear difference in the text in the following two places?

古事記 ↔ 古琪記

峠妃橘溝 ↔ 峠)で、妃の弟橘媛
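A character-level comparison like the one above can also be done mechanically. The following is a minimal sketch with Python's standard `difflib`; the helper name `char_diffs` is made up for illustration, and the inputs are the two strings quoted in this post.

```python
import difflib


def char_diffs(expected, actual):
    """Return (expected_part, actual_part) pairs wherever the strings differ."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(char_diffs("古事記", "古琪記"))  # [('事', '琪')]
```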

That’s fine, you like wasting your time, go ahead and waste it, but I don’t have that much time to spare.


lmfao, are you trolling?
you can’t identify the character?

I’ll assume you’re doing it on purpose (aka trolling).
there’s no way that someone can’t recognize this.

can you explain how you saw a “琪” in there???

also, I mentioned “text normalization”.
but I won’t explain; people who know MeCab and other tools know that text can easily be repaired with them.

do us all a favor and stay out of it.
don't get in the way.

I’m also working on an AI that can correct Japanese sentences:

I exported the model to ONNX and will keep working on it (that was already planned).
what you call a “waste of time”, we call work.
stay in bed doing nothing all day if you like, but don’t stop people from working on what they want.
I have no empathy for lazy people.

Check your own reply in #3: isn’t it 古琪記 in line 130 of your result?


I am waiting for your reply, and then I will take some admin action regarding your words.

he was obviously trolling.
as someone said before “let those who do the work decide how they do it”

people get discouraged from sharing dictionaries if they’re treated like that.

You are obviously not answering my question. That’s fine, come back in two weeks then.


From what I understand, what @mixivivo said (that the OCRed text contains errors, such as in 古事記) is correct. This is shown by your contradictory responses in #3 (line 130) and #16. I do not think @mixivivo is trolling here. Or do you mean that MeCab can repair an error like 古事記 ↔ 古琪記? Anybody is welcome to point out if I am wrong, including you, whose silence ban will end in two weeks.

CC @mixivivo
