While searching for a basic English dictionary, I came across the Longman New Junior English Dictionary (NJED). A downloadable PDF is available at Longman New Junior English Dictionary (NJED) - Anna’s Archive. However, no MDict mdx exists for this dictionary.
Could anybody in this forum help convert this PDF into a searchable MDict mdx? Or could anybody share some tips on how to convert such a PDF into an MDict mdx efficiently, for example by OCR or by taking screenshots?
Recently I have been reading about Longman dictionaries and happened upon a PDF released by the Longman department in Hong Kong. The PDF lists recommended dictionaries for different stages of learning/schooling.
Notably,
Longman Dictionary of Contemporary English for universities
Longman Active Study Dictionary for high schools
Longman New Junior Dictionary for primary schools
There are already mdx files made for 2 of the 3. Hopefully, the last one can be made in the future with better source data.
I have tried generating OCR text for the above-mentioned PDF. However, because of the poor quality of the PDF, especially the black vertical line on every page, the OCR text is also of very poor quality and therefore not usable for making an mdx.
True, the poor quality of the PDF results in poor OCR text.
Unfortunately, this is a rather old book that seems to have been discontinued by Longman. It is not easy to get a paper copy.
With the advancement of technology, I am somewhat optimistic that a perfect OCR text will be possible in a few years. (If I can read it, I see no reason why a machine cannot read it better, especially since it is a carefully curated dictionary with a rather limited defining vocabulary. Anyway, these days machines are already beginning to be somewhat artificially intelligent.)
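One preprocessing idea that might help before OCR, which I have not tested on this PDF: blank out pixel columns that are dark almost from top to bottom, since ordinary text never fills a whole column. Below is a minimal sketch in plain Python. It assumes each page has already been exported as a grayscale grid (a list of rows of 0-255 values); the function name `whiten_dark_columns` and both thresholds are my own illustrative choices, not part of any OCR tool.

```python
def whiten_dark_columns(pixels, dark_threshold=64, min_dark_fraction=0.9, white=255):
    """Return a copy of a grayscale page with near-solid dark columns set to white.

    pixels: list of rows, each row a list of ints in 0..255 (0 = black).
    A column is treated as part of the scan artifact when at least
    min_dark_fraction of its pixels are darker than dark_threshold.
    Both thresholds are guesses and would need tuning on real scans.
    """
    if not pixels:
        return []
    height = len(pixels)
    width = len(pixels[0])

    # Find columns that are dark over (almost) the full page height.
    bad_columns = set()
    for x in range(width):
        dark = sum(1 for y in range(height) if pixels[y][x] < dark_threshold)
        if dark / height >= min_dark_fraction:
            bad_columns.add(x)

    # Build a new grid; the input is left untouched.
    return [
        [white if x in bad_columns else value for x, value in enumerate(row)]
        for row in pixels
    ]
```

Note that if the line runs through printed text, the letters under it are erased as well, so this can only reduce noise, not recover what the line covers. It also assumes the line is close to vertical; a skewed scan would need deskewing first.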
Thanks a lot. By the way, I have been searching for such a basic/small/compact dictionary, hoping to use its content in my flashcards. Of course, since I know where the source text comes from, I will take extra care when reading through it.
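For the reference of fellow enthusiasts who are interested in making a high-resolution scan:
The third edition of the English original was published in 2002 and renamed Longman Basic English Dictionary. It can still be bought on the major e-commerce platforms, but because it went out of print long ago, stock keeps shrinking and prices are fairly high.
The English-Chinese bilingual edition is titled 《朗文初阶英汉双解词典》 and was published by Shanghai Foreign Language Education Press in 2017. The 32mo edition has ISBN 9787544645386 and the 64mo edition has ISBN 9787544645690; both are currently on sale. The cover of the bilingual edition still gives the English title as Longman New Junior Dictionary, but the copyright page states "Original title: Longman Basic English Dictionary".
The Longman elementary bilingual dictionary published years ago by the Foreign Language Teaching and Research Press (FLTRP) in Beijing was indeed the bilingual version of the second edition of this Longman New Junior Dictionary.
After the rights to the bilingual version of the third edition went to Shanghai Foreign Language Education Press, FLTRP and Pearson collaborated on a different dictionary, 《朗文当代初级英语辞典(英英·英汉双解)》, which was adapted from 《朗文当代高级英语辞典(英英·英汉双解)》 (English original: LDOCE).
As for FLTRP's 《朗文高阶英汉双解词典》, its English original is the Longman Advanced American Dictionary, an American English dictionary comparable in size to LDOCE, which has nothing to do with this Longman New Junior Dictionary.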
I have now used the text to supplement some of my flashcards, for a quick grasp of the essentials of the corresponding words. The content works well with a text editor's search function, but it is a bit inconvenient to tell different pieces of text apart when an idiom or phrase (incl. phrasal verbs) begins a definition line, because the boldface information is lost in the OCR text.
Is it possible to tell Gemini to keep such information, for example by adding some markup, or even by asking it to export HTML instead of plain text? Thanks again for your earlier help in making the OCR text.
Sure, Gemini can directly output formats like HTML and JSON, but the tradeoff is that if you require more complex functionality, there’s a high chance it could break something, leading to more errors, etc.
I think for OCR, text accuracy is the most important consideration. Also, since this dictionary has a relatively small vocabulary, it’s easy to fix formatting issues on your own.
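A possible middle ground, offered only as a sketch: ask the model for plain text with one lightweight marker for bold (here I assume `**...**`), then turn that into HTML yourself with a small script. The example below also writes the result in the plain-text source layout that, as far as I know, MdxBuilder expects (headword line, content line, then a line containing `</>`). The marker convention, the `(headword, definition)` input shape, and the function names are all my own assumptions, not something Gemini or MDict prescribes.

```python
import html
import re

# Assumed convention: the OCR text marks bold phrases as **like this**.
BOLD = re.compile(r"\*\*(.+?)\*\*")


def definition_to_html(text):
    """Escape HTML special characters, then turn **phrase** into <b>phrase</b>."""
    escaped = html.escape(text, quote=False)  # leaves '*' untouched
    return BOLD.sub(r"<b>\1</b>", escaped)


def to_mdict_source(entries):
    """Build MDict-style source text from (headword, definition) pairs.

    Each entry becomes: headword line, HTML content line, '</>' separator line.
    """
    blocks = []
    for headword, definition in entries:
        blocks.append(f"{headword}\n{definition_to_html(definition)}\n</>")
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [("give", "**give up** to stop trying to do something")]
    print(to_mdict_source(sample), end="")
```

Keeping the model's job to text plus one simple marker leaves text accuracy as the priority, and the formatting step stays small enough to check and fix by hand.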