Longman New Junior English Dictionary mdx

In relation to the following posts:

.

While searching for a basic English dictionary, I came across the Longman New Junior English Dictionary (NJED). A downloadable PDF is available at Longman New Junior English Dictionary (NJED) - Anna’s Archive. However, no MDict mdx exists for this dictionary.

Could anybody in this forum help convert this PDF into a searchable MDict mdx? Or could anybody share tips on how to convert such a PDF into an MDict mdx efficiently, for example by OCR or by taking screenshots of the pages?


If you are keen, make an index of the book’s headwords:

Headword|page number
A|1
aback|1
abbess|2

Save the index in a Unicode text file.

Or you may enter the index in an xls (Excel) file.

Upload the index to this forum and seek help.
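As a rough sketch of what a helper could then do with such an index (an illustration only, not an established workflow): the Python below parses `Headword|page number` lines and emits MDict source text in which each entry points to a scanned page image. The image naming scheme (`0001.png`), the bare `<img>` reference, and the function names are assumptions. The headword / body / `</>` layout is the plain-text source format commonly fed to MdxBuilder.

```python
def parse_index(text):
    """Parse 'Headword|page number' lines into (headword, page) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # skip blank or malformed lines
        headword, _, page = line.rpartition("|")
        headword, page = headword.strip(), page.strip()
        if not headword or not page.isdigit():
            continue  # skips the 'Headword|page number' header row
        entries.append((headword, int(page)))
    return entries


def to_mdict_source(entries, image_pattern="{:04d}.png"):
    """Build MDict source text: headword, body, then a '</>' separator line."""
    blocks = []
    for headword, page in entries:
        image = image_pattern.format(page)  # hypothetical page-image file name
        blocks.append(f'{headword}\n<img src="{image}">\n</>')
    return "\n".join(blocks) + "\n"


sample = "Headword|page number\nA|1\naback|1\nabbess|2\n"
print(to_mdict_source(parse_index(sample)))
```

The page images themselves would still have to be packed into a companion mdd file, and the exact image path syntax depends on the MDict reader in use.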

Another method:

1. Use FineReader to OCR the PDF and save the OCR text with the original page numbers, one text file per page: 0001.txt, 0002.txt, etc.

2. Mark the headwords with tags in the text.

【A】
【aback】
【abbess】

Upload the marked files in zip format to this forum and seek help.
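For illustration, here is a minimal sketch of how files marked this way could be split into entries. It assumes every headword is wrapped in 【】 exactly as in the example above; the function name and the sample definitions are invented for the demonstration.

```python
import re

# A headword tag such as 【aback】; the capture group keeps the headword itself.
HEADWORD_TAG = re.compile(r"【(.+?)】")


def split_entries(page_text):
    """Split one marked page into (leading_text, [(headword, body), ...])."""
    # re.split with one capture group returns
    # [text before first tag, headword1, body1, headword2, body2, ...]
    parts = HEADWORD_TAG.split(page_text)
    leading = parts[0].strip()  # text before the first tag, e.g. a carried-over entry
    entries = [
        (parts[i].strip(), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]
    return leading, entries


page = "tail of previous entry\n【aback】adv surprised\n【abbess】n head of a convent\n"
leading, entries = split_entries(page)
print(leading)
print(entries)
```

Returning the leading text separately makes it possible to append it to the last entry of the previous page instead of silently dropping it.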


Recently, I have been reading about Longman dictionaries and by chance came across a PDF released by Longman's Hong Kong department. The PDF lists recommended dictionaries for different stages of learning and schooling.

Notably,

  • Longman Dictionary of Contemporary English for universities
  • Longman Active Study Dictionary for high schools
  • Longman New Junior Dictionary for primary schools

mdx files already exist for two of the three. Hopefully, the last one can be made in the future from better source data.

I have tried generating OCR text from the PDF mentioned above. However, because of the poor quality of the PDF, especially the black vertical line on every page, the OCR output is also of very poor quality and not usable for making an mdx.

More details are in the following page:

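The image quality of this dictionary's PDF is very poor. Even with fairly advanced OCR tools there will be quite a lot of errors, which need extensive manual editing and proofreading. Running OCR on the whole book is not the problem; the difficulty lies in the subsequent correction and editing. If possible, first try to scan a clear master copy, which would make the whole workflow simpler and less laborious.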

True, the poor quality of the PDF results in poor OCR text.

Unfortunately, this is a rather old book that seems to have been discontinued by Longman, so it is not easy to get a paper copy.

With the advancement of technology, I am fairly optimistic that a perfect OCR text will be possible within a few years. (If I can read it, I see no reason why a machine cannot read it better, especially since it is a carefully curated dictionary with a rather limited vocabulary. After all, machines these days are starting to show some real artificial intelligence.)

Thanks for your effort working on this.

The resulting text that you posted in the other thread already looks wonderful. Would it be possible for you to produce the text for the full set of PDF pages?

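I am running the recognition now, please wait. Overall the OCR result looks quite good, but there are some errors, for example the phonetic symbol ʒ being recognized as 3. There may also be missing content (especially at the top of each column, where the text is sometimes not a separate entry but a continuation of the preceding one, and it occasionally gets lost or garbled) or arbitrarily added content (an inherent weakness of LLMs). I will not take on the editing, proofreading, and correction; I am only providing the raw OCR txt here.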
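One cheap check for the ʒ-read-as-3 kind of error is to flag any phonetic transcription that contains a digit. The sketch below assumes the phonetics in the OCR text are enclosed in slashes, which has not been verified against the actual file; the function name and sample lines are made up.

```python
import re

# Assumed convention: phonetics appear between slashes, e.g. /ˈmeʒə/.
PHONETIC = re.compile(r"/([^/\n]+)/")


def flag_suspect_phonetics(text):
    """Return (line number, transcription) pairs whose transcription contains a digit."""
    suspects = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PHONETIC.finditer(line):
            if any(ch.isdigit() for ch in match.group(1)):
                suspects.append((lineno, match.group(0)))
    return suspects


sample = "measure /ˈme3ə/ n an amount\nabbey /ˈæbi/ n a building\n"
print(flag_suspect_phonetics(sample))
```

This only produces a list for a human to review. Lines containing fractions or dates written with slashes would show up as false positives.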


The OCR is done, but I have not proofread it or gone through it page by page for possible errors, so please use it with caution.

Longman New Junior English Dictionary (OCR).txt (1.0 MB)

Thanks a lot. By the way, I have been looking for such a basic, compact dictionary in order to use its content in my flashcards. Since I know where the source text comes from, I will take extra care when reading through it.

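For anyone interested in making a high-resolution scan:
The 3rd edition of the English original was published in 2002 and renamed Longman Basic English Dictionary. It can still be bought on the major e-commerce platforms, but since it has long been out of print, the remaining stock keeps shrinking and the price is fairly high.
The English-Chinese bilingual edition is titled 《朗文初阶英汉双解词典》 and was published by Shanghai Foreign Language Education Press (上海外语教育出版社) in 2017. The 32mo edition has ISBN 9787544645386 and the 64mo edition has ISBN 9787544645690; both are currently on sale. The English title on the cover of the bilingual edition still reads Longman New Junior Dictionary, but the copyright page states “Original title: Longman Basic English Dictionary”.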

For reference:

Chinese [zh], .pdf, :rocket:/upload, 104.5MB, :green_book: Book (unknown), upload/duxiu_main2/【星空藏书馆】/【星空藏书馆】等多个文件/分享阁(014)/综合类书库(046)/PDF涔〉簱4.6T/2022更新等多个文件/01/1/1-4/朗文初级英汉双解词典 第二版_10432828__北京市:外语教学与研究出版社_2001.pdf
朗文初阶英汉双解词典 第2版
北京:外语教学与研究出版社, 2001
培生教育出版北亚洲有限公司词典部编

Chinese [zh], English [en], .pdf, :rocket:/lgli/lgrs/nexusstc/zlib, 2.0GB, :blue_book: Book (non-fiction), nexusstc/朗文高阶英汉双解词典 新版 / Longman Advanced American Dictionary/c14ef58d4cb7aa694f016f442890cdef.pdf
朗文高阶英汉双解词典 新版 / Longman Advanced American Dictionary
外语教学与研究出版社, 2, 2013
英国培生教育出版集团

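OCR inevitably makes mistakes, and phonetic symbols are even more error-prone. The phonetics can instead be imported from a reliable headword list that includes phonetics, so they do not need to be proofread.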
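A minimal sketch of that idea, assuming each entry line starts with the headword followed by a transcription in slashes, and that a trusted headword-to-phonetics mapping has already been prepared from some reliable source. Both the line format and the `trusted` dictionary are assumptions made for the example.

```python
import re

# Assumed entry shape: "headword /phonetics/ rest of the entry".
ENTRY = re.compile(r"^(?P<head>[A-Za-z][A-Za-z' -]*?)\s*/(?P<ipa>[^/\n]+)/")


def replace_phonetics(lines, trusted):
    """Overwrite OCR phonetics with trusted ones where the headword is known."""
    fixed = []
    for line in lines:
        match = ENTRY.match(line)
        if match and match.group("head").lower() in trusted:
            ipa = trusted[match.group("head").lower()]
            # Keep everything after the closing slash unchanged.
            line = f"{match.group('head')} /{ipa}/{line[match.end():]}"
        fixed.append(line)
    return fixed


trusted = {"measure": "ˈmeʒə"}  # hypothetical trusted phonetics
ocr_lines = ["measure /ˈme3ə/ n an amount", "abbey /ˈæbi/ n a building"]
print(replace_phonetics(ocr_lines, trusted))
```

Headwords missing from the trusted list are left untouched, so they would still need manual checking. Transcription conventions can also differ between dictionaries, so the imported phonetics may not match the printed ones exactly.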

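The elementary bilingual Longman dictionary (朗文初阶双解) published years ago by FLTRP (外研社) in Beijing is indeed the bilingual edition of the 2nd edition of this Longman New Junior Dictionary.
After the rights to the bilingual edition of the 3rd edition went to Shanghai Foreign Language Education Press (上海外教社), FLTRP and Pearson jointly produced a different book, 《朗文当代初级英语辞典(英英·英汉双解)》, which was adapted from 《朗文当代高级英语辞典(英英·英汉双解)》 (English original: LDOCE).
As for FLTRP's 《朗文高阶英汉双解词典》, its English original is Longman Advanced American Dictionary, an American English dictionary comparable in size to LDOCE, and it has nothing to do with this Longman New Junior Dictionary.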

I have now used the text to supplement some of my flashcards, for a quick grasp of the essentials of the corresponding words. The text works well with a text editor's search function, but it is a bit inconvenient to tell different pieces of text apart when an idiom or phrase (including phrasal verbs) comes at the beginning of a definition line, because the boldface is lost in the OCR text.
Would it be possible to tell Gemini to keep such information, for instance by adding some markup, or even to export HTML instead of plain text? Thanks again for your earlier help with the OCR text.

Sure, Gemini can directly output formats like HTML and JSON, but the tradeoff is that if you require more complex functionality, there’s a high chance it could break something, leading to more errors, etc.

I think for OCR, text accuracy is the most important consideration. Also, since this dictionary has a relatively small vocabulary, it’s easy to fix formatting issues on your own.
