Generating dual-layer PDFs with PaddleOCR

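Update 05.09

  • Added a reocr script that processes only specific pages of a PDF.

  • Fixed a problem with vertical text.

Update 04.28

  • Testing showed that the larger server model now brings very little marginal benefit, so the small model is used by default.

  • The default font now has a larger character set, to avoid font errors in the PDF.

  • Fixed several problems in PDF generation.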


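Before you start:

  • This script is for users who want to accelerate OCR with an NVIDIA GPU. If you can only run on a CPU, the integrated, C-language-based Umi-OCR is probably a better choice. If you want to run on an AMD or Intel GPU, you will have to explore that yourself. Because CUDA configuration depends heavily on the GPU model, the script cannot be shipped as a packaged release; you need to set up the environment yourself as described below.

  • Known issues:

    • The dual-layer PDFs produced by this project are meant to make scans searchable. No layout analysis is performed (multiple columns, tables, headers and footers, and so on), so copied text is likely to come out jumbled.

    • In mixed Chinese and English text, spaces in the English are sometimes misrecognized, and several words often run together. This seems to be a PaddleOCR problem and cannot be fixed for now.

    • Because of model limitations, documents that mix Chinese with languages other than English and Japanese cannot be processed.

  • Incidentally, at the current pace of development, AI will probably make scan-only books a thing of the past within the next year or two. It may be worth asking whether fiddling with dual-layer PDFs is still worthwhile.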

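Special thanks

The dual-layer PDF generation is based on the implementation in hiroi-sora/Umi-OCR.

Setting up the environment

First install Python. I tested with Python 3.10; in theory, 3.10 to 3.12 should all work.

Create a virtual environment in any folder: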


python3.10 -m venv .venv

Activate the virtual environment (Windows shown here):


.venv\Scripts\activate

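Install the GPU build of PaddlePaddle. I tested with 3.2.0. First run nvidia-smi to check which CUDA version is supported; if it is lower than 12.9, change cu129 to cu126 or cu118. For 50-series GPUs you may need to consult this page: 飞桨框架安装 - PaddleOCR 文档 (PaddlePaddle framework installation, PaddleOCR documentation)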


python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
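
The choice between cu129, cu126 and cu118 can be sketched in a few lines of Python. This is only an illustration, not part of the attached scripts: the helper names are hypothetical, the URL pattern is copied from the command above, and the 12.6 and 11.8 thresholds are my assumption based on the tag names.

```python
import re

# URL pattern copied from the install command above. The cu126 / cu118
# thresholds are an assumption derived from the tag names, not from Paddle docs.
INDEX_URL = "https://www.paddlepaddle.org.cn/packages/stable/{tag}/"
CUDA_TAGS = [((12, 9), "cu129"), ((12, 6), "cu126"), ((11, 8), "cu118")]


def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_index_url(cuda_version):
    """Return the index URL for the newest tag the driver supports, or None."""
    for minimum, tag in CUDA_TAGS:
        if cuda_version >= minimum:
            return INDEX_URL.format(tag=tag)
    return None
```

To use it, pass the text printed by nvidia-smi to parse_cuda_version and the resulting tuple to pick_index_url. A result of None means the driver reports a CUDA version older than 11.8, in which case none of the three index URLs mentioned above applies.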

Install PaddleOCR and pymupdf.


pip install paddleocr pymupdf

Place the scripts attached to this post in that folder. With the virtual environment activated, you can run them like this:


python paddle_pdf.py [arguments]

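Command-line arguments

Required argument

  • Path to the input PDF file

Optional arguments

  • -o, --output: path of the output PDF file

    • Defaults to {input file base name}_ocr.pdf in the same directory

  • --model: the OCR recognition model to use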

| Value | PaddleOCR model | Main supported languages |
|--------|---------------------|-------------|
| dft (default) | PP-OCRv5_mobile_rec | Chinese, English, Japanese |
| ko | korean_PP-OCRv5_mobile_rec | Korean, English |
| ltn | latin_PP-OCRv5_mobile_rec | Latin-script languages (French, German, Spanish, etc.), English |
| cyr | eslav_PP-OCRv5_mobile_rec | Russian and other Slavic languages, English |
| gk | el_PP-OCRv5_mobile_rec | Greek, English |
| en | en_PP-OCRv5_mobile_rec | English (optimized for English-only documents) |
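
For readers who want to see how these arguments fit together, here is a minimal argparse sketch of the interface described above. It is not the code from paddle_pdf.py: build_parser, resolve_args and rec_model are hypothetical names, and only the flags, the default output name and the model table are taken from this post.

```python
import argparse
from pathlib import Path

# Mapping copied from the table above.
MODELS = {
    "dft": "PP-OCRv5_mobile_rec",
    "ko": "korean_PP-OCRv5_mobile_rec",
    "ltn": "latin_PP-OCRv5_mobile_rec",
    "cyr": "eslav_PP-OCRv5_mobile_rec",
    "gk": "el_PP-OCRv5_mobile_rec",
    "en": "en_PP-OCRv5_mobile_rec",
}


def build_parser():
    """Build a parser with the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="paddle_pdf.py")
    parser.add_argument("input", type=Path, help="path to the input PDF file")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="path of the output PDF file")
    parser.add_argument("--model", choices=sorted(MODELS), default="dft",
                        help="OCR recognition model to use")
    return parser


def resolve_args(argv):
    """Parse argv, then fill in the default output path and the model name."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Same directory as the input, named {input base name}_ocr.pdf.
        args.output = args.input.with_name(f"{args.input.stem}_ocr.pdf")
    args.rec_model = MODELS[args.model]
    return args
```

Calling resolve_args(["input.pdf"]) gives an output path of input_ocr.pdf and the PP-OCRv5_mobile_rec model, which matches the basic usage shown below.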

Usage examples

Basic usage


python paddle_pdf.py input.pdf

This generates input_ocr.pdf using the default model (Chinese/English/Japanese).

Specifying the output file name


python paddle_pdf.py input.pdf -o output.pdf

Using a language-specific model


# Process a Korean document
python paddle_pdf.py korean_doc.pdf --model ko

# Process a Russian document
python paddle_pdf.py russian_doc.pdf --model cyr

# Process an English-only document
python paddle_pdf.py english_doc.pdf --model en

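The paddle_reocr.py script re-runs OCR on specific pages of a PDF. It renders each page to an image before recognition, so it is naturally slower; I only recommend it when a small number of pages need to be redone.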


python paddle_reocr.py input.pdf -p "3,5,7,9-11" --model en
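
A page specification like "3,5,7,9-11" can be expanded into individual page numbers as in the sketch below. This is my own illustration rather than the code in paddle_reocr.py; parse_pages is a hypothetical name, and I am assuming that page numbers are 1-based and that ranges include both ends.

```python
def parse_pages(spec):
    """Expand a spec such as "3,5,7,9-11" into a sorted list of page numbers.

    Assumes 1-based page numbers and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

With the spec from the command above, parse_pages("3,5,7,9-11") returns [3, 5, 7, 9, 10, 11]. Duplicates are merged, and a reversed range such as "5-3" raises ValueError.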

A remove_text.py script is also included. It deletes all text from a dual-layer PDF and removes the embedded fonts:


python remove_text.py "文件.pdf" "结果.pdf"

Note that this script removes all text content in the file, so use it with caution.

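Test results

Support for rare characters is decent, but very unusual ones are still not recognized.

Pinyin is recognized well.

Recognition of English mixed with other Latin-script languages:

Download the scripts here: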

paddle_pdf.zip (7.7 KB)

I plan to get a hard copy of the Oxford Learner’s Thesaurus. This is exactly the model I was looking for. I have a CUDA-capable GPU. A quick question: how many times faster is CUDA compared to the CPU?

Of course, this depends on the hardware level of the CPU and GPU. According to info from Paddle OCR (source: 使用教程 - PaddleOCR 文档), generally GPUs will be 3x~8x faster than CPUs.

However, if one wishes to create a textual dictionary, OCR models of this scale may not be sufficient (e.g., PaddleOCR doesn’t seem to support IPA). A larger model PaddleOCR-VL may be a better choice.

I tested with the page images obtained from the internet:

And this is the result with PaddleOCR-VL. In my opinion, it is already quite satisfactory:

## ways of cooking

[the image, omitted]

grill (BrE) / broil (AmE)

## bake verb See also the entry for COOK

take • fry • roast • grill • broil • toast • barbecue

these are all words for ways of cooking food.

## ATTERNS AND COLLOCATIONS

to fry/ roast/ grill/ broil/ barbecue chicken

to fry/ grill/ barbecue a steak

to fry/ grill/ barbecue sausages

to bake/ fry/ grill/ broil fish

to fry/ grill bacon

to bake/ fry/ roast potatoes

to bake/ fry/ toast bread

to roast/ toast nuts

ake [T, I] to cook food, especially bread, cakes and potatoes, in an oven without extra fat or liquid; to be cooked in this way: baked potatoes ◇ I'm baking a birthday cake for Alex. ◇ I'm baking Alex a cake. ◼ Bake is not used about cooking meat; use roast instead. See also baking → COOKING

y [T, I] to cook sth in hot fat or oil; to be cooked in this way: fried fish/eggs ◇ I woke up to the smell of bacon frying.

past [T, I] to cook meat without liquid in an oven or over fire; to cook vegetables in oil or fat in an oven; to cook nuts or beans in order to dry them and turn them brown; to be cooked in any of these ways: You should boil the potatoes for a little before you roast them. The smell of roasting meat came from the kitchen. The past participle of roast is usually roasted, except when it is used before noun, when roast is usually used: roast beef/chicken/potatoes/parsnips However, roasted beef/chicken/potatoes/parsnips However, roasted is used to describe nuts and beans: roasted chestnuts/coffee beans/peanuts

ill [T] to cook food under or over a very strong heat: Grill he sausages for ten minutes, turning occasionally. At night we used to grill steaks over charcoal in the open air. Food can be grilled in an oven under the grill (BrE) or broiler (AmE), or outdoors over a fire; however, it is even more frequent to talk about barbecuing food that is booked outdoors.

broil [T] (AmE) to cook meat or fish under direct heat: We ate broiled chicken with vegetables. In British English use grill for this.



toast [T, I] to make sth, especially bread, turn brown by heating it in a toaster or close to heat; to turn brown in this way: a toasted sandwich ◇ Place under a hot grill until the nuts have toasted.

barbecue /'bɑːbɪkjʊː; AmE 'bɑːrb-/ [T] to cook food on a barbecue (= a metal frame on or over an open fire outdoors): barbecued chicken

## ban noun

## ban · sanction · boycott · embargo · prohibition · moratorium · veto · taboo

These are all words for a rule, order or custom which does not allow people to do sth or for the act of stopping sth from being done.

## PATTERNS AND COLLOCATIONS

a ban/sanctions/a boycott/an embargo/a prohibition/a veto/a moratorium/a taboo on sb/sth

a ban/sanctions/a boycott/an embargo/a prohibition/a veto/a taboo against sb/sth

a total ban/boycott/embargo/prohibition/moratorium

(an) international ban/sanctions/boycott/embargo/moratorium

(a) trade ban / sanctions / boycott / embargo

(an) economic sanctions/ boycott/ embargo

to impose a ban/sanctions/a boycott/an embargo/a prohibition/a moratorium/a veto

to call for/introduce a ban/sanctions/a boycott/a prohibition/a moratorium

to enforce/tighten/ease a ban/sanctions/an embargo

▶ to comply with a ban/sanctions/a prohibition

▶ to break a ban/sanctions/an embargo/a taboo

to lift a ban/sanctions/a boycott/an embargo/a prohibition/a veto

a ban/sanctions/an embargo come/comes into force

ban [C] an official rule that says that sth is not allowed; the fact of sb being officially stopped from doing sth for a period of time as a punishment: There is to be a total ban on smoking in the office. ◇ The students took to the streets, defying a ban on political gatherings. ◇ The sprinter received a lengthy ban for failing a drugs test. ◇ He faces a possible life ban from international football.

sanction /'sæŋkʃn/ [C, usually pl.] an official order that limits trade or contact with a particular country, in order to make it do sth, such as obeying international law: Trade sanctions were imposed against any country that refused to sign the agreement.

boycott/ˈbɔɪkɒt; AmE -kɑːt/ [C] the act of refusing to buy, use or take part in sth as a way of protesting: Opposition groups declared a boycott of the elections. ◇ The group is calling for a consumer boycott of the company’s products. See also boycott → AVOID verb

embargo /ɪmˈbɑːɡəʊ; AmE ɪmˈbɑːrɡəʊ/ [C] an official order that forbids trade with another country, sometimes of a particular type of goods: There is a strict embargo on oil imports. ◇ We knew the arms embargo was being broken.

prohibition /ˌprəʊˈbɪʃn; AmE ˌprəʊəˈb- / ʌ, C] the act of stopping sth being done or used, especially by law; a law or rule that says that sth is not allowed: The RSPB has called for the prohibition of all imports of wild birds.

moratorium /ˌmɒrəˈtɔːriəm; AmE ˌmɔːr- / (pl. mortatoriums or moratoria) [C] an official agreement that an activity must stop for a period of time: The convention called for a two-year moratorium on commercial whaling. veto /ˈviːtəʊ; AmE -toʊ/ (pl. -oes) [C] an occasion when sb uses their right to refuse to allow sth to be done: For months there was a veto on employing new staff. See also veto → REFUSAL noun, veto → REFUSE verb

taboo/tə'bu:/ [C] a cultural or religious custom that does not allow people to do, use or talk about a particular thing as people find it offensive or embarrassing; a

But the styles, such as strikethroughs and italics, are still omitted.

PaddleOCR-VL should be usable either locally or through an API.

Thank you for testing this.

I just found that the latin model can also recognize old German typefaces, and the results are quite good: