Is there a way to extract the OCR layer of a PDF into a separate file?

mixivivo · September 8, 2024, 05:18

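Both "extracting" and "writing" the text layer of a PDF file are fairly easy to implement. For example, here is Python code (using pdfminer) that does the extraction: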

from pathlib import Path
from collections.abc import Iterable
from typing import Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    # Containers (pages, text boxes, text lines) are iterable; recurse into them
    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__


def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return repr(o.get_text().strip())
    return ''


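# Path to the dual-layer PDF (adjust to your own file)
path = Path(r"C:\Users\xxx\Desktop\武器和战争的演变.pdf").expanduser()

# extract_pages() returns a generator of LTPage objects; walk and print the whole tree
pages = extract_pages(path)
show_ltitem_hierarchy(pages)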

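I'm uploading a one-page dual-layer PDF here, 武器和战争的演变.pdf (53.8 KB). The output of running the script on it is below: it contains the text, plus the coordinates of every line and every character.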

武器和战争的演变 OCR文字层（文本和坐标）.txt (47.2 KB)

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator
  LTPage                       0   0   392 576  
    LTTextBoxHorizontal        77  466 334 476  '尽管最初铁的造价十分昂贵，而且产量有限，但是，铁的'
      LTTextLineHorizontal     77  466 334 476  '尽管最初铁的造价十分昂贵，而且产量有限，但是，铁的'
        LTChar                 77  466 88  476  '尽'
        LTChar                 88  466 98  476  '管'
        LTChar                 98  466 108 476  '最'
        LTChar                 108 466 118 476  '初'
        LTChar                 118 466 129 476  '铁'
        LTChar                 129 466 139 476  '的'
        LTChar                 139 466 149 476  '造'
        LTChar                 149 466 159 476  '价'
        LTChar                 159 466 170 476  '十'
        LTChar                 170 466 180 476  '分'
        LTChar                 180 466 190 476  '昂'
        LTChar                 190 466 201 476  '贵'
        LTChar                 201 466 211 476  '，'
        LTChar                 211 466 221 476  '而'
        LTChar                 221 466 231 476  '且'
        LTChar                 231 466 242 476  '产'
        LTChar                 242 466 252 476  '量'
        LTChar                 252 466 262 476  '有'
        LTChar                 262 466 272 476  '限'
        LTChar                 272 466 283 476  '，'
        LTChar                 283 466 293 476  '但'
        LTChar                 293 466 303 476  '是'
        LTChar                 303 466 314 476  '，'
        LTChar                 314 466 324 476  '铁'
        LTChar                 324 466 334 476  '的'
        LTAnno                  ''
    LTTextBoxHorizontal        57  451 335 460  '发现毕竟给古代兵器和战争带来了巨大的影响。到了公元前'
      LTTextLineHorizontal     57  451 335 460  '发现毕竟给古代兵器和战争带来了巨大的影响。到了公元前'
        LTChar                 57  451 68  460  '发'
        LTChar                 68  451 79  460  '现'
        LTChar                 79  451 89  460  '毕'
        LTChar                 89  451 100 460  '竟'
        LTChar                 100 451 111 460  '给'
        LTChar                 111 451 121 460  '古'
        LTChar                 121 451 132 460  '代'
        LTChar                 132 451 143 460  '兵'
        LTChar                 143 451 153 460  '器'
        LTChar                 153 451 164 460  '和'
        LTChar                 164 451 175 460  '战'
        LTChar                 175 451 186 460  '争'
        LTChar                 186 451 196 460  '带'
        LTChar                 196 451 207 460  '来'
        LTChar                 207 451 218 460  '了'
        LTChar                 218 451 228 460  '巨'
        LTChar                 228 451 239 460  '大'
        LTChar                 239 451 250 460  '的'
        LTChar                 250 451 260 460  '影'
        LTChar                 260 451 271 460  '响'
        LTChar                 271 451 282 460  '。'

……
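
Since the original question was about getting the OCR layer into a separate file, here is a minimal sketch of saving the same information as JSON instead of printing it. The names `iter_text_lines` and `save_text_layer` and the record layout (`page`, `bbox`, `text`) are my own assumptions, not part of pdfminer. To keep the walker free of imports it recognises layout objects by class name (`LTPage`, `LTTextLine…`), which is a shortcut; an `isinstance` check against `pdfminer.layout.LTTextLine` would be the stricter option.

```python
import json
from collections.abc import Iterable
from typing import Any, Iterator


def iter_text_lines(item: Any, page_no: int = 0) -> Iterator[dict]:
    """Yield one record per text line found under `item` (depth-first)."""
    name = item.__class__.__name__
    if name == "LTPage":
        # LTPage carries its page number in `pageid`
        page_no = getattr(item, "pageid", page_no)
    if name.startswith("LTTextLine"):
        yield {
            "page": page_no,
            "bbox": [round(v, 2) for v in item.bbox],  # x0, y0, x1, y1 in PDF points
            "text": item.get_text().strip(),
        }
        return  # do not descend into the individual LTChar objects
    if isinstance(item, Iterable):
        for child in item:
            yield from iter_text_lines(child, page_no)


def save_text_layer(pdf_path: str, out_path: str) -> int:
    """Write all text lines of `pdf_path` to `out_path` as JSON; return the line count."""
    from pdfminer.high_level import extract_pages  # third-party: pdfminer.six

    records = list(iter_text_lines(extract_pages(pdf_path)))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

Calling `save_text_layer("input.pdf", "text_layer.json")` (both file names are placeholders) would produce one JSON record per line. If per-character coordinates are needed, the same walker could be adapted to yield `LTChar` items instead of stopping at the line level.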

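Once you have this data, you can of course do the reverse and write the text into a PDF, piece by piece, at the recorded coordinates. There are plenty of example programs online, so I won't ramble on about it here.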
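
For completeness, here is a rough sketch of that reverse direction using reportlab, which is my choice of library and not something the post prescribes. It assumes each line is available as a dict with `bbox` (x0, y0, x1, y1) and `text` keys, which is a hypothetical record format, and that everything belongs on a single page. The function names, the default page size (taken from the `LTPage` line in the output above) and the rule "font size = line box height, baseline = bottom of the box" are all simplifying assumptions, so the placement will only be approximate.

```python
def plan_placements(records):
    """Turn line records into (x, baseline_y, font_size, text) tuples."""
    placements = []
    for rec in records:
        x0, y0, x1, y1 = rec["bbox"]
        font_size = y1 - y0  # assumption: box height is close to the font size
        if font_size <= 0 or not rec["text"]:
            continue  # skip degenerate boxes and empty lines
        # assumption: use the bottom of the box as the baseline (slightly too low)
        placements.append((x0, y0, font_size, rec["text"]))
    return placements


def write_text_layer(records, out_path, page_size=(392, 576), font_name="STSong-Light"):
    """Draw every record onto one blank page of the given size (in PDF points)."""
    # third-party: reportlab, imported lazily so plan_placements works without it
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.cidfonts import UnicodeCIDFont
    from reportlab.pdfgen import canvas

    # STSong-Light is one of reportlab's built-in CID fonts for Simplified Chinese
    pdfmetrics.registerFont(UnicodeCIDFont(font_name))
    c = canvas.Canvas(out_path, pagesize=page_size)
    for x, y, size, text in plan_placements(records):
        c.setFont(font_name, size)
        c.drawString(x, y, text)
    c.showPage()
    c.save()
```

This draws visible text on an otherwise empty page. Putting the text underneath or on top of the scanned page image, as in a real dual-layer PDF, needs extra steps that this sketch does not cover.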