How do I fix the wrong element order after converting a PDF to XML?

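The PDF is a very high-quality text PDF. I tried converting it to XML and to HTML with pdfminer, and hit the same problem both times: the passages come out in the wrong order.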

Take this page, for example:

The relevant textboxes in the XML appear in this order:

<textbox id="1" bbox="84.960,741.233,113.039,775.869">
<textline bbox="84.960,757.313,88.439,775.869">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,757.313,88.439,775.869" size="18.555"> </text>
<text>
</text>
</textline>
<textline bbox="84.960,741.233,113.039,763.476">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,741.233,95.787,759.789" size="18.555">Q</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="95.760,741.233,105.808,759.789" size="18.555">A</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="105.840,748.750,110.399,760.907" size="12.157">1</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="110.400,749.399,113.039,763.476" size="14.076"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="2" bbox="84.960,628.193,124.679,647.867">
<textline bbox="84.960,628.193,124.679,647.867">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,635.710,89.519,647.867" size="12.157">o</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="89.280,628.401,99.361,645.676" size="17.276">Q</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="99.361,628.401,108.716,645.676" size="17.276">A</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="108.720,635.386,112.919,646.583" size="11.197">3</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="113.040,628.193,117.674,646.707" size="18.514">-</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="117.600,628.193,121.079,646.707" size="18.514"> </text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="121.200,628.193,124.679,646.707" size="18.514"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="3" bbox="112.080,676.913,556.199,743.949">
<textline bbox="120.240,725.393,552.314,743.949">
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="120.240,725.393,128.743,743.907" size="18.514">Х</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="128.646,725.393,135.605,743.907" size="18.514">а</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="135.605,725.393,141.673,743.907" size="18.514">л</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="141.617,725.393,147.796,743.907" size="18.514">х</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="147.840,725.393,151.319,743.684" size="18.291">.</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="151.431,725.393,154.910,743.684" size="18.291"> </text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="155.760,725.393,162.719,743.949" size="18.555">х</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="162.482,725.393,169.441,743.949" size="18.555">а</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="169.441,725.393,176.399,743.949" size="18.555">а</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="176.400,725.393,179.879,743.684" size="18.291">,</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="179.991,725.393,183.470,743.684" size="18.291"> </text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="184.080,725.393,191.080,743.907" size="18.514">б</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="191.053,725.393,197.232,743.907" size="18.514">у</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="197.287,725.393,204.246,743.907" size="18.514">р</text>
...

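These are, in order, the headword of the first entry, the headword of the third entry, and the body of the first entry, so headwords and entry bodies no longer line up. Is there a way to fix this?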

3-tom-s-korrek-szhatyj.pdf (1.8 MB)
vol3.xml.zip (4.9 MB)

One common issue with PDF text extraction is that text may not appear in any particular reading order.

This is the responsibility of the PDF creator (software or a human). For example, page headers may have been inserted in a separate step – after the document had been produced. In such a case, the header text will appear at the end of a page text extraction (although it will be correctly shown by PDF viewer software).

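Sorting the smallest elements on each page by (y1, x0) gives the natural reading order (top to bottom, left to right). A bbox is the rectangle defined by two points, (x0, y0, x1, y1).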
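A minimal sketch of that sort, applied to pdfminer-style XML like the excerpt in the question. Several things here are assumptions rather than anything from the thread: the `<page>` wrapper, the abridged sample data, the helper names, and the choice to sort whole `textline` elements instead of single characters (superscripts have a different y1 than their neighbours). Also note that pdfminer measures y from the bottom of the page, so "top to bottom" means descending y1 here, which is why the key negates it.

```python
import xml.etree.ElementTree as ET

# Abridged sample based on the excerpt in the question; the <page> wrapper
# and the stripped-down <text> elements are assumptions for illustration.
SAMPLE = """<page>
<textbox id="1" bbox="84.960,741.233,113.039,775.869">
<textline bbox="84.960,741.233,113.039,763.476"><text>Q</text><text>A</text><text>1</text></textline>
</textbox>
<textbox id="2" bbox="84.960,628.193,124.679,647.867">
<textline bbox="84.960,628.193,124.679,647.867"><text>o</text><text>Q</text><text>A</text><text>3</text><text>-</text></textline>
</textbox>
<textbox id="3" bbox="112.080,676.913,556.199,743.949">
<textline bbox="120.240,725.393,552.314,743.949"><text>Х</text><text>а</text><text>л</text><text>х</text><text>.</text></textline>
</textbox>
</page>"""


def parse_bbox(value):
    """Turn 'x0,y0,x1,y1' into a tuple of four floats."""
    return tuple(float(v) for v in value.split(","))


def sort_key(line):
    """Top to bottom, then left to right.

    pdfminer's y axis grows upward, so y1 is negated to put the
    highest line on the page first.
    """
    x0, _y0, _x1, y1 = parse_bbox(line.get("bbox"))
    return (-y1, x0)


def lines_in_reading_order(page):
    """Return every <textline> under `page`, sorted by position."""
    return sorted(page.iter("textline"), key=sort_key)


def line_text(line):
    """Concatenate the characters of one <textline>."""
    return "".join(t.text or "" for t in line.iter("text")).strip()


if __name__ == "__main__":
    page = ET.fromstring(SAMPLE)
    for line in lines_in_reading_order(page):
        print(line_text(line))
```

On this sample the first entry's headword comes first, then its body line, then the later headword. A plain positional sort like this only suits single-column pages; a multi-column layout would need the lines grouped by column before sorting.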

With PyMuPDF, you can try Page.get_text("dict", sort=True).


Thanks! Sorting works nicely.

I also tried converting the PDF to HTML or XML one page at a time, and the order came out correct each time. I have no idea why that works.