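The PDF is a very high-quality, text-based PDF. I tried converting it to XML and to HTML with pdfminer, and both times I ran into the same problem: the text passages come out in the wrong order.
On this page, for example, the relevant textbox elements appear in the XML in this order: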
<textbox id="1" bbox="84.960,741.233,113.039,775.869">
<textline bbox="84.960,757.313,88.439,775.869">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,757.313,88.439,775.869" size="18.555"> </text>
<text>
</text>
</textline>
<textline bbox="84.960,741.233,113.039,763.476">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,741.233,95.787,759.789" size="18.555">Q</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="95.760,741.233,105.808,759.789" size="18.555">A</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="105.840,748.750,110.399,760.907" size="12.157">1</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="110.400,749.399,113.039,763.476" size="14.076"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="2" bbox="84.960,628.193,124.679,647.867">
<textline bbox="84.960,628.193,124.679,647.867">
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="84.960,635.710,89.519,647.867" size="12.157">o</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="89.280,628.401,99.361,645.676" size="17.276">Q</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="99.361,628.401,108.716,645.676" size="17.276">A</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="108.720,635.386,112.919,646.583" size="11.197">3</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="113.040,628.193,117.674,646.707" size="18.514">-</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="117.600,628.193,121.079,646.707" size="18.514"> </text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="121.200,628.193,124.679,646.707" size="18.514"> </text>
<text>
</text>
</textline>
</textbox>
<textbox id="3" bbox="112.080,676.913,556.199,743.949">
<textline bbox="120.240,725.393,552.314,743.949">
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="120.240,725.393,128.743,743.907" size="18.514">Х</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="128.646,725.393,135.605,743.907" size="18.514">а</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="135.605,725.393,141.673,743.907" size="18.514">л</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="141.617,725.393,147.796,743.907" size="18.514">х</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="147.840,725.393,151.319,743.684" size="18.291">.</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="151.431,725.393,154.910,743.684" size="18.291"> </text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="155.760,725.393,162.719,743.949" size="18.555">х</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="162.482,725.393,169.441,743.949" size="18.555">а</text>
<text font="AHUFFA+TimesNewRomanPS-BoldMT" bbox="169.441,725.393,176.399,743.949" size="18.555">а</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="176.400,725.393,179.879,743.684" size="18.291">,</text>
<text font="BGMXTR+TimesNewRomanPSMT" bbox="179.991,725.393,183.470,743.684" size="18.291"> </text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="184.080,725.393,191.080,743.907" size="18.514">б</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="191.053,725.393,197.232,743.907" size="18.514">у</text>
<text font="AHDCWA+TimesNewRomanPS-ItalicMT" bbox="197.287,725.393,204.246,743.907" size="18.514">р</text>
...
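These are, respectively, the headword of the first entry, the headword of the third entry, and the body of the first entry, so the headwords no longer line up with their entry bodies. Is there a way to fix this?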
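One thing that might be worth trying is to ignore the order in which the textboxes are emitted and re-sort them by their `bbox` attribute. The sketch below is only an illustration and has not been run against the attached file. It assumes a single-column page where reading order is simply top to bottom, and it assumes `bbox` is `x0,y0,x1,y1` with y growing upward, as the values above suggest. The helper names (`parse_bbox`, `textbox_text`, `boxes_in_reading_order`) and the shortened `SAMPLE` page are made up for the example; only the three bbox values are taken from the dump above.

```python
import xml.etree.ElementTree as ET


def parse_bbox(value):
    """Split a 'x0,y0,x1,y1' attribute string into four floats."""
    x0, y0, x1, y1 = (float(v) for v in value.split(","))
    return x0, y0, x1, y1


def textbox_text(box):
    """Concatenate the characters of all <text> children of a textbox."""
    return "".join(t.text or "" for t in box.iter("text")).strip()


def boxes_in_reading_order(container):
    """Return the textboxes sorted top to bottom, then left to right.

    Assumption: single-column layout, so the top edge (y1) of each box
    is enough to decide which box comes first.
    """
    def key(box):
        x0, _y0, _x1, y1 = parse_bbox(box.get("bbox"))
        return (-y1, x0)  # larger y1 means higher on the page

    return sorted(container.iter("textbox"), key=key)


# Shortened stand-in for one <page>; bbox values copied from the dump above.
SAMPLE = """
<page id="1">
<textbox id="1" bbox="84.960,741.233,113.039,775.869"><textline><text>Q</text><text>A</text><text>1</text></textline></textbox>
<textbox id="2" bbox="84.960,628.193,124.679,647.867"><textline><text>o</text><text>Q</text><text>A</text><text>3</text><text>-</text></textline></textbox>
<textbox id="3" bbox="112.080,676.913,556.199,743.949"><textline><text>Х</text><text>а</text><text>л</text><text>х</text><text>.</text></textline></textbox>
</page>
""".strip()

if __name__ == "__main__":
    page = ET.fromstring(SAMPLE)
    for box in boxes_in_reading_order(page):
        print(box.get("id"), textbox_text(box))
```

With the three boxes above this puts textbox 3 (the body of the first entry) between textbox 1 and textbox 2. For the real file the sort would have to be applied per `<page>`, and it would break on multi-column pages. The `boxes_flow` setting of pdfminer's `LAParams` also influences box ordering and may be worth experimenting with, though I have not checked what it does on this document.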
3-tom-s-korrek-szhatyj.pdf (1.8 MB)
vol3.xml.zip (4.9 MB)