PDF math formulas to txt


I have lots of PDF files with text, charts, and formulas in them.

I want to find the position of each formula, from its starting point to its ending point. pdfminer.six and pypdf do not return the formulas correctly. I also tried OCR tools and ScanSSD, but they are far too old and gave tons of errors during setup.


There are 2 answers

K J On

The question's focus is the formula on page 23 of https://www.st.com/resource/en/datasheet/vl6180.pdf (font object #483). Its pieces are drawn in this order:

SNR, RESULT__RANGE_RETURN_SIGNAL_COUNT{0x6C}, RESULT__RANGE_RETURN_AMB_COUNT{0x74} * 6, a long row of dashes (the fraction bar), and finally =

That is, SNR = RESULT__RANGE_RETURN_SIGNAL_COUNT{0x6C} / (RESULT__RANGE_RETURN_AMB_COUNT{0x74} * 6). So the drawing starts with the S and ends with the =, having zigzagged around the formula. The bounds therefore have to be measured as the total height and total width of all the pieces.

If you are lucky, the formula will have been tagged as a "Figure", and here we can see that it is the marked content with MCID 34.


Thus we can search the page content for its data. Here is just the first part, as the object body is many lines long.

EMC
/Figure <</MCID 34>>BDC
q
1 i 
129.96 685.58 391.98 -38.76 re
W* n
123.96 749.54 403.98 -102.72 re
W* n
0 841.98 595.2 -841.98 re
W n
BT
/F1 1 Tf
9 0 0 9 200.04 662.7803 Tm
.06 Tc
(SNR)Tj
3.8267 .5533 TD
[(RE)6.7(SULT)]TJ
3.9467 0 TD
0 Tc
[(__RANGE)6.7(_RETURN_S)6.7(IGNAL_CO)6.7(UNT{0x6C)6.7(})]TJ

So the left edge is 200.04 and, at a GUESS, the right-hand end will be 450.###. Likewise the height runs from roughly 654 to 675. Normally we could not know with precision what scale the font letters would be, as that would be device specific; however, here the fonts are embedded, so the dimensions are fixed.
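The right-hand guess can be checked by arithmetic for the part of the stream shown above. This is only a rough sketch under stated assumptions: the helper name `tj_width` is made up for illustration, the advance of 0.6 per character is assumed because the font is a fixed-pitch Courier, and strings containing escaped parentheses are not handled. The start x follows from the operators above: the Tm origin 200.04 plus the two TD offsets, each scaled by the 9 in the text matrix. Tc is 0 for this string, so character spacing is ignored.

```python
import re


def tj_width(tj_array, font_size, advance=0.6):
    """Width in points of a TJ array set in a fixed-pitch font.

    Assumes every character advances by `advance` text units and that
    numbers between strings are TJ adjustments in 1/1000 text units
    (a positive number moves the next glyph left).
    """
    width = 0.0
    for text, number in re.findall(r"\(([^)]*)\)|(-?\d*\.?\d+)", tj_array):
        if number:
            width -= float(number) / 1000.0
        else:
            width += len(text) * advance
    return width * font_size


# Origin from "9 0 0 9 200.04 662.7803 Tm", moved by the two TD operators.
start_x = 200.04 + 9 * (3.8267 + 3.9467)
numerator = "[(__RANGE)6.7(_RETURN_S)6.7(IGNAL_CO)6.7(UNT{0x6C)6.7(})]"
right_edge = start_x + tj_width(numerator, font_size=9)
print(f"{right_edge:.2f}")  # 447.96
```

This is the right end of the numerator only. The overall right edge is set by the longer denominator line, which is why the estimate for the whole formula is a little larger.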


For a programmatic answer there are many applications; personally I would try MuPDF, as it is suitable for giving an accurate set of values.

<fill_text colorspace="DeviceGray" color="0" ri="1" bp="1" op="0" opm="1" transform="1 0 0 -1 0 842">
                <span font="JHKLGG+Courier" wmode="0" bidi="0" trm="9 0 0 9">
                    <g unicode="S" glyph="S" x="200.04" y="662.7803" adv=".6"/>
                    <g unicode="N" glyph="N" x="205.98" y="662.7803" adv=".6"/>
                    <g unicode="R" glyph="R" x="211.92" y="662.7803" adv=".6"/>
                    <g unicode="R" glyph="R" x="234.48029" y="667.75997" adv=".6"/>
                    <g unicode="E" glyph="E" x="240.42029" y="667.75997" adv=".6"/>

So we can say the upper left bound is x = 200.04 and, for y, 667.75997 plus a nominal 9 for the font height, so nominally 676.75997 (you can round that down, as 9 is overkill). For the lower right bound we can use

 <g unicode=" " glyph="space" x="439.91856" y="656.71969" adv=".6"/>
                    <g unicode="6" glyph="six" x="445.31855" y="656.71969" adv=".6"/>

Thus x = 445.31855 + 0.6 × 9 = 450.719 and y = 656.720, but that y value will need rounding down for descenders, so at a guess 655, the same as was done by eye.
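The same bounds can be collected in code from output like the XML above. The following is a minimal sketch, not a definitive implementation: the function name `glyph_bounds` is hypothetical, the sample is cut down to three of the glyphs quoted above, and it assumes unrotated text so that the first number of `trm` can be read as the font size.

```python
import xml.etree.ElementTree as ET

# Cut-down sample: first glyph, one numerator glyph, and the final "6".
TRACE = """
<fill_text transform="1 0 0 -1 0 842">
  <span font="JHKLGG+Courier" trm="9 0 0 9">
    <g unicode="S" glyph="S" x="200.04" y="662.7803" adv=".6"/>
    <g unicode="R" glyph="R" x="234.48029" y="667.75997" adv=".6"/>
    <g unicode="6" glyph="six" x="445.31855" y="656.71969" adv=".6"/>
  </span>
</fill_text>
"""


def glyph_bounds(xml_text):
    """Return (x_min, y_min, x_max, y_max) over all <g> glyphs.

    Each glyph spans x .. x + adv * size horizontally and
    baseline .. baseline + size vertically (nominal, no descenders).
    """
    root = ET.fromstring(xml_text)
    xs, ys = [], []
    for span in root.iter("span"):
        size = float(span.get("trm").split()[0])  # assumed font size
        for g in span.iter("g"):
            x, y, adv = (float(g.get(k)) for k in ("x", "y", "adv"))
            xs += [x, x + adv * size]
            ys += [y, y + size]
    return min(xs), min(ys), max(xs), max(ys)


x_min, y_min, x_max, y_max = glyph_bounds(TRACE)
print(f"{x_min:.2f} {y_min:.2f} {x_max:.2f} {y_max:.2f}")
```

The lower y is the baseline of the bottom line, so it still needs the small allowance for descenders mentioned above.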

vin On

One way of getting the formulas is to read the Figures from the raw page data. The formula you pointed to is a Figure and is marked text content. Text is rendered by moving the text position, showing text, moving the text position again, and so on. All these rules are in the PDF itself. How you generate text from these rules depends on how you want to consume the formulas. The following is a snippet:

import re

import PyPDF2
from PyPDF2.generic import EncodedStreamObject


if __name__ == "__main__":

    document = PyPDF2.PdfReader("vl6180.pdf")
    figures_data = []
    for i, page in enumerate(document.pages):
        print(f"Processing Page {i}")
        contents = page["/Contents"].get_object()
        # Only a single compressed content stream is handled here; pages whose
        # /Contents is an array of streams are skipped.
        if isinstance(contents, EncodedStreamObject):
            page_data = contents.get_data().decode("utf-8", errors="ignore")
            # A tagged figure runs from "/Figure" up to the next "EMC".
            for match in re.finditer("/Figure", page_data):
                figure_data_start = match.start()
                figure_data_end = page_data.find("EMC", figure_data_start)
                if figure_data_end == -1:
                    continue  # no closing EMC found, skip this match
                figure_raw_data = page_data[figure_data_start:figure_data_end]
                figures_data.append(figure_raw_data.split("\n"))
    print(f"Found {len(figures_data)} figures")

...
Found 4 figures

This is the data for one of the formulas. Because the text is rendered piece by piece at explicit text positions, you can choose how you want to consume the formula text.

/Figure <</MCID 34>>BDC
q
1 i
129.96 685.58 391.98 -38.76 re
W* n
123.96 749.54 403.98 -102.72 re
W* n
0 841.98 595.2 -841.98 re
W n
BT
/F1 1 Tf
9 0 0 9 200.04 662.7803 Tm
.06 Tc
(SNR)Tj
3.8267 .5533 TD
[(RE)6.7(SULT)]TJ
3.9467 0 TD
0 Tc
[(__RANGE)6.7(_RETURN_S)6.7(IGNAL_CO)6.7(UNT{0x6C)6.7(})]TJ
-4.24 -1.2267 TD
[(R)-60(E)-60(S)-53.3(U)-60(L)-60(T)-53.3(__RANGE_)6.7(RETURN_A)6.7(MB_COUNT)6.7({0x74} *)6.7( 6)]TJ
0 .6733 TD
(-)Tj
.3 0 TD
(-)Tj
.3 0 TD
[the "(-)Tj" and ".3 0 TD" pair repeats for every remaining dash of the fraction bar; a few of the steps are .2933 and the last one is .0533]
(-)Tj
-24.84 0 TD
(=)Tj
ET
EMC

Some more documentation:

Text Matrix (Tm): sets the text matrix (and the text line matrix), which defines the transformation for text rendering, including scaling, rotation, and translation.


Move Text Position (Td or TD): moves to the start of the next line, offset by (tx, ty) from the start of the current line. Both operators are relative; TD additionally sets the text leading to -ty.


Show Text (Tj or TJ): displays the actual text, using either Tj (show a string) or TJ (show strings with explicit glyph position adjustments between them).

Note: the TD/Td that positions a piece of text comes before the Tj/TJ that shows it. Imagine this like any other rendering software: move first, then draw.
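To make the move-then-show order concrete, here is a rough sketch that walks content stream lines like the ones above and reports where each piece of text starts. It rests on several assumptions: one operator per line, unrotated text (only the scale and translation of Tm are used), and every Tj/TJ directly preceded by a Tm or TD/Td, so glyph advances are not tracked. The name `walk_text` is made up for illustration, and escaped parentheses inside strings are not handled.

```python
import re


def walk_text(lines):
    """Yield (x, y, text) for each Tj/TJ at the current line origin."""
    sx = sy = 1.0  # scale taken from the last Tm
    x = y = 0.0    # start of the current text line, in page units
    for line in lines:
        line = line.strip()
        parts = line.split()
        if not parts:
            continue
        if parts[-1] == "Tm" and len(parts) == 7:
            sx, _, _, sy, x, y = map(float, parts[:6])
        elif parts[-1] in ("Td", "TD") and len(parts) == 3:
            # Offsets are relative to the line start, scaled by Tm.
            x += sx * float(parts[0])
            y += sy * float(parts[1])
        elif line.endswith(("Tj", "TJ")):
            yield x, y, "".join(re.findall(r"\(([^)]*)\)", line))


stream = [
    "BT",
    "/F1 1 Tf",
    "9 0 0 9 200.04 662.7803 Tm",
    ".06 Tc",
    "(SNR)Tj",
    "3.8267 .5533 TD",
    "[(RE)6.7(SULT)]TJ",
    "3.9467 0 TD",
    "0 Tc",
    "[(__RANGE)6.7(_RETURN_S)6.7(IGNAL_CO)6.7(UNT{0x6C)6.7(})]TJ",
]
for x, y, text in walk_text(stream):
    print(f"{x:.2f} {y:.2f} {text}")
```

Feeding it one of the `figures_data` lists should give a start point per piece, from which the formula can be reassembled in whatever order suits the consumer.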