How to get the coordinates of each piece of content in a PDF file?

Question

Asked by Keith Lê on 28 November 2023 at 03:36

I use Smalot\PdfParser to extract content from a PDF. As a beginner, I tried basic functions such as getText(), getDetails(), and getPages(), and then noticed this output from dd($page->getDataTm()):

0 => array:4 [▼
    0 => array:6 [▼
      0 => "1.00055"
      1 => "0"
      2 => "0"
      3 => "1"
      4 => "70.8"
      5 => "760.24"
    ]
    1 => " "
    2 => "R8"
    3 => "12"
  ]
  1 => array:4 [▼
    0 => array:6 [▼
      0 => "1.00055"
      1 => "0"
      2 => "0"
      3 => "1"
      4 => "70.8"
      5 => "745.72"
    ]
    1 => "Column1  Column2  Column3 "
    2 => "R10"
    3 => "12"
  ]

So $data[i][0] must hold the coordinates I need, but I don't know which values are X and Y, or how to extract them.
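For reference, the six-element array in each entry is the PDF text matrix [a, b, c, d, e, f]: the last two values are the x and y translation, and the first and fourth are scale factors. The sketch below names those parts for one entry copied from the dump above. The helper name describeTmEntry is hypothetical, and it assumes that font info is enabled (so the entry has four elements) and that no extra page transformation moves the origin away from the bottom-left corner.

```php
<?php
// Hypothetical helper: give names to the parts of one getDataTm() entry.
function describeTmEntry(array $entry): array
{
    // Text matrix [a, b, c, d, e, f]; the values arrive as numeric strings.
    [$a, $b, $c, $d, $e, $f] = array_map('floatval', $entry[0]);

    return [
        'x'        => $e, // horizontal position, in PDF points
        'y'        => $f, // vertical position, counted from the bottom of the page
        'scaleX'   => $a, // horizontal scale factor, not a width
        'scaleY'   => $d, // vertical scale factor, not a height
        'text'     => $entry[1],
        // The two fields below exist only when font info is included.
        'fontId'   => $entry[2] ?? null,
        'fontSize' => isset($entry[3]) ? (float) $entry[3] : null,
    ];
}

// Sample entry taken from the dump in the question.
$sample = [['1.00055', '0', '0', '1', '70.8', '745.72'], 'Column1  Column2  Column3 ', 'R10', '12'];
$info = describeTmEntry($sample);

echo "x={$info['x']}, y={$info['y']}, font={$info['fontId']}, size={$info['fontSize']}\n";
```

Note that getDataTm() gives a position per text chunk but no bounding box, so a real width and height would have to be estimated separately, for example from the font size.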

Here is the code:

use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

/* ... */

    protected function getCoordinates($pdfPath)
    {
        // get font details by config
        $config = new Config();
        $config->setDataTmFontInfoHasToBeIncluded(true);
        // get PDF structure
        $parser = new Parser([], $config);
        $pdf = $parser->parseFile($pdfPath);
        $coordinates = [];
        //dd($pdf->getPages()[1]->getDataTm());

        foreach ($pdf->getPages() as $page)
        {
            $page->getDataTm();
            $text = $page->getText();
            //$coordinates = ; // This is where I want to extract it
        }
        return $coordinates;
    }

Here is the sample content I can copy from the PDF:

Column1 Column2 Column3 L1C1 L1C2 L1C3 L2C1 L2C2 L2C3 L3C1 L3C2 L3C3 L4C1 L4C2 L4C3

The PDF contains a table with borders. This is the output I expect after extracting it to a .txt file:

[Page : 1, width = 1, height = 2]

[x:0, y:3, w: 4, h:5]Column1 Column2 Column3

[x:6, y:7, w: 8, h:5]L1C1 L1C2 L1C3

[x:6, y:7, w: 8, h:5]L2C1 L2C2 L2C3

[x:6, y:7, w: 8, h:5]L3C1 L3C2 L3C3

[x:6, y:7, w: 8, h:5]L4C1 L4C2 L4C3

The numbers 1 to 8 are placeholders I made up after looking at dd($pdf->getPages()[1]->getDataTm());. Some of the real values repeat, which is why there are only a few distinct placeholders. The PDF also has more than one page.

There is 1 answer

**Keith Lê** · Accepted Answer · 2023-11-29T08:14:29+00:00

To get the page width and height I use $page->getDetails(), because $page->getDataTm() does not include them. Here is the code:

protected function getCoordinates($pdfPath)
{
    $config = new Config();
    // add configs stuff
    $parser = new Parser([], $config);
    $pdf = $parser->parseFile($pdfPath);
    $coordinates = [];
    $currentPage = 1;

    foreach ($pdf->getPages() as $page)
    {
        // MediaBox is [left, bottom, right, top]; when the box starts at 0,0, [2] and [3] are the page width and height
        $details = $page->getDetails();
        $coordinates[] = "\n[Page : $currentPage, width = {$details['MediaBox'][2]}, height = {$details['MediaBox'][3]}]";
        foreach ($page->getDataTm() as $data)
        {
            // $data[0] is the text matrix [a, b, c, d, e, f]; e and f are the x and y position
            $x = $data[0][4];
            // PDF y is measured from the bottom of the page, so convert it to a distance from the top
            $y = $details['MediaBox'][3] - $data[0][5];
            // a and d are the matrix scale factors, used here as stand-ins for w and h; they are not the real text width and height
            $w = $data[0][0];
            $h = $data[0][3];
            // The parser leaves a backslash followed by \r where a long line was split, so discard it
            $s = mb_convert_encoding(str_replace("\\\r", '', $data[1]), 'UTF-8');
            $coordinates[] = "[x:$x, y:$y, w: $w, h:$h]{$s}";
        }
        $currentPage++;
    }
    if ($coordinates === []) {
        return back()->with('error', 'Coordinates not found.');
    }
    return $coordinates;
}
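The code above reads the page size straight from MediaBox[2] and MediaBox[3], which is only correct when the box starts at 0,0. A slightly more defensive sketch is shown below. The helper names pageSize and toTopOriginY are hypothetical, the sample MediaBox is an assumed US Letter page rather than a value from the question, and the code assumes getDetails() returns MediaBox as a four-element array of numeric values in the order [left, bottom, right, top].

```php
<?php
// Hypothetical helper: page width and height from a MediaBox [left, bottom, right, top].
function pageSize(array $mediaBox): array
{
    [$left, $bottom, $right, $top] = array_map('floatval', $mediaBox);

    return ['width' => $right - $left, 'height' => $top - $bottom];
}

// Hypothetical helper: convert a bottom-origin PDF y value to a distance from the top edge.
function toTopOriginY(float $y, array $mediaBox): float
{
    return (float) $mediaBox[3] - $y;
}

// Assumed sample values: a US Letter MediaBox and the first y value from the dump.
$mediaBox = ['0', '0', '612', '792'];
$size = pageSize($mediaBox);
$yFromTop = toTopOriginY(760.24, $mediaBox);

echo "width={$size['width']}, height={$size['height']}\n";
echo 'y from top: ' . round($yFromTop, 2) . "\n";
```

The converted y marks the text baseline relative to the top edge, so the top of the glyphs sits roughly one font size above it.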

I also don't know why the parser adds \r when a line is too long. When I write the result to a .txt file, it shows up in seemingly random places, for example Column1 Col\rumn2. I installed spatie/pdf-to-text, and it does not break long lines this way, but it does not extract the PDF header or coordinates, so I have to stick with PdfParser.
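A possible explanation, offered as a guess rather than something confirmed for this file: in PDF literal strings, a backslash at the end of a line marks a line continuation, so a PDF writer that wraps long strings leaves a backslash followed by a line break inside the raw text. The sketch below strips those pairs in all three line-ending forms, plus any leftover bare carriage return. The helper name stripLineContinuations is hypothetical.

```php
<?php
// Hypothetical helper: remove PDF-style line continuations from an extracted string.
function stripLineContinuations(string $text): string
{
    // Drop a backslash that is immediately followed by CRLF, CR or LF.
    $text = preg_replace('/\\\\(\r\n|\r|\n)/', '', $text);

    // Drop any carriage return that is still left over.
    return str_replace("\r", '', $text);
}

// "\\\r" in double quotes is a backslash followed by a carriage return.
$raw = "Column1 Col\\\rumn2";
$clean = stripLineContinuations($raw);

echo $clean . "\n";
```

Ordinary newlines that are not preceded by a backslash are left alone, so genuine line breaks in the text survive.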
