I use Smalot\PdfParser to extract content from PDFs. As a beginner, I have been experimenting with basic functions like getText(), getDetails(), getPages(), etc., and I noticed this output when dumping $data = $page->getDataTm() with dd($data):
0 => array:4 [
  0 => array:6 [
    0 => "1.00055"
    1 => "0"
    2 => "0"
    3 => "1"
    4 => "70.8"
    5 => "760.24"
  ]
  1 => " "
  2 => "R8"
  3 => "12"
]
1 => array:4 [
  0 => array:6 [
    0 => "1.00055"
    1 => "0"
    2 => "0"
    3 => "1"
    4 => "70.8"
    5 => "745.72"
  ]
  1 => "Column1 Column2 Column3 "
  2 => "R10"
  3 => "12"
]
So $data[i][0] must hold the coordinates I need, but I don't know which values are X and Y, or how to extract them specifically.
Here is the code:
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

/* ... */

protected function getCoordinates($pdfPath)
{
    // include font details (font id and size) in the getDataTm() output
    $config = new Config();
    $config->setDataTmFontInfoHasToBeIncluded(true);

    // parse the PDF structure
    $parser = new Parser([], $config);
    $pdf = $parser->parseFile($pdfPath);

    $coordinates = [];
    //dd($pdf->getPages()[1]->getDataTm());
    foreach ($pdf->getPages() as $page) {
        $dataTm = $page->getDataTm(); // keep the result instead of discarding it
        $text = $page->getText();
        //$coordinates = ; // This is where I want to extract it
    }

    return $coordinates;
}
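For reference, here is a minimal sketch of how each getDataTm() entry might be unpacked. It assumes the six-element array is the PDF text matrix [a, b, c, d, e, f], in which case indexes 4 and 5 would be the X and Y translation in PDF points, measured from the bottom-left corner of the page. The helper name extractCoordinates and the output keys are hypothetical (not part of PdfParser), and the sample data is copied from the dump above:

```php
<?php
// Hypothetical helper, not part of Smalot\PdfParser.
// Assumption: $entry[0] is the text matrix [a, b, c, d, e, f],
// so index 4 is X and index 5 is Y.
function extractCoordinates(array $dataTm): array
{
    $result = [];
    foreach ($dataTm as $entry) {
        $matrix = $entry[0];
        $result[] = [
            'x'        => (float) $matrix[4],
            'y'        => (float) $matrix[5],
            'text'     => $entry[1],
            // Font id and size are only present when
            // setDataTmFontInfoHasToBeIncluded(true) was set.
            'fontId'   => $entry[2] ?? null,
            'fontSize' => isset($entry[3]) ? (float) $entry[3] : null,
        ];
    }
    return $result;
}

// Sample entries copied from the dd() output.
$sample = [
    [['1.00055', '0', '0', '1', '70.8', '760.24'], ' ', 'R8', '12'],
    [['1.00055', '0', '0', '1', '70.8', '745.72'], 'Column1 Column2 Column3 ', 'R10', '12'],
];

$coordinates = extractCoordinates($sample);
print_r($coordinates[1]);
```

Inside the foreach loop, the same helper could presumably be called as extractCoordinates($page->getDataTm()) and the result appended to $coordinates per page.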
Here is the sample content I can copy from the PDF:
Column1 Column2 Column3 L1C1 L1C2 L1C3 L2C1 L2C2 L2C3 L3C1 L3C2 L3C3 L4C1 L4C2 L4C3
It has a table with borders. This is the output I expect after extracting it to a .txt file:
[Page : 1, width = 1, height = 2]
[x:0, y:3, w: 4, h:5]Column1 Column2 Column3
[x:6, y:7, w: 8, h:5]L1C1 L1C2 L1C3
[x:6, y:7, w: 8, h:5]L2C1 L2C2 L2C3
[x:6, y:7, w: 8, h:5]L3C1 L3C2 L3C3
[x:6, y:7, w: 8, h:5]L4C1 L4C2 L4C3
The numbers 1 to 8 are placeholders I made up based on dd($pdf->getPages()[1]->getDataTm()); some values repeat across entries, which is why there are so few distinct placeholders. The PDF also has more than one page.
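As a rough sketch of the line format above: this assumes indexes 4 and 5 of the matrix are X and Y, and it uses the font size as a stand-in for the height, which is only a guess on my part. The width is left out because the getDataTm() output shown above does not contain it. formatEntry is a hypothetical helper, not part of PdfParser:

```php
<?php
// Hypothetical helper: turns one getDataTm() entry into a "[x:.., y:.., h:..]text" line.
// Assumptions: matrix[4] = X, matrix[5] = Y, font size (entry[3]) approximates height.
function formatEntry(array $entry): string
{
    [$matrix, $text] = $entry;
    $height = $entry[3] ?? '?'; // missing unless font info was enabled in Config

    return sprintf('[x:%s, y:%s, h:%s]%s', $matrix[4], $matrix[5], $height, rtrim($text));
}

// Second entry from the dd() output.
$entry = [['1.00055', '0', '0', '1', '70.8', '745.72'], 'Column1 Column2 Column3 ', 'R10', '12'];

echo formatEntry($entry), PHP_EOL;
// prints: [x:70.8, y:745.72, h:12]Column1 Column2 Column3
```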
To get the page width and height I use $page->getDetails(), because $page->getDataTm() doesn't have those elements. Here is the code:
Also, I don't know why the parser automatically adds \r when a line is too long. When I extract the text to a .txt file, it shows up at random places, like Column1 Col\rumn2. I installed spatie/pdf-to-text, and it doesn't break long lines. But spatie doesn't extract the PDF header or coordinates, so I have to stick with PdfParser.
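One workaround I could try, assuming the stray \r characters never carry meaning on their own, is to strip any carriage return that is not part of a \r\n pair before writing the .txt file. This only cleans up the output and does not explain why the parser inserts them. stripStrayCarriageReturns is a hypothetical helper, not part of PdfParser:

```php
<?php
// Hypothetical helper: removes "\r" only when it is not followed by "\n",
// so genuine CRLF line endings are left untouched.
function stripStrayCarriageReturns(string $text): string
{
    // preg_replace() returns null on failure; fall back to the input in that case.
    return preg_replace('/\r(?!\n)/', '', $text) ?? $text;
}

$raw = "Column1 Col\rumn2 Column3\r\nL1C1 L1C2 L1C3";
$clean = stripStrayCarriageReturns($raw);

echo $clean, PHP_EOL;
```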