I have a whole bunch of large HTML documents with tables of data inside, and I'm looking to write a script which can process an HTML file, isolate the `<table>` tags and their contents, concatenate all the rows from those tables into one large data table, and then loop through the rows and columns of that new table.
After some research I've started trying out PHP's DOMDocument class to parse the HTML, but I just wanted to know: is that the best way to do something like this?
This is what I've got so far...
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;               // drop insignificant whitespace text nodes
@$dom->loadHTMLFile('exrate.html');             // @ suppresses warnings from malformed HTML
$tables = $dom->getElementsByTagName('table');  // live DOMNodeList of every <table>
How do I chop out everything other than the tables and their contents? Then I'd actually like to remove the first table since it's a table of contents. Then loop through all the table rows and build them into one large table.
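For the first two steps, you don't actually need to chop anything out of the document: `getElementsByTagName('table')` already gives you just the tables, and skipping index 0 drops the table of contents. A minimal sketch (the inline HTML here is a made-up stand-in for exrate.html, since I can't load the real file):

```php
<?php
// Stand-in for the real exrate.html.
$html = '<html><body>
  <p>intro text</p>
  <table><tr><td>TOC</td></tr></table>
  <table><tr><td>GBP</td><td>1.00</td></tr></table>
  <table><tr><td>USD</td><td>1.25</td></tr></table>
</body></html>';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
@$dom->loadHTML($html);

$tables = $dom->getElementsByTagName('table');

// Start at index 1 to skip the table-of-contents table,
// then walk every row of the remaining tables.
for ($i = 1; $i < $tables->length; $i++) {
    foreach ($tables->item($i)->getElementsByTagName('tr') as $row) {
        echo trim($row->textContent), "\n";
    }
}
```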
Anyone got any hints on how to do this? I've been digging through the docs for DOMDocument on php.net but I'm finding the syntax pretty baffling!
Cheers, B
EDIT: Here is a sample of an HTML file with the data tables I'd like to join http://thenetzone.co.uk/exrates/exrate.html
OK, got it sorted with phpQuery and a lot of trial and error.
So it takes a whole bunch of tables, moves their contents into the first one, and removes the now-empty tables. Then it loops through each table row and extracts the text from specific columns, in this case the 2nd and 3rd `<td>` of each row.
Hope this helps someone out!