I'm trying to scrape data from the MEPCO bill site in PHP. Specifically, I want to extract the bill details and save them to a database.
Here's an example of the HTML structure I want to scrape:
<html>
<body>
<div id="bill-details">
<h2>Electricity Bill Details</h2>
<p>Payable Amount: $200</p>
<p>Due Date: 2023-05-01</p>
<p>Description: This is your electricity bill for the month of April 2023.</p>
</div>
</body>
</html>
I want to extract the payable amount and due date from this HTML. Here's the code I have tried so far:
$html = '<html>...'; // the HTML from the example above
preg_match('/<h2>(.*)<\/h2>/', $html, $billHeading);
preg_match('/<p>Payable Amount: (.*)<\/p>/', $html, $payableAmount);
preg_match('/<p>Due Date: (.*)<\/p>/', $html, $dueDate);
echo "Bill Heading: ".$billHeading[1];
echo "Payable Amount: ".$payableAmount[1];
echo "Due Date: ".$dueDate[1];
However, this code is not working as expected. It is not extracting the correct payable amount and due date. Can someone please help me correct this code or suggest a better way to extract the data from HTML using PHP?
Your example seems to work, as best as I can tell. This is what I ran:
And that generates this result:
Without knowing exactly what isn't working, it's hard to say what the problem is. As for ways to improve it, one of the other comments suggested using a specialized DOM parsing library, and I agree with that. If you must rely on regexes, I suggest making the patterns as specific as possible. For example, if the date is always in that format, match it with something like (\d{4}-\d{2}-\d{2}) rather than (.*).
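To make both suggestions concrete, here is a minimal sketch that combines PHP's built-in DOM extension with tightly scoped patterns. It assumes the page looks like the sample in the question (a div with id "bill-details" containing one h2 and several p elements); the real MEPCO markup may differ. The function name extractBillDetails, the array keys, and the amount pattern (an optional dollar sign followed by digits) are my own assumptions for illustration.

```php
<?php
// Sample markup copied from the question. Nowdoc syntax keeps "$200" literal.
$html = <<<'HTML'
<html>
<body>
<div id="bill-details">
<h2>Electricity Bill Details</h2>
<p>Payable Amount: $200</p>
<p>Due Date: 2023-05-01</p>
<p>Description: This is your electricity bill for the month of April 2023.</p>
</div>
</body>
</html>
HTML;

// Hypothetical helper: returns the heading, payable amount, and due date,
// leaving a field as null when it is not found.
function extractBillDetails(string $html): array
{
    $doc = new DOMDocument();

    // Real pages are often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $details = ['heading' => null, 'payable_amount' => null, 'due_date' => null];

    // Look only inside the bill container, not at every h2 on the page.
    $heading = $xpath->query('//div[@id="bill-details"]/h2')->item(0);
    if ($heading !== null) {
        $details['heading'] = trim($heading->textContent);
    }

    foreach ($xpath->query('//div[@id="bill-details"]/p') as $p) {
        $text = trim($p->textContent);

        // Anchored, specific patterns: a mismatch leaves the field null
        // instead of silently capturing unrelated text.
        if (preg_match('/^Payable Amount:\s*\$?(\d+(?:\.\d{1,2})?)$/', $text, $m)) {
            $details['payable_amount'] = $m[1];
        } elseif (preg_match('/^Due Date:\s*(\d{4}-\d{2}-\d{2})$/', $text, $m)) {
            $details['due_date'] = $m[1];
        }
    }

    return $details;
}

$details = extractBillDetails($html);
echo "Bill Heading: {$details['heading']}\n";
echo "Payable Amount: {$details['payable_amount']}\n";
echo "Due Date: {$details['due_date']}\n";
```

Because each field stays null when its pattern does not match, you can check for null before writing to the database, which should make it easier to see which part of the real page differs from the sample.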