How to scrape data from MEPCO duplicate bill checking site in PHP?

I'm trying to scrape data from the MEPCO bill site in PHP. Specifically, I want to extract the bill details and save them to a database.

Here's an example of the HTML structure I want to scrape:

<html>
  <body>
    <div id="bill-details">
      <h2>Electricity Bill Details</h2>
      <p>Payable Amount: $200</p>
      <p>Due Date: 2023-05-01</p>
      <p>Description: This is your electricity bill for the month of April 2023.</p>
    </div>
  </body>
</html>

I want to extract the payable amount and due date from this HTML. Here's the code I have tried so far:

$html = '<html>...'; // the HTML from the example above
preg_match('/<h2>(.*)<\/h2>/', $html, $billHeading);
preg_match('/<p>Payable Amount: (.*)<\/p>/', $html, $payableAmount);
preg_match('/<p>Due Date: (.*)<\/p>/', $html, $dueDate);
echo "Bill Heading: ".$billHeading[1];
echo "Payable Amount: ".$payableAmount[1];
echo "Due Date: ".$dueDate[1];

However, this code is not working as expected. It is not extracting the correct payable amount and due date. Can someone please help me correct this code or suggest a better way to extract the data from HTML using PHP?

There is 1 answer

Tonsil On

Your example seems to work, as best as I can tell. This is what I ran:

<?php

// Nowdoc (quoted label), so a "$" in the markup is never treated as a PHP variable.
$html = <<<'HTML'
<html>
  <body>
    <div id="bill-details">
      <h2>Electricity Bill Details</h2>
      <p>Payable Amount: $200</p>
      <p>Due Date: 2023-05-01</p>
      <p>Description: This is your electricity bill for the month of April 2023.</p>
    </div>
  </body>
</html>
HTML;

preg_match('/<h2>(.*)<\/h2>/', $html, $billHeading);
preg_match('/<p>Payable Amount: (.*)<\/p>/', $html, $payableAmount);
preg_match('/<p>Due Date: (.*)<\/p>/', $html, $dueDate);
echo "Bill Heading: '".$billHeading[1] . "'\n";
echo "Payable Amount: '".$payableAmount[1] ."'\n";
echo "Due Date: '".$dueDate[1] ."'\n";

And that generates this result:

Bill Heading: 'Electricity Bill Details'
Payable Amount: '$200'
Due Date: '2023-05-01'

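Without knowing exactly what isn't working, it's hard to say what the problem is. One possible cause is that the real page's markup differs from the simplified example: extra attributes on the tags, different whitespace, or a value that spans several lines will all make these patterns fail (by default, . does not match a newline). As for ways to improve it, one of the comments suggested using a dedicated DOM parsing library, and I agree. If you must rely on regexes, I suggest making the patterns as specific as possible. For example, if the date is always in that format, match it with something like (\d{4}-\d{2}-\d{2}).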
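
To illustrate the DOM-based approach, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes. It assumes the page looks like the simplified example in the question (a div with id "bill-details" containing an h2 and several p elements); the real site's markup may well differ, so the XPath expressions would need adjusting. The function name extractBillDetails and the array keys are made up for this example. Narrow regexes are applied only to the text of each paragraph, not to the raw HTML.

```php
<?php

// Sample markup copied from the question; the real page may be structured differently.
$html = <<<'HTML'
<html>
  <body>
    <div id="bill-details">
      <h2>Electricity Bill Details</h2>
      <p>Payable Amount: $200</p>
      <p>Due Date: 2023-05-01</p>
      <p>Description: This is your electricity bill for the month of April 2023.</p>
    </div>
  </body>
</html>
HTML;

// Hypothetical helper (not part of any library): returns the heading,
// payable amount and due date, or null for anything it cannot find.
function extractBillDetails(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are often not valid HTML, so collect parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $details = ['heading' => null, 'payable_amount' => null, 'due_date' => null];

    // Assumption: the bill lives in <div id="bill-details">, as in the example.
    $heading = $xpath->query('//div[@id="bill-details"]/h2')->item(0);
    if ($heading !== null) {
        $details['heading'] = trim($heading->textContent);
    }

    foreach ($xpath->query('//div[@id="bill-details"]/p') as $p) {
        $text = trim($p->textContent);

        // Specific patterns: an optional "$" followed by a number, and a YYYY-MM-DD date.
        if (preg_match('/^Payable Amount:\s*\$?(\d+(?:\.\d{1,2})?)$/', $text, $m)) {
            $details['payable_amount'] = $m[1];
        } elseif (preg_match('/^Due Date:\s*(\d{4}-\d{2}-\d{2})$/', $text, $m)) {
            $details['due_date'] = $m[1];
        }
    }

    return $details;
}

$details = extractBillDetails($html);
print_r($details);
```

Because missing fields come back as null rather than triggering an undefined-index notice, it is easier to see which part of the page did not match when the real HTML turns out to be different from the example.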