Scrape date from news articles


I'd like some help brainstorming an approach. I have a list of URLs, all of them news articles, and I need to scrape the published date of each one. Only a few of the articles put the date in a dedicated date tag in the HTML; most of them put it in some other tag, and the tag differs from site to site, so I can't find a generic approach that works for all the URLs. A few examples of how the published date appears:

<div style="vertical-align: bottom; float:left; width:45%;">
  <b>As on: June 19, 2023 </b>
  <br><br><br>
</div>
<ul>
  <li>Updated Mar 14, 2024, 7:25 AM IST</li>
</ul>

How can I identify these dates?
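
One direction that may be worth trying before looking at the visible text: many (not all) news sites also expose the date in machine-readable metadata, such as an Open Graph `article:published_time` meta tag, a schema.org `datePublished` field in a JSON-LD script, or a `<time datetime="...">` element. Below is a minimal sketch using only the standard library. The names `DateMetaParser`, `metadata_dates` and the `META_KEYS` list are hypothetical, and the list of keys is an assumption that would need extending for real sites.

```python
import json
from html.parser import HTMLParser

# Meta keys that commonly carry a publication date (assumed list, not exhaustive).
META_KEYS = {"article:published_time", "datepublished", "pubdate", "date", "dc.date"}


class DateMetaParser(HTMLParser):
    """Collects date candidates from <meta>, <time> and JSON-LD <script> tags."""

    def __init__(self):
        super().__init__()
        self.candidates = []      # (source, raw_value) pairs in document order
        self._in_jsonld = False
        self._jsonld_buf = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            key = (a.get("property") or a.get("name") or a.get("itemprop") or "").lower()
            if key in META_KEYS and a.get("content"):
                self.candidates.append(("meta:" + key, a["content"]))
        elif tag == "time" and a.get("datetime"):
            self.candidates.append(("time", a["datetime"]))
        elif tag == "script" and a.get("type", "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                payload = json.loads("".join(self._jsonld_buf))
            except ValueError:
                return  # malformed JSON-LD: ignore it
            # Only top-level objects are checked; nested "@graph" items are not.
            for item in payload if isinstance(payload, list) else [payload]:
                if isinstance(item, dict) and item.get("datePublished"):
                    self.candidates.append(("json-ld", item["datePublished"]))


def metadata_dates(html):
    """Return every metadata date candidate found in an HTML string."""
    parser = DateMetaParser()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

If `metadata_dates(page_html)` returns an empty list for a page, that page would still need a text-based fallback, which is the case for both examples above.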

One option is to use a regex to find dates, but that would also pick up dates mentioned in the article body. It wouldn't be able to distinguish the published date from an arbitrary date written inside the article.
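
One way to narrow the regex idea, sketched here under assumptions: only accept a date when it directly follows a cue such as "As on" or "Updated", as in the two examples above. The cue list, the single "Month day, year" date format, and the names `labelled_dates` and `parse_date` are all hypothetical choices for illustration; month parsing via `strptime` also assumes an English locale.

```python
import re
from datetime import datetime

# Cue words assumed to introduce a publication/update date; extend per site.
LABEL = r"\b(?:as on|last updated|updated|published|posted)\s*(?:on|at)?\s*:?\s*"
# Matches dates such as "June 19, 2023" or "Mar 14, 2024".
DATE = r"([A-Z][a-z]{2,8})\.?\s+(\d{1,2}),\s*(\d{4})"
LABELLED_DATE = re.compile(LABEL + DATE, re.IGNORECASE)


def parse_date(month, day, year):
    """Parse a month name (full or three-letter), day and year into a date."""
    text = f"{month} {day} {year}"
    for fmt in ("%B %d %Y", "%b %d %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # not a recognisable month name


def labelled_dates(html):
    """Return dates that appear right after a cue word, in document order."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, enough for a sketch
    found = []
    for match in LABELLED_DATE.finditer(text):
        parsed = parse_date(*match.groups())
        if parsed is not None:
            found.append(parsed)
    return found
```

This skips an unlabelled body date such as "held on January 5, 2020", but it can still be fooled by a body sentence that happens to contain a cue word (for example "the policy was updated March 3, 2021"), so it reduces the problem rather than solving it.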
