I'm making an application that monitors URLs for changes. To program the application logic I am using Google Apps Script and a Google Sheet.
Here is the monitoring mechanism I have in mind. First, the script reads data from a sheet with the following columns:
URL: We indicate the URLs we want to monitor
First Time: Indicates if it is the first time that a URL is analyzed.
Changes: Indicates if changes have been made or not with respect to the previous time it has been analyzed.
HashValue: MD5 hash of the HTML code of the analyzed URL.
When the script runs, it reads the rows of the sheet one by one. For each row:
- The URL is read and UrlFetchApp.fetch is called to get a response from that web page.
- The getContentText method is called on the response to get the HTML code of the web page, which is saved in a variable.
- The MD5 hash algorithm is applied to the HTML code and the result is saved in a variable.
- If the URL is being analyzed for the first time, the Changes column is set to indicate that no changes have been made (since this is the first analysis) and the hashed HTML code is saved in the HashValue column.
- If the URL has been analyzed before, the previously stored HashValue is compared with the one just obtained.
- If the values differ, the Changes column is set to indicate that there have been changes and the new hash value is saved in the HashValue column.
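The per-row decision described in the steps above can be sketched roughly as follows. This is only a minimal sketch in plain JavaScript: the function name `decideRowUpdate` and the shape of the `row` object are hypothetical, and setting First Time to false after the first analysis is an assumption. Fetching the page and computing the MD5 hash are left out, so `newHash` is assumed to be already computed.

```javascript
// Hypothetical helper: decides what to write back for one sheet row.
// row     = { firstTime: boolean, hashValue: string }  (assumed shape)
// newHash = hash of the HTML that was just fetched (computed elsewhere)
function decideRowUpdate(row, newHash) {
  if (row.firstTime) {
    // First analysis: nothing to compare against, so report "no changes"
    // and store the hash. Clearing firstTime is an assumption.
    return { firstTime: false, changes: false, hashValue: newHash };
  }
  // Later analyses: compare the stored hash with the new one.
  const changed = row.hashValue !== newHash;
  return {
    firstTime: false,
    changes: changed,
    // Only overwrite the stored hash when the content actually changed.
    hashValue: changed ? newHash : row.hashValue
  };
}
```

With this shape, the loop over the sheet only has to read the row, compute the hash, call the helper, and write the three returned values back.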
I have already written the code, and it works with some websites but not with others. I compared the HTML code of the websites where it did not work using an online text comparator and noticed the following:
On some websites, reloading the same page twice produces slightly different code even though the content is static. For example, an HTML tag may have the ID box-wrap-140 on one load and the ID box-wrap-148 on the next.
As a result, the script as implemented reports changes simply because the HTML code is different. After a lot of research I can't find an alternative that solves this problem, hence the question in the title.
PS: We can ignore details such as the website being down or returning 404, 301, etc. response codes. This has already been programmed and works correctly.
PS2: Sorry for my level of English.
You can use cheerio GS to look for specific tags and either exclude their changes (e.g. <footer>) or include only their changes (e.g. <div>).
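To illustrate the idea of filtering the HTML before hashing, here is a rough sketch. Note that it does not use cheerio: it is a plain-regex stand-in so that it runs anywhere, and a real parser with selectors would be more robust. The function names (`stripTag`, `normalizeIds`, `prepareForHash`) are hypothetical, and the assumption that volatile IDs end in `-<digits>` is taken only from the box-wrap-140 / box-wrap-148 example in the question.

```javascript
// Remove every <tagName>...</tagName> element from the HTML.
// Regex-based, so nested elements of the same tag are not handled correctly.
function stripTag(html, tagName) {
  const pattern = new RegExp(
    '<' + tagName + '\\b[^>]*>[\\s\\S]*?</' + tagName + '\\s*>', 'gi');
  return html.replace(pattern, '');
}

// Replace a trailing "-<digits>" in id="..." values with a fixed placeholder,
// so box-wrap-140 and box-wrap-148 both become box-wrap-N.
// Note: this also touches attributes that end in "-id", such as data-id.
function normalizeIds(html) {
  return html.replace(/(\bid\s*=\s*["'][^"']*?-)\d+(["'])/gi, '$1N$2');
}

// Apply both steps; the result is what would be hashed instead of the raw HTML.
function prepareForHash(html, excludedTags) {
  const stripped = excludedTags.reduce((acc, tag) => stripTag(acc, tag), html);
  return normalizeIds(stripped);
}

// Two loads of the "same" page that differ only in a generated ID and the footer.
const loadA = '<div id="box-wrap-140"><p>Hello</p></div><footer>t=1</footer>';
const loadB = '<div id="box-wrap-148"><p>Hello</p></div><footer>t=2</footer>';
console.log(prepareForHash(loadA, ['footer']) === prepareForHash(loadB, ['footer']));
```

In the script, the string returned by `prepareForHash` would be passed to the MD5 step in place of the raw `getContentText` result, so only differences outside the excluded tags and generated IDs change the stored hash.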