Avoid reading corrupted file


I currently have a loop that reads HTML code saved in files with a .txt extension. To extract the text from the HTML, I have been using the gettxt() function from the htm2txt package.

The problem is that, among about 120,000 files, some are corrupted. Specifically, I tried to open one of the corrupted files manually by changing its extension to .html, but even the browser does not finish loading it after a very long time. The content seems to be broken, and below it there is a seemingly endless stretch of blank whitespace. Because of this, the loop effectively stalls: reading such a file appears to take an unbounded amount of time, although the system shows that the loop is still running.
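One way to act on this symptom is to screen files by size before reading them. This is only a rough sketch: it assumes the trailing whitespace makes the corrupted files much larger on disk than the healthy ones, which has not been verified here. The helper name `flag_oversized` and the multiplier `factor = 20` are hypothetical choices for illustration.

```r
# Hypothetical helper: flag files whose size is far above the typical size.
# Assumption: corrupted files are padded with whitespace and therefore
# much larger than normal files. Tune `factor` on a known sample first.
flag_oversized <- function(paths, factor = 20) {
  sizes <- file.size(paths)                      # size in bytes; NA if a file is missing
  cutoff <- factor * median(sizes, na.rm = TRUE) # threshold relative to the typical file
  !is.na(sizes) & sizes > cutoff                 # TRUE marks a suspicious file
}

# Usage (the directory name is a placeholder):
# files <- list.files("html_dir", pattern = "\\.txt$", full.names = TRUE)
# suspicious <- files[flag_oversized(files)]
```

`file.size()` only queries file metadata, so the check does not read file contents. If the corrupted files turn out to be no larger than the others, this heuristic will not help.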

I tried to solve this problem by adding a condition along the lines of "give up reading the txt file if it takes more than 10 seconds". I asked ChatGPT for help, but even after multiple attempts it did not provide a working answer.

I would be grateful for some help with this issue. Ideally, I would like to detect that a file is corrupted without reading it at all. If that is not possible, I would like to know how to abort reading a file that takes more than 10 seconds and continue with the next one. Unless necessary, I would like to keep using the gettxt() function to read the files.
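The "give up after 10 seconds" idea could be sketched with base R's `setTimeLimit()`, as below. The wrapper name `read_with_timeout` and its `reader` argument are hypothetical. One important caveat: R checks the time limit only at points where a user interrupt could occur, so if `gettxt()` spends its time inside compiled code that never checks for interrupts, the limit may not fire and the read would have to run in a separate process instead.

```r
# Hypothetical wrapper: call `reader(path)`, but give up after `seconds`
# of elapsed time and return NA so the loop can move on to the next file.
read_with_timeout <- function(path, reader, seconds = 10) {
  # Always clear the limit on exit, whether or not the call succeeded.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  setTimeLimit(elapsed = seconds, transient = TRUE)
  tryCatch(
    reader(path),
    error = function(e) {
      # Catches the time-limit error as well as any other read/parse error.
      message("Skipping ", path, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Usage with gettxt() (assumes htm2txt is installed and `files` holds the paths):
# texts <- lapply(files, read_with_timeout, reader = htm2txt::gettxt, seconds = 10)
```

Files that come back as `NA` can then be logged and inspected separately, while the loop continues through the remaining files.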

Thank you very much in advance!


There are 0 answers