I have a script:
cd ../data;
dossier=$(ls crawl);
let "compte = 1";
for file in $dossier
do
lynx --dump --nolist $file >> ../data/txt/$compte'.txt';
let "compte = compte + 1";
done
I am using lynx to extract the text from all my HTML files, but when I open the resulting text files, they contain:
410 GONE
This doesn't exist any more. Try html.com.
I do not understand why. When I am in my crawl folder in the terminal and run the lynx dump on each HTML file by hand, it produces the text file correctly. But when I use the script to loop over all my HTML files and run lynx on them, the results are wrong.
You need the protocol and (not sure about this) the path. For example:
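A minimal sketch of that idea, assuming the script is run from the data directory and the HTML files sit directly in crawl/. In the original loop, $file is a bare name such as page.html, and the current directory is data, not crawl, so lynx finds no local file with that name and apparently tries to resolve it as a remote address, which would explain the 410 GONE page. The function names to_file_url and dump_all are hypothetical, and -dump and -nolist are the single-dash spellings of the options used in the question.

```shell
#!/bin/sh

# Turn a local path into a file:// URL so lynx does not have to guess
# whether the argument is a file or a remote host.
# Note: paths containing spaces or special characters would additionally
# need percent-encoding, which this sketch does not do.
to_file_url() {
    # Resolve the containing directory to an absolute path.
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf 'file://%s/%s\n' "$dir" "$(basename "$1")"
}

# Dump every regular file in directory $1 to numbered .txt files in $2.
dump_all() {
    src=$1
    out=$2
    mkdir -p "$out"
    compte=1
    for file in "$src"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty folder.
        [ -f "$file" ] || continue
        lynx -dump -nolist "$(to_file_url "$file")" > "$out/$compte.txt"
        compte=$((compte + 1))
    done
}
```

From the data directory this would be called as dump_all crawl txt. Looping over "$src"/* instead of the output of ls keeps the directory prefix on each name, which is the part the original script loses.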