Extracting text from HTML files with bash


I have a script:

cd ../data;
dossier=$(ls crawl);

let "compte = 1";

for file in $dossier
do
    lynx --dump --nolist $file >> ../data/txt/$compte'.txt';
    let "compte = compte + 1";
done

I am using lynx to extract the text from all my HTML files, but when I open the resulting text file, it contains:

410 GONE

This doesn't exist any more. Try html.com.

I do not know why. When I am in the terminal inside my crawl folder and run the lynx dump on each HTML file by hand, it produces the text file correctly, but when I use the script to run lynx over all my HTML files, the results are not good.


1 Answer

fernand0:

You need the protocol and (not sure about this) the path. For example:

lynx -dump file:///where/my/file/is/file.html
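For the script in the question, a minimal sketch of that fix might look like the following. It assumes the HTML files live in ../data/crawl and that ../data/txt already exists (both taken from the question's paths); a glob replaces the ls call so each entry keeps its crawl/ prefix:

cd ../data
compte=1
for file in crawl/*.html
do
    # file:// plus an absolute path makes lynx read the local file
    # instead of guessing a web address from a bare file name.
    lynx -dump -nolist "file://$PWD/$file" >> "txt/$compte.txt"
    compte=$((compte + 1))
done

With a path that actually resolves on disk, lynx renders the local file instead of trying to fetch a similarly named page from the web, which is presumably where the 410 GONE response in the output came from.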