Find all HTML files in a set of folders, extract specific HTML content and save content to new files


I have a folder structure containing thousands of HTML files that I'd like to clean up and convert to markdown using pandoc, while keeping the existing structure (or mirroring it).

I've currently managed to locate all HTML files using find, pass their content via cat to pup, which parses it and selects the <article> tag, and pipe the result to a new file called article-content.txt.

I was thinking of processing the content in two stages.

  1. Extract the article tag from each file and save it as a new file (or overwrite the existing files).
  2. Then convert the files, in the same structure, with pandoc.

My understanding of bash is limited. I understand I probably need to loop over the file list and pass each path/filename as a variable into a new-file construct, but I'm not sure where to go next.

cat $(find . -type f -name "*.html") | pup 'article' > article-content.txt
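(The command above concatenates every match into a single article-content.txt, which loses the per-file structure. The per-file loop described in the question can be sketched as follows; this is a sketch that assumes pup is installed and on PATH, and the `.article.html` output suffix is my own naming choice, not from the question.)

```shell
#!/usr/bin/env bash
# Stage 1 sketch: extract the <article> element from each HTML file
# individually, writing the result alongside the original file.
# -print0 / read -d '' handles filenames containing spaces or newlines.
find . -type f -name "*.html" -print0 |
while IFS= read -r -d '' file; do
    # ${file%.html} strips the .html extension before appending the new suffix,
    # so ./docs/page.html becomes ./docs/page.article.html
    pup 'article' < "$file" > "${file%.html}.article.html"
done
```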

1 Answer

Jeff Y (accepted answer)

If you want to perform a similar action on each file individually, find has the -exec and -execdir options built in for just that purpose (see man find):

find . -type f -name "*.html" -execdir bash -c "pup 'article' < {} > {}.md" \;