Find all HTML files in a set of folders, extract specific HTML content and save content to new files


I have a folder structure containing thousands of HTML files that I'd like to clean up and convert to markdown using pandoc, while keeping the existing structure (or mirroring it).

I've currently managed to locate all HTML files using find, pass their content via cat to pup, which parses it and selects the <article> tag, and pipe the result to a new file called article-content.txt.

I was thinking of processing the content in two stages.

  1. Extract the article tag from each file and save it as a new file (or overwrite the existing files).
  2. Then convert the files, in the same structure, with pandoc.

My understanding of bash is limited. I understand I probably need to loop over the file list and pass each path/filename as a variable into a new-file construct, but I'm not sure where to go next.

cat $(find . -type f -name "*.html") | pup 'article' > article-content.txt
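(The command above concatenates every match into a single article-content.txt, which loses the per-file structure. The per-file loop described in the question can be sketched as follows; this is a sketch that assumes pup is installed and on PATH, and the `.article.html` output suffix is my own naming choice, not from the question.)

```shell
#!/usr/bin/env bash
# Stage 1 sketch: extract the <article> element from each HTML file
# individually, writing the result alongside the original file.
# -print0 / read -d '' handles filenames containing spaces or newlines.
find . -type f -name "*.html" -print0 |
while IFS= read -r -d '' file; do
    # ${file%.html} strips the .html extension before appending the new suffix,
    # so ./docs/page.html becomes ./docs/page.article.html
    pup 'article' < "$file" > "${file%.html}.article.html"
done
```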

1 Answer

Jeff Y (accepted answer)

If you want to perform a similar action on each file individually, find has the -exec and -execdir options built in for just that purpose (see man find):

find . -type f -name "*.html" -execdir bash -c "pup 'article' < {} > {}.md" \;