Is it possible to define an HTML selector that concatenates multiple selectors and separates them with a semicolon?

I'm trying to parse a simple HTML page with pup. This is a command-line HTML parser and it accepts general HTML selectors.

I want to select:

'div.aclass text{}' #(would be SampleA)

and I also want to select:

'div.bclass text{}' #(would be SampleB)

and I want to concatenate them and insert some custom text to get:

SampleA;MYEXTRASTRING;SampleB

I want to avoid calling pup more than once as it is slow.

I can select multiple tags:

'div.aclass text{}, div.bclass text{}'

but this results in:

SampleA
SampleB

Is there any better choice than pup for this purpose?

(Note: Python is NOT an option as it's very slow for my needs.)

1 answer

Answer by Kevin Cui

Multiple selectors do not seem to work with pup; there is an open issue about it here: https://github.com/ericchiang/pup/issues/59

To achieve this, I would suggest using the hxselect command, which is part of HTML-XML-utils: https://www.w3.org/Tools/HTML-XML-utils/README

Example:

curl -s http://example.com/ | hxselect -c -s ';MYEXTRASTRING;' 'body > div:nth-child(1) > h1:nth-child(1), body > div:nth-child(1) > p:nth-child(3) > a:nth-child(1)' | sed 's/\(.*\);MYEXTRASTRING;/\1/'

curl part:

curl downloads the HTML content of http://example.com; -s suppresses the progress output.

hxselect part:

hxselect supports multiple CSS selectors. Use , to separate these selectors.

-c: print only the content of each match, without the enclosing HTML tags

-s: separator text after each match. In your case, it's ;MYEXTRASTRING;

sed part:

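Because the -s separator is appended after every match, it appears twice here: once between the two results and once at the end. The sed expression removes only the last occurrence, since the greedy \(.*\) captures everything up to the final separator.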
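
The sed step can be checked on its own, without curl or hxselect. The sketch below is only an illustration: the input string simulates what hxselect would print for the two matches in the question, and strip_last_sep is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of hxselect or pup): delete the last
# ';MYEXTRASTRING;' on each input line. The greedy \(.*\) captures
# everything before the final separator, so earlier ones are kept.
strip_last_sep() {
    sed 's/\(.*\);MYEXTRASTRING;/\1/'
}

# Simulated hxselect output (assumption): the separator follows every
# match, including the last one.
raw='SampleA;MYEXTRASTRING;SampleB;MYEXTRASTRING;'

result=$(printf '%s\n' "$raw" | strip_last_sep)
printf '%s\n' "$result"   # prints SampleA;MYEXTRASTRING;SampleB
```

If the separator is changed, it has to be changed in both the -s option and the sed expression, and any characters that are special in a sed pattern (such as . or /) need escaping.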