Webscraping with cheerio: Deleting or ignoring a child element?

I have a website I want to scrape, structured as follows:

<p><strong>some headline:</strong> some content etc. blabla </p>

<p><strong>some other headline:</strong> some more content etc. blabla </p>
// and so on...

I scrape it with cheerio as follows:

$('p strong').each(function(i, element){
  // gets me the headline
  console.log($(this).text());

  // gets me the content, but unfortunately also the headline again
  console.log("Parent:" + $(this).parent().text());
});

For now I am just logging everything, but later I want to save headlines and content in separate variables. The problem is that the headline (inside the <strong> tag) is itself part of the <p> tag, so my second call, which is meant to get only the content, returns the headline again as well. How can I separate or delete everything inside the <strong> tag and keep only the rest of the <p> tag, i.e. only the content?

Answer by T.J. Crowder (best answer):

It is probably simplest to read the headline text first, remove the headline element, and then read the parent's text:

$('p strong').each(function(i, element){
  var $this = $(this);
  var headline = $this.text();     // Get headline text
  var parent = $this.parent();     // Get parent
  $this.remove();                  // Remove headline element
  var body = parent.text();        // Get body text
  // ...
});