Scrape text from a complex DOM structure

Consider the following hierarchy in the DOM:

<div class="bodyCells">
    <div style="foo">
       <div style="foo">
           <div style="foo1"> 'contains the list of text elements I want to scrape' </div>
           <div style="foo2"> 'contains the list of text elements I want to scrape' </div>
       </div>
       <div style="foo">
           <div style="foo3"> 'contains the list of text elements I want to scrape' </div>
           <div style="foo4"> 'contains the list of text elements I want to scrape' </div>
       </div>
    </div>
</div>

Using the class name bodyCells, I need to scrape the data from each of the divs one at a time (i.e., first from the 1st div, then from the next div, and so on) and store each result in a separate array. How can I achieve this using Puppeteer?

NOTE: I have tried using the class name directly, but that returns all the texts in a single array. I need the data from each tag in its own array.

Expected output:

array1 = ["text present within style='foo1' div tag"]
array2 = ["text present within style='foo2' div tag"]
array3 = ["text present within style='foo3' div tag"]
array4 = ["text present within style='foo4' div tag"]

There is 1 answer

Tore On

As you noted, selecting by class name returns all of the texts in a single array. If you instead iterate over the child elements of each match, you can create a separate array for each subsection.

I created a fiddle here - https://jsfiddle.net/32bnoey6/ - with this example code:

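// Every element that carries the bodyCells class
const cells = document.getElementsByClassName('bodyCells');

// One single-item array per target div, in document order
const scrapedElements = [];
for (let i = 0; i < cells.length; i++) {
  const item = cells[i];
  // First level below .bodyCells
  for (let j = 0; j < item.children.length; j++) {
    const outerDiv = item.children[j];
    // Second level: the divs whose content gets collected
    const innerDivs = outerDiv.children;
    for (let k = 0; k < innerDivs.length; k++) {
      const targetDiv = innerDivs[k];
      // innerHTML keeps any nested markup; use textContent for plain text only
      scrapedElements.push([targetDiv.innerHTML]);
    }
  }
}

console.log(scrapedElements);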
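
The loops above stop two levels below .bodyCells, while the markup in the question nests the target divs one level deeper. As a rough sketch that does not depend on the nesting depth, the walk below puts every leaf element (one with no child elements) into its own single-item array. The names collectLeafTexts, leaf, and box are made up for illustration, and the plain-object tree only stands in for real DOM nodes so that the snippet runs in Node without a browser.

```javascript
// Collect the text of every leaf element (an element with no child elements)
// under `root`, one single-item array per leaf, in document order.
// It only relies on `children` and `textContent`, so it accepts real DOM
// elements as well as the plain-object stand-ins built below.
function collectLeafTexts(root) {
  const result = [];
  const visit = (node) => {
    const kids = Array.from(node.children || []);
    if (kids.length === 0) {
      // Leaf reached: store its trimmed text in an array of its own
      result.push([node.textContent.trim()]);
      return;
    }
    kids.forEach(visit);
  };
  // Start below the root so the container itself is never collected
  Array.from(root.children || []).forEach(visit);
  return result;
}

// Hypothetical stand-in for the question's markup (not real DOM nodes)
const leaf = (text) => ({ children: [], textContent: text });
const box = (...children) => ({ children, textContent: '' });
const bodyCells = box(
  box(
    box(leaf(' text 1 '), leaf(' text 2 ')),
    box(leaf(' text 3 '), leaf(' text 4 '))
  )
);

const [array1, array2, array3, array4] = collectLeafTexts(bodyCells);
console.log(array1, array2, array3, array4);
```

In Puppeteer, passing the function as page.$eval('.bodyCells', collectLeafTexts) should return the same kind of nested array for the first matching element, because the function references nothing outside its own body. Treat that as an assumption to verify against the actual page.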