Calling nightmarejs in a loop


I am currently trying to scrape some data from Gartner Peer Insights. This is the sample URL: GPI

I want to scrape the short description and the long description of each review by iterating through the ul list. This is done using Nightmare and cheerio.

I have the following code:

const got = require('got');
import * as cheerio from 'cheerio';

const Nightmare = require('nightmare');

import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';

// Generous timeouts because the review pages are slow to load
const ngjs = new Nightmare({
    show: true,
    waitTimeout: 1800000,
    gotoTimeout: 1800000,
    loadTimeout: 1700000,
    executionTimeout: 1800000,
});


(async function () {
    const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above

    const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
    const new$ = cheerio.load(respBody);
    const completeNew = cheerio.load(new$.html());
    const data = completeNew('.uxd-truncate-text').text();
    // console.log('data:', data) // Just checking if I am getting proper data

    // There are two loops: one for the reviews in a page and another for the whole set of pages
    const readReviewList = completeNew('.read-review-link').children();
    // readReviewList is the one that I am planning to iterate over, albeit there is a caveat described in the note at the end
    await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);
})();

async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
    await scrapeOneReview(reviewsMainPageUrl);
}

async function ngjsResult(reviewsMainPageUrl, index = 1) {
    console.log('call to result', index);
    return new Promise((resolve, reject) => {
        ngjs
            .goto(reviewsMainPageUrl)
            .wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
            .click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
            .wait('#review-nav > li.active > a')
            .evaluate(() => {
                // Scrape items from the review detail page
                let reviewedCustomerDetails: Record<string, string> = {};
                let completeReviewDetails: Record<string, string> = {};
                let otherQAs: Record<string, string>[] = [];

                // construct completeReviewDetails
                const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
                const overallRating = document.querySelector('div.avgStarIcon > span > span')?.getAttribute('style')?.substr(7)?.replace(/[%]/, '')?.replace(';', '');
                const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
                const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;

                completeReviewDetails = {
                    reviewTitle: reviewTitle,
                    reviewRatingUsefulNess: reviewRatingUseful,
                    reviewOverallRating: overallRating,
                    reviewCompleteDetail: completeReview,
                };

                // construct reviewerProfile
                const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
                const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
                const reviewerRole = document.querySelector('#roles > span')?.textContent;
                const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
                const reviewerImplementationStrategy = document.querySelector('#profile > div > div.user-info.row > div > div:nth-child(3) > span')?.textContent;

                reviewedCustomerDetails = {
                    reviewerProfile: reviewerProfile,
                    reviewerIndustry: reviewerIndustry,
                    reviewerRole: reviewerRole,
                    reviewerIndustrySize: reviewerIndustrySize,
                    reviewerImplementationStrategy: reviewerImplementationStrategy,
                };
                // construct otherQAs
                return { rd: completeReviewDetails, rP: reviewedCustomerDetails, url: window.location.href };
            })
            .then(async data => {
                console.log('getting data:', data);
                resolve(data);
            })
            .catch(err => {
                console.log('err:', err);
                resolve({});
            });
    });
}

async function scrapeOneReview(reviewsMainPageUrl) {
    const detailsList = [];

// THIS IS THE PART WITH the issue. When I call it as a single Nightmare call, it works fine:
    const ds = await ngjsResult(reviewsMainPageUrl);
    detailsList.push(ds);
    console.log('compl:', detailsList);

// But if I want to loop through it, there comes a problem. One way I have identified to get around this is chaining like below, but it is not dynamic:
    const ds2 = await ngjsResult(reviewsMainPageUrl, 1).then(async data => {
        return await ngjsResult(reviewsMainPageUrl, 2);
    });
    // .then() has to be appended for the entire list
    detailsList.push(ds2);
    console.log('compl:', detailsList);

    return detailsList;
}

Note: the ul has a list of li elements plus a couple of independent span and div elements, so do not assume that iterating the ul element will give only the desired li items. For current purposes, let us say I want to iterate over the first two reviews.

Is there a proper way to loop around this and get the desired result?

UPDATE:

I did try with a for loop like this:

    for (let i = 1; i < 15; i++) {
        const ds = await ngjsResult(reviewsMainPageUrl, i);
        detailsList.push(ds);
    }

but I am getting errors like:

    Error: navigation error
        at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
        at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
        at Object.onceWrapper (events.js:520:26)
        at EventEmitter.emit (events.js:400:28)
        at EventEmitter.emit (domain.js:470:12)
        at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
        at ChildProcess.emit (events.js:400:28)
        at ChildProcess.emit (domain.js:470:12)
        at emit (internal/child_process.js:910:12)
        at processTicksAndRejections (internal/process/task_queues.js:83:21) {
      code: -3,
      details: 'ERR_ABORTED',
      url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
    }
    

I understand that indexes like 2 or 4 might not have a corresponding li selector (there won't be a li:nth-child(2) or li:nth-child(4)) because, as I said above, when I debugged using Chrome I could see that the ul element had other HTML elements like span and div among its children. But the above error comes for everything, even for a valid li selector like li:nth-child(3) or 6, 7, 8...
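One way to make the loop more robust, assuming the failures come from clicking selectors that do not exist and that only one navigation should be in flight at a time on a single Nightmare instance: check each selector before clicking and run the calls strictly sequentially. (Separately, since the ul mixes li with span and div children, `li:nth-of-type(${index})` counts only the li siblings, unlike `li:nth-child(${index})`.) Below is a generic sketch; the `exists` and `scrape` callbacks are illustrative stand-ins for Nightmare's `.exists(selector)` and the `ngjsResult` call above:

```typescript
// Sequentially scrape up to maxIndex reviews, skipping indexes whose
// selector is missing. `exists` would wrap Nightmare's .exists(selector)
// on the review button, and `scrape` would wrap ngjsResult; both are
// stand-ins here so the sketch is self-contained.
async function scrapeSequentially<T>(
    maxIndex: number,
    exists: (index: number) => Promise<boolean>,
    scrape: (index: number) => Promise<T>,
): Promise<T[]> {
    const results: T[] = [];
    for (let i = 1; i <= maxIndex; i++) {
        // Await inside the loop so only one navigation is in flight at a time.
        if (await exists(i)) {
            results.push(await scrape(i));
        }
    }
    return results;
}
```

In Nightmare terms, `exists` could be something like `(i) => ngjs.exists(\`...ul > li:nth-of-type(${i}) ... button\`)` run before the `.click()`, so invalid indexes are skipped instead of erroring.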

1 Answer

Answer by devilpreet:

You can have multiple Nightmare instances, which would work faster.
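A minimal sketch of running several instances in parallel, assuming each worker owns its own Nightmare instance; `scrapeOne` is an illustrative stand-in for a per-URL scrape backed by one instance:

```typescript
// Drain a queue of URLs with a fixed number of parallel workers.
// Each worker would hold its own Nightmare instance internally;
// here `scrapeOne` is a stand-in so the sketch is self-contained.
async function runInParallel<T>(
    urls: string[],
    workers: number,
    scrapeOne: (url: string) => Promise<T>,
): Promise<T[]> {
    const results: T[] = new Array(urls.length);
    let next = 0;
    // Each worker pulls the next unclaimed URL until the queue is empty.
    // Claiming the index before the await keeps workers from colliding.
    async function worker(): Promise<void> {
        while (next < urls.length) {
            const i = next++;
            results[i] = await scrapeOne(urls[i]);
        }
    }
    await Promise.all(Array.from({ length: workers }, worker));
    return results;
}
```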

Based on my experience, this problem is easier to manage if you have two types of scrapers:

1. Base collector scraper
   This collects all base page details and description URLs, and puts these into some storage, a database perhaps.
2. Description URL scraper
   This checks which description URLs have not yet been scraped and then runs to get the details, with multiple instances in parallel perhaps.

This includes some overhead for maintaining records, but it pays off well as it helps to implement mechanisms like retries, a maximum article count, and date-wise decisions.
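The two-scraper split above can be sketched with an in-memory store standing in for the database; all names here are illustrative, and in practice the collector would parse the listing page while `fetchDetails` would be a Nightmare call like `ngjsResult`:

```typescript
// Minimal two-stage pipeline: a collector fills a work queue of review
// URLs, and a detail scraper drains whatever has not been scraped yet.
interface ReviewRecord {
    url: string;
    scraped: boolean;
    details?: Record<string, string>;
}

class ReviewStore {
    private records = new Map<string, ReviewRecord>();

    add(url: string): void {
        // Deduplicate so re-running the collector is safe.
        if (!this.records.has(url)) {
            this.records.set(url, { url, scraped: false });
        }
    }

    pending(): ReviewRecord[] {
        return [...this.records.values()].filter(r => !r.scraped);
    }

    markScraped(url: string, details: Record<string, string>): void {
        const rec = this.records.get(url);
        if (rec) {
            rec.scraped = true;
            rec.details = details;
        }
    }
}

// Stage 1: collector — would parse the listing page and push review URLs.
function collect(store: ReviewStore, urls: string[]): void {
    urls.forEach(u => store.add(u));
}

// Stage 2: detail scraper — only touches URLs not yet scraped, which is
// what makes retries cheap: re-running it resumes where it left off.
async function scrapePending(
    store: ReviewStore,
    fetchDetails: (url: string) => Promise<Record<string, string>>,
): Promise<number> {
    let count = 0;
    for (const rec of store.pending()) {
        store.markScraped(rec.url, await fetchDetails(rec.url));
        count++;
    }
    return count;
}
```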

To add, this mixes beautifully with Apache NiFi, where you can extend on the go (live). Additionally, if you design it well, all vitals/stats are easily visible.