I am currently trying to scrape some data from Gartner Peer Insights. This is the sample URL: - GPI
As you can see, I want to scrape the short description and the long description of each review by iterating through the ul list. This is done using nightmarejs and cheerio.
I have the following code:
const got = require('got');
import cheerio = require('cheerio');
const Nightmare = require('nightmare');
import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';
const ngjs = new Nightmare({ show: true, waitTimeout: 1800000, gotoTimeout: 1800000, loadTimeout: 1700000, executionTimeout: 1800000 });
(async function() {
const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above
const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
const completeNew = cheerio.load(respBody);
const data = completeNew('.uxd-truncate-text').text();
//console.log('data:', data) // Just checking if I am getting proper data
// Now there are two loops - one for the reviews in a page and another one for the whole set of pages
// const browser = await puppeteer.launch({headless: false})
// const page = await browser.newPage();
const readReviewList = completeNew('.read-review-link').children();
// readReviewList is the one I am planning to iterate over, although there is a caveat described in the Note at the end
await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);
})()
async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
await scrapeOneReview(reviewsMainPageUrl)
}
async function ngjsResult(reviewsMainPageUrl, index=1) {
console.log('call to result', index)
return new Promise((resolve, reject) => {
ngjs
.goto(reviewsMainPageUrl)
.wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
.click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
.wait('#review-nav > li.active > a')
.evaluate(() => {
// getting scrape items from the review detail page
let reviewedCustomerDetails: Record<string, string> = {};
let completeReviewDetails: Record<string, string> = {};
let otherQAs: Record<string, string>[] = [];
// construct completeReviewDetails
const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
const overallRating = document.querySelector('div.avgStarIcon > span > span')?.getAttribute('style')?.slice(7)?.replace(/[%]/, '')?.replace(';', '');
const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;
completeReviewDetails = {
reviewTitle: reviewTitle,
reviewRatingUsefulNess: reviewRatingUseful,
reviewOverallRating: overallRating,
reviewCompleteDetail: completeReview,
};
// construct reviewerProfile
const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
const reviewerRole = document.querySelector('#roles > span')?.textContent;
const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
const reviewerImplementationStrategy = document.querySelector('#profile > div > div.user-info.row > div > div:nth-child(3) > span')?.textContent;
reviewedCustomerDetails = {
reviewerProfile: reviewerProfile,
reviewerIndustry: reviewerIndustry,
reviewerRole: reviewerRole,
reviewerIndustrySize: reviewerIndustrySize,
reviewerImplementationStrategy: reviewerImplementationStrategy,
};
// construct otherQAs
return { rd: completeReviewDetails, rP: reviewedCustomerDetails, url: window.location.href };
})
.then(data => {
console.log('getting data:', data);
resolve(data);
}).catch(err => {
console.log('err:', err);
resolve({});
});
})
}
async function scrapeOneReview(reviewsMainPageUrl) {
const detailsList = [];
// THIS IS THE PART WITH the issue. When I make a single Nightmare call, it works fine:
const ds = await ngjsResult(reviewsMainPageUrl);
detailsList.push(ds);
console.log('compl:', detailsList);
// But when I want to loop through the list, there is a problem. The only way
// I have found around it is chaining calls like below, but it is not dynamic:
const ds2 = await ngjsResult(reviewsMainPageUrl, 1).then(data => {
return ngjsResult(reviewsMainPageUrl, 2);
});
// a .then() would have to be appended for every item in the list
detailsList.push(ds2);
console.log('compl:', detailsList);
return detailsList;
}
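The manual .then() chaining above can be made dynamic with a plain sequential loop. A minimal sketch, assuming the `(url, index)` signature of ngjsResult: `fetchReview` is passed in as a stand-in so the loop itself stays testable, and the indices array stands in for whatever list of valid positions you end up with.

```typescript
// Sequentially scrape an arbitrary list of review indices, awaiting each
// call before starting the next, so only one Nightmare navigation is in
// flight at a time. `fetchReview` is a stand-in for ngjsResult (assumed
// signature: (url, index) => Promise of the scraped data).
async function scrapeAllSequential(
  url: string,
  indices: number[],
  fetchReview: (url: string, index: number) => Promise<unknown>,
): Promise<unknown[]> {
  const detailsList: unknown[] = [];
  for (const i of indices) {
    // await inside for..of gives exactly the "append a .then() per item"
    // behavior, but for any number of items
    detailsList.push(await fetchReview(url, i));
  }
  return detailsList;
}
```

Usage would then be `const detailsList = await scrapeAllSequential(reviewsMainPageUrl, [1, 3, 6], ngjsResult);` in place of the hand-written chain.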
Note: the ul has a list of li elements mixed in with other elements such as span and div, so the indexes do not map one-to-one to reviews.
Is there a proper way to loop over this and get the desired result?
UPDATE:
I did try a for loop like this:
for(let i=1; i<15; i++) {
const ds = await ngjsResult(reviewsMainPageUrl, i)
detailsList.push(ds)
}
but I am getting errors like
Error: navigation error
at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
at Object.onceWrapper (events.js:520:26)
at EventEmitter.emit (events.js:400:28)
at EventEmitter.emit (domain.js:470:12)
at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
at ChildProcess.emit (events.js:400:28)
at ChildProcess.emit (domain.js:470:12)
at emit (internal/child_process.js:910:12)
at processTicksAndRejections (internal/process/task_queues.js:83:21) {
code: -3,
details: 'ERR_ABORTED',
url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
}
I understand that indexes like 2 and 4 might not have a corresponding li selector (there won't be an li:nth-child(2) or li:nth-child(4)), because, as I said above, when I debugged in Chrome I could see that the ul element contained other HTML elements like span and div alongside the li elements. But the errors above occur for everything, even for valid li selectors like li:nth-child(3), 6, 7, 8, and so on.
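Since :nth-child counts every element child of the ul (spans and divs included), one way to avoid clicking positions that are not reviews is to compute the valid li indices up front and only loop over those. A minimal sketch: the tag-name array would come from the cheerio pass already in the script, e.g. something like `completeNew('<ul selector>').children().toArray().map(el => el.tagName)` (selector assumed), and the function itself is pure so it can be checked in isolation.

```typescript
// Given the tag names of a <ul>'s element children in document order,
// return the 1-based :nth-child indices that actually select an <li>.
// (CSS :nth-child counts every element child, not just <li>s, which is
// why li:nth-child(2) can fail when child 2 is a <span> or <div>.)
function validLiIndices(childTags: string[]): number[] {
  const indices: number[] = [];
  childTags.forEach((tag, i) => {
    if (tag.toLowerCase() === 'li') indices.push(i + 1); // nth-child is 1-based
  });
  return indices;
}
```

The loop would then iterate over `validLiIndices(...)` instead of a hard-coded `i < 15`.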
You can run multiple Nightmare instances, which would work faster.
Based on my experience, this problem is easier to manage if you have two types of scrapers:
1. A listing scraper: collects all base-page details and description URLs, and puts them into some storage, perhaps a database.
2. A detail scraper: on start, checks which description URLs have not yet been scraped, then runs to fetch the details, perhaps several in parallel.
This adds some overhead for maintaining records, but it pays off well because it lets you implement mechanisms like retries, a maximum article count, and date-based decisions.
To add, this mixes beautifully with Apache NiFi, where you can extend the pipeline on the go (live). Additionally, if you design it well, all vitals/stats are easily visible.
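As a sketch of the parallel detail-scraper idea, here is a minimal concurrency-limited runner. `scrapeOne` is a stand-in for whatever single-URL scrape function you use (for example, one Nightmare instance per worker); the limit and the URL list are assumptions for illustration.

```typescript
// Run scrapeOne over a list of URLs with at most `limit` in flight at once.
// Each worker pulls the next unscraped index from a shared counter, so you
// get a fixed pool of workers rather than one task per URL.
async function runWithConcurrency<T>(
  items: string[],
  limit: number,
  scrapeOne: (url: string) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop: no race on `next`
      results[i] = await scrapeOne(items[i]);
    }
  }
  // spawn min(limit, items.length) workers and wait for all of them
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Results come back in input order, which makes it easy to mark the corresponding records as scraped in storage afterwards.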