I am currently trying to scrape some data from Gartner Peer Insights. This is the sample URL: - GPI
As you can see, I want to scrape the short description and the long description of each review by iterating through the ul list. This is done using nightmarejs and cheerio.
I have the following code:
const got = require('got');
import cheerio = require('cheerio');
const Nightmare = require('nightmare');
import { AppConfigService } from './../src/modules/common/services/app-config/app-config.service';
import { APP_CONST } from './../src/modules/common/utils/app.constant';
import { HtmlTestService } from './../src/modules/scrapper/utils/html-test.service';
const ngjs = new Nightmare({ show: true, waitTimeout: 1800000, gotoTimeout: 1800000, loadTimeout: 1700000, executionTimeout: 1800000 });
(async function() {
const reviewsMainPageUrl = 'https://gartner.com' + reviewRelativePath; // Please assume the URL provided above
const respBody = await getRespFromWebScrapingApi(reviewsMainPageUrl);
const completeNew = cheerio.load(respBody);
const data = completeNew('.uxd-truncate-text').text();
//console.log('data:', data) // Just checking if I am getting proper data
// Now there are two loops - one for the reviews in a page and another one for the whole set of pages
// const browser = await puppeteer.launch({headless: false})
// const page = await browser.newPage();
const readReviewList = completeNew('.read-review-link').children();
// readReviewList is the one I am planning to iterate over, although there is a caveat described in the Note at the end
await scrapeFullReviewNMJS(reviewsMainPageUrl, readReviewList);
})()
async function scrapeFullReviewNMJS(reviewsMainPageUrl: string, readReviewList) {
await scrapeOneReview(reviewsMainPageUrl)
}
async function ngjsResult(reviewsMainPageUrl, index=1) {
console.log('call to result', index)
return new Promise((resolve, reject) => {
ngjs
.goto(reviewsMainPageUrl)
.wait('#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul')
.click(`#body-container > div > div.col-sm-9 > div > div.product-reviews-snippet-wrapper > ul > li:nth-child(${index}) > div > div.read-review-link > button`)
.wait('#review-nav > li.active > a')
.evaluate(() => {
// getting scrape items from the review detail page
let reviewedCustomerDetails: Record<string, string> = {};
let completeReviewDetails: Record<string, string> = {};
let otherQAs: Record<string, string>[] = [];
// construct completeReviewDetails
const reviewTitle = document.querySelector('div.category.headline.condensed > h2')?.textContent;
const overallRating = document.querySelector('div.avgStarIcon > span > span')?.getAttribute('style')?.slice(7)?.replace(/[%]/, '')?.replace(';', '');
const reviewRatingUseful = document.querySelector('#review-helpful')?.textContent;
const completeReview = document.querySelector('#sub-head > p > span.commentSuffix')?.textContent;
completeReviewDetails = {
reviewTitle: reviewTitle,
reviewRatingUsefulNess: reviewRatingUseful,
reviewOverallRating: overallRating,
reviewCompleteDetail: completeReview,
};
// construct reviewerProfile
const reviewerProfile = document.querySelector('#profile > div > div.user-info.row > div > div.reviewer-title.row > div.col-xs-10.title > span')?.textContent;
const reviewerIndustry = document.querySelector('#industry > span')?.textContent;
const reviewerRole = document.querySelector('#roles > span')?.textContent;
const reviewerIndustrySize = document.querySelector('#companySize > span')?.textContent;
const reviewerImplementationStrategy = document.querySelector('#profile > div > div.user-info.row > div > div:nth-child(3) > span')?.textContent;
reviewedCustomerDetails = {
reviewerProfile: reviewerProfile,
reviewerIndustry: reviewerIndustry,
reviewerRole: reviewerRole,
reviewerIndustrySize: reviewerIndustrySize,
reviewerImplementationStrategy: reviewerImplementationStrategy,
};
// construct otherQAs
return { rd: completeReviewDetails, rP: reviewedCustomerDetails, url: window.location.href };
})
.then(data => {
console.log('getting data:', data);
resolve(data);
}).catch(err => {
console.log('err:', err);
resolve({});
});
})
}
async function scrapeOneReview(reviewsMainPageUrl) {
const detailsList = [];
// THIS IS THE PART WITH the issue. When I make a single Nightmare call, it works fine:
const ds = await ngjsResult(reviewsMainPageUrl);
detailsList.push(ds);
console.log('compl:', detailsList);
// But when I want to loop through the list, there is a problem. The only way
// I have found around it is chaining calls like below, but it is not dynamic:
const ds2 = await ngjsResult(reviewsMainPageUrl, 1).then(data => {
return ngjsResult(reviewsMainPageUrl, 2);
});
// a .then() would have to be appended for every item in the list
detailsList.push(ds2);
console.log('compl:', detailsList);
return detailsList;
}
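The manual .then() chaining above can be made dynamic with a plain sequential loop. A minimal sketch, assuming the `(url, index)` signature of ngjsResult: `fetchReview` is passed in as a stand-in so the loop itself stays testable, and the indices array stands in for whatever list of valid positions you end up with.

```typescript
// Sequentially scrape an arbitrary list of review indices, awaiting each
// call before starting the next, so only one Nightmare navigation is in
// flight at a time. `fetchReview` is a stand-in for ngjsResult (assumed
// signature: (url, index) => Promise of the scraped data).
async function scrapeAllSequential(
  url: string,
  indices: number[],
  fetchReview: (url: string, index: number) => Promise<unknown>,
): Promise<unknown[]> {
  const detailsList: unknown[] = [];
  for (const i of indices) {
    // await inside for..of gives exactly the "append a .then() per item"
    // behavior, but for any number of items
    detailsList.push(await fetchReview(url, i));
  }
  return detailsList;
}
```

Usage would then be `const detailsList = await scrapeAllSequential(reviewsMainPageUrl, [1, 3, 6], ngjsResult);` in place of the hand-written chain.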
Note: the ul has a list of li elements mixed in with other elements such as span and div, so the indexes do not map one-to-one to reviews.
Is there a proper way to loop over this and get the desired result?
UPDATE:
I did try a for loop like this:
for(let i=1; i<15; i++) {
const ds = await ngjsResult(reviewsMainPageUrl, i)
detailsList.push(ds)
}
but I am getting errors like
Error: navigation error
at unserializeError (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:162:13)
at EventEmitter.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:89:13)
at Object.onceWrapper (events.js:520:26)
at EventEmitter.emit (events.js:400:28)
at EventEmitter.emit (domain.js:470:12)
at ChildProcess.<anonymous> (/home/vijayakumar/Documents/Code/CHAZE/Nestjs/ap0001-ci-opn-scr/node_modules/nightmare/lib/ipc.js:49:10)
at ChildProcess.emit (events.js:400:28)
at ChildProcess.emit (domain.js:470:12)
at emit (internal/child_process.js:910:12)
at processTicksAndRejections (internal/process/task_queues.js:83:21) {
code: -3,
details: 'ERR_ABORTED',
url: 'https://www.gartner.com/reviews/market/unified-communications-as-a-service-worldwide/vendor/ringcentral/product/ringcentral-office/reviews?marketSeoName=unified-communications-as-a-service-worldwide&vendorSeoName=ringcentral&productSeoName=ringcentral-office'
}
I understand that indexes like 2 and 4 might not have a corresponding li selector (there won't be an li:nth-child(2) or li:nth-child(4)), because, as I said above, when I debugged in Chrome I could see that the ul element contained other HTML elements like span and div alongside the li elements. But the errors above occur for everything, even for valid li selectors like li:nth-child(3), 6, 7, 8, and so on.
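Since :nth-child counts every element child of the ul (spans and divs included), one way to avoid clicking positions that are not reviews is to compute the valid li indices up front and only loop over those. A minimal sketch: the tag-name array would come from the cheerio pass already in the script, e.g. something like `completeNew('<ul selector>').children().toArray().map(el => el.tagName)` (selector assumed), and the function itself is pure so it can be checked in isolation.

```typescript
// Given the tag names of a <ul>'s element children in document order,
// return the 1-based :nth-child indices that actually select an <li>.
// (CSS :nth-child counts every element child, not just <li>s, which is
// why li:nth-child(2) can fail when child 2 is a <span> or <div>.)
function validLiIndices(childTags: string[]): number[] {
  const indices: number[] = [];
  childTags.forEach((tag, i) => {
    if (tag.toLowerCase() === 'li') indices.push(i + 1); // nth-child is 1-based
  });
  return indices;
}
```

The loop would then iterate over `validLiIndices(...)` instead of a hard-coded `i < 15`.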
You can run multiple Nightmare instances, which would work faster.
Based on my experience, this problem is easier to manage if you have two types of scrapers:
1. A listing scraper: collects all base-page details and description URLs, and puts them into some storage, perhaps a database.
2. A detail scraper: on start, checks which description URLs have not yet been scraped, then runs to fetch the details, perhaps several in parallel.
This adds some overhead for maintaining records, but it pays off well because it lets you implement mechanisms like retries, a maximum article count, and date-based decisions.
To add, this mixes beautifully with Apache NiFi, where you can extend the pipeline on the go (live). Additionally, if you design it well, all vitals/stats are easily visible.
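As a sketch of the parallel detail-scraper idea, here is a minimal concurrency-limited runner. `scrapeOne` is a stand-in for whatever single-URL scrape function you use (for example, one Nightmare instance per worker); the limit and the URL list are assumptions for illustration.

```typescript
// Run scrapeOne over a list of URLs with at most `limit` in flight at once.
// Each worker pulls the next unscraped index from a shared counter, so you
// get a fixed pool of workers rather than one task per URL.
async function runWithConcurrency<T>(
  items: string[],
  limit: number,
  scrapeOne: (url: string) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // single-threaded event loop: no race on `next`
      results[i] = await scrapeOne(items[i]);
    }
  }
  // spawn min(limit, items.length) workers and wait for all of them
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Results come back in input order, which makes it easy to mark the corresponding records as scraped in storage afterwards.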