
Fast scrape of a simple website using Node.js, Apify & Cheerio scraper

We recently composed a scraper that extracts data from a static site. By a static site we mean one that does not use JS scripting to load or transform on-page data.

If you are interested in scraping a JS-rendered site, please read the following: Scraping a Javascript-dependent website with puppeteer.

Technology stack

  1. Node.js, the server-side JS environment. The main characteristic of Node.js is asynchronous code execution.
  2. Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let’s highlight its key features:
  • automatically scales a pool of headless Chrome/Puppeteer instances
  • maintains queues of URLs to crawl (handled, pending), which makes it possible to recover from crawler failures and resume crawls
  • saves crawl results to a convenient JSON dataset (local or in the cloud)
  • allows proxy rotation

We’ll use Apify’s CheerioCrawler to crawl the target site and extract its data. The target is https://www.ebinger-gmbh.com/.

Get category links – initial URLs for the crawler

First, we find all the category links on the website. For that we used Scraper, a Google Chrome extension:

This provides links based on an XPath expression. Edit the XPath (step 1 in the screenshot) so that all possible categories are included.

Now we have a categories file, categories.txt.
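
For illustration, categories.txt holds one relative category path per line; the queue-filling code below prepends base_url to each line. The paths here are made up, the real ones come from the XPath export above:

/kategorie/example-category-1.html
/kategorie/example-category-2.html
/kategorie/example-category-3.html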

Cheerio Crawler

Let’s copy a Cheerio Crawler from the Apify official site (do not forget npm i apify --save). As usual, we init a Node project in a new folder and create an index.js file. Now we’ll customize the crawler and highlight its important features.
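
For orientation, here is a minimal sketch of how the pieces fit together, based on the CheerioCrawler example from the Apify SDK documentation; the names base_url, total_data, requestQueue and dataset match the snippets below, everything else is an assumption rather than the exact code from the repo:

const Apify = require('apify');

const base_url = 'https://www.ebinger-gmbh.com';
let total_data = []; // custom array used for the CSV export later

Apify.main(async () => {
	// queue of URLs to crawl and dataset for results
	// (stored locally under ./apify_storage)
	const requestQueue = await Apify.openRequestQueue();
	const dataset = await Apify.openDataset();

	// ... fill requestQueue with the category URLs here (next section) ...

	const crawler = new Apify.CheerioCrawler({
		requestQueue,
		// called for every fetched page; $ is the Cheerio handle for the page HTML
		handlePageFunction: async ({ request, $ }) => {
			// crawling logic goes here (see "Main crawler function")
		},
	});

	await crawler.run();
	// ... write total_data to CSV here (last section) ...
});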

Init the crawler queue with category URLs

The following code uses the line-reader package (npm i line-reader --save) to insert the category links into the crawler request queue:

const lineReader = require('line-reader');

// Read categories.txt line by line and enqueue each category URL.
// requestQueue is the Apify request queue and base_url the site root,
// both defined earlier in index.js.
lineReader.eachLine('categories.txt', async function(line) {
	let url = base_url + line.trim();
	await requestQueue.addRequest({ url: url });
});

Main crawler function

handlePageFunction() is the main crawler function in the index.js file. It is called once for each page fetched by the crawler and performs the crawling logic for that page.

Discern between (1) category page and (2) product page

Since we store both kinds of URLs (category pages and product pages) in the same queue, we need an if/else when processing each request inside the crawler:

// detail (product) page or category page?
if (request.url.includes("/id-")) { ... }
  1. From category pages we retrieve pagination links and links to single product pages:
// add pagination (navigation) links to the request queue;
// forefront: true puts them at the head of the queue so category pages are crawled first
$('.link').each((index, el) => {
	requestQueue.addRequest({ url: base_url + $(el).attr('href').trim(), forefront: true });
});
// add product page links
$('h2 a').each((index, el) => {
	requestQueue.addRequest({ url: base_url + $(el).attr('href').trim() });
});
  2. On a product page we retrieve the product data and push it both into the Apify dataset and into the custom array:
var images = [];
$('div.thumbnails figure.image_container a').each((index, el) => {
	images.push(base_url + $(el).attr('href').trim());
});
// Store the results to the default dataset. In local configuration,
// the data will be stored as JSON files in ./apify_storage/datasets/default
let item = {
	url: request.url
	, images: images
	, name: $('div.details h1').text()
	, sku: $('div.details div.sku').text().split(' ')[1]
	, weight: $('div.details div.weight').text().split('Versandgewicht:')[1].trim()
	, short_descr: $('div.details div.teaser').text()
	, description: $('div.details div.description').text().replace(/[\n\t\r]/g, " ").replace(/\s+/g, " ").split('Beschreibung')[1].trim()
	, price: $('div.details div.price div.sum').text().replace(/[\n\t\r]/g, " ").replace(/\s+/g, " ").trim()
	, info: $('div.details div.price div.info').text().replace(/[\n\t\r]/g, " ").replace(/\s+/g, " ").trim()
	, delivery_time: $('div.price div:nth-of-type(3)').text().split('Lieferzeit:')[1].trim()
};
await dataset.pushData(item);
total_data.push(item);
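
Note that the split(...)[1].trim() chains above assume that every block (SKU, weight, description, delivery time) is present on each product page; if one is missing, handlePageFunction throws for that page. A small guard, a hypothetical helper that is not in the original code, would make the extraction more forgiving:

// Hypothetical helper (not in the original repo): returns '' instead of throwing
// when the selector matches nothing or the label text is absent.
function afterLabel($, selector, label) {
	const parts = $(selector).text().split(label);
	return parts.length > 1 ? parts[1].trim() : '';
}

// usage, e.g.: weight: afterLabel($, 'div.details div.weight', 'Versandgewicht:')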

Save results into a CSV file

Since Apify stores data locally as JSON and we need to deliver the data as CSV, we’ll use the csv-writer package. The array into which we push the data is total_data.

Storing data in the total_data array and writing it to CSV later eliminates the need for Apify’s own dataset, but we have still left the dataset in the crawler code.
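
Here is a minimal sketch of the CSV export with csv-writer (npm i csv-writer --save); the output file name and the column set are assumptions, the real repo may map the fields differently:

const createCsvWriter = require('csv-writer').createObjectCsvWriter;

// Hypothetical export step: runs after crawler.run(), inside the async main function.
const csvWriter = createCsvWriter({
	path: 'products.csv',
	header: [
		{ id: 'url', title: 'URL' },
		{ id: 'name', title: 'Name' },
		{ id: 'sku', title: 'SKU' },
		{ id: 'price', title: 'Price' },
		{ id: 'weight', title: 'Weight' },
		{ id: 'delivery_time', title: 'Delivery time' }
	]
});
await csvWriter.writeRecords(total_data);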

Execution result and code repo

The result (~1500 entries) was fetched in a local run, with asynchronous requests, within a minute. The code repo is here: https://github.com/igorsavinkin/ebinger.

Conclusion

The Node.js asynchronous runtime environment ushers in a new era of web development: no threads are needed, and parallel asynchronous runs do an excellent job.
