
Web Scraping with Node.js

Web scraping has been growing in popularity for years, and freelance sites are full of jobs involving this somewhat controversial data extraction process. Today we will look at an elegant and modern way to scrape data from websites with Node.js!

First, a few words about the technology in use. Node.js is a cross-platform server-side runtime built on the V8 JavaScript engine. Its two main benefits are:

  • Using JavaScript on the back-end
  • Asynchronous programming – when thousands of users are connected to the server simultaneously, Node.js handles them asynchronously: it prioritizes requests and distributes resources more rationally, as the short sketch below illustrates.
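
To make the second point concrete, here is a minimal sketch (not part of the tutorial project; the file name data.txt is just a placeholder) showing Node.js non-blocking I/O – the program keeps running while the file is read in the background:

var fs = require('fs');

// fs.readFile does not block: Node.js registers the callback and moves on
fs.readFile('data.txt', 'utf8', function(err, data){
  if(err){
    return console.log('Could not read the file:', err.message);
  }
  console.log('File contents:', data);
});

// this line runs immediately, before the file has been read
console.log('Reading the file in the background...');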

Node.js is commonly used for building APIs; it is also very convenient for desktop and mobile apps and, notably, for IoT. The deeper you study it, the more clearly you will see that it is a big part of the future of back-end technologies.

If you don’t know anything about Node.js, a basic understanding of JavaScript and callback functions will be enough; the more complex code is explained as we go.

Modules

Let’s start with an overview of our project. What do we need first? The Node.js ecosystem has a lot of useful modules that help you work faster. We will use these:

  • Express: A Node.js framework for building APIs for web and mobile apps.
  • fs: The file system module (built into Node.js). We will use it to write the results to a file.
  • request: This module provides a simple way to make HTTP calls.
  • cheerio: This lets us use jQuery-like syntax to parse HTML.

Now we will create the project and go through a few installation steps.

Building a project

To use Node.js you first need to download and install it. The installation process is very simple, and once it completes you can start using Node.js right away. We will talk about running the project a bit later; for now, let’s create the project and install the needed modules.

Building the project is as easy as the installation:

  1. Create a folder
  2. Inside the folder, create a file named package.json
  3. Open this file and paste the following into it:
    {
      "name"         : "scrape",
      "version"      : "1.0.0",
      "description"  : "web scraping tutorial",
      "main"         : "server.js",
      "author"       : "Scraping.pro",
      "dependencies" : {
        "express"    : "latest",
        "request"    : "latest",
        "cheerio"    : "latest"
      }
    }

    The package.json file holds the basic project information: the project name, version, description, main file and author. The dependencies section lists all the modules (and their versions – here simply latest) that will be used in the project.

  4. Now we are going to use the command line, but first we should write some code. Create a server.js file and put the following into it:
    console.log("Hello!");

    Open a terminal, navigate to your project folder and run the command node server – it will print our message in the console.

The basic configuration is done. Now we should install the modules listed in the package.json file. The command

npm install

will download them into our project.

Scraping data with an API

So, we have checked that the project works and have downloaded the modules. Let’s try to scrape some info. In the first example, we will get data about users on github.com. Fortunately, GitHub has its own open API. We will create a script that loads data about any given GitHub user. For the test, we will get info about the creator of Linux and Git – Linus Torvalds.

// here we initialize our modules - we will then work with them as objects
var express = require('express');
var fs      = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();
// This is how routes are created in Node.js + Express. The first parameter is the route;
// the second is a callback function that receives two parameters - the request and the response.
// The data we want to send back to the browser goes through the response (res) variable.
app.get('/scrape/users', function(req, res){
  // Do you have a GitHub account? Enter your login here!
  var user = 'torvalds';
  var url = 'https://api.github.com/users/' + user;
  // making the request with headers, the url and a callback function
  // (the GitHub API requires a User-Agent header)
  request({headers: {'user-agent': 'node.js'}, url: url}, function(error, response, html){
    if(error){
      return res.send('Request failed: ' + error.message);
    }
    // we split the response body on "," to make the output more readable
    var result = response.body.split(',');
    // Here we write our result into a file. The first parameter is the file name;
    // the second is the data (JSON.stringify with an indent of 4 puts each entry on its own line);
    // the third is a callback that reports the status of the write
    fs.writeFile('output.json', JSON.stringify(result, null, 4), function(err){
      if(err){
        return console.log('Error writing file:', err.message);
      }
      console.log('File successfully written!');
    });
    res.send('Check your console!');
  });
});
app.listen('8081'); // creating the listener - our project will be accessible at http://localhost:8081

Now let's visit http://localhost:8081/scrape/users. We should see the message generated by res.send('Check your console!'):

Everything is alright.

What have we got in the command line?

C:\Users\User\Documents\RnD\scrape-node-js>node server
File successfully written!

Awesome! Finally, we should check our file output.json (it is created automatically in your project folder):

[
 "{login:torvalds",
 "id:1024025",
 "avatar_url:https://avatars0.githubusercontent.com/u/1024025?v=4",
 "gravatar_id:",
 "url:https://api.github.com/users/torvalds",
 "html_url:https://github.com/torvalds",
 "followers_url:https://api.github.com/users/torvalds/followers",
 "following_url:https://api.github.com/users/torvalds/following{/other_user}",
 "gists_url:https://api.github.com/users/torvalds/gists{/gist_id}",
 "starred_url:https://api.github.com/users/torvalds/starred{/owner}{/repo}",
 "subscriptions_url:https://api.github.com/users/torvalds/subscriptions",
 "organizations_url:https://api.github.com/users/torvalds/orgs",
 "repos_url:https://api.github.com/users/torvalds/repos",
 "events_url:https://api.github.com/users/torvalds/events{/privacy}",
 "received_events_url:https://api.github.com/users/torvalds/received_events",
 "type:User",
 "site_admin:false",
 "name:Linus Torvalds",
 "company:Linux Foundation",
 "blog:",
 "location:Portland",
 " OR",
 "email:null",
 "hireable:null",
 "bio:null",
 "public_repos:6",
 "public_gists:0",
 "followers:72088",
 "following:0",
 "created_at:2011-09-03T15:26:22Z",
 "updated_at:2017-11-14T16:54:03Z}"
]

Quite simple, isn’t it? Some parts might seem complex at first, but after an hour of practice you will be writing this kind of script like an expert.
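
By the way, splitting the response body on commas is only done to make the raw output easier to read. Since the GitHub API returns JSON, a cleaner option (a small sketch of my own, not part of the original script; the variable name profile is arbitrary) is to parse the body and work with it as an object:

// inside the request callback, instead of splitting the body:
var profile = JSON.parse(response.body);  // turn the JSON string into a plain object
console.log(profile.name);                // "Linus Torvalds"
console.log(profile.followers);           // 72088 in the output above
// write the parsed object back out as nicely formatted JSON
fs.writeFile('output.json', JSON.stringify(profile, null, 4), function(err){
  if(err){
    return console.log('Error writing file:', err.message);
  }
  console.log('File successfully written!');
});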

Scraping the website content

Now we will scrape text content directly from a website. To practice our new skills we will work with a typical blog, greencly.com, and collect each article title from its main page.

The first thing we need to do is "make friends" with the site's HTML code.

Go to https://greencly.com/, open the developer tools (F12) and investigate the source code. As an example, we want to get all the article titles from the main page.

We see that every title is wrapped in <h1 class="entry-title"></h1>. To extract them we will use the previously mentioned cheerio package.

The code is similar to our script above:

var express = require('express');
var fs      = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();
app.get('/scrape/content', function(req, res){
  // Let's scrape the Greencly blog
  var url = 'http://greencly.com/en';
  var arr = [];
  request(url, function(error, response, html){
    if(error){
      return res.send('Request failed: ' + error.message);
    }
    var $ = cheerio.load(html);
    // Just like in jQuery: we select the elements with the "entry-title" class
    // and collect their text in a loop
    $('.entry-title').each(function(i, elem){
      arr[i] = $(this).text().trim();
    });
    fs.writeFile('content.json', JSON.stringify(arr, null, 4), function(err){
      if(err){
        return console.log('Error writing file:', err.message);
      }
      console.log('File successfully written! Check your project directory for the content.json file');
    });
    res.send('Check your console!');
  });
});
app.listen('8081');
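
If you also want the link to each article next to its title, cheerio's jQuery-like attr() method works the same way. Here is a small sketch (my own addition; it assumes each <h1 class="entry-title"> contains an <a> tag, which is common for blog themes but not verified here):

// inside the request callback: collect the title text and the article URL for each heading
var articles = [];
$('.entry-title').each(function(i, elem){
  articles[i] = {
    title: $(this).text().trim(),
    link:  $(this).find('a').attr('href')  // href of the link inside the heading, if present
  };
});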

And here is our output:

[
    "Dive into ICO: grabbing with the most popular topic nowadays!",
    "Advanced Cash - reliable and convenient payment system for everyone!",
    "Growing Сrystal –  project is closed",
    "Primex – project is closed",
    "FOREVER MONEY LIMITED – project is closed"
]

Scraping with Node.js is almost an art, isn’t it?

Conclusion

Web scraping is an engaging experience. We strongly recommend that you dig deeper into this topic and explore the other amazing things you can do with scraping in Node.js, but remember – use your new knowledge only for legal purposes.

To become a guru in Node.js scraping, we recommend that you read the following articles in addition to this post:

Node.js, Puppeteer and Apify for scraping business directory

Useful Node.js tutorials

Cheerio
