Categories
Challenge Development

Scraping a Javascript-dependent website with puppeteer

Scraping a Javascript-dependent website with puppeteer
Get eBook (PDF) at low rate to support the web development

In today’s web 2.0 many business websites utilize JavaScript to protect their content from web scraping or any other undesired bot visits. In this article we share with you the theory and practical fulfillment of how to scrape js-dependent/js-protected websites.

Table of contents

I. Prerequisites
II. Objective
III. Introduction
IV. Getting Started

V. Making a TripAdvisor Scraper

VI. Running the Scraper
VII. Numerical Results

In another post we’ve already shared on the js-driven/js-dependent websites scrape.

 

I. Prerequisites

Basic understanding of HTML, CSS, and Javascript (also Node.js). You should be familiar with the ES6 Javascript syntax below

  • array and object destructuring
  • rest/spread operators
  • string literals
  • async/await
  • Promise

II. Objective

At the end of this article, you will be able to scrape any kind of JS-driven website with great speed while you avoid limitations.

III. Introduction

  • General Introduction

As in 2017, about 94.5% of websites use Javascript and 35% of them require it. Even though web developers provide fallback to browsers that do not support Javascript, most websites today will not function unless Javascript is supported by the browser

  • Essential Difference Between Static Websites and JS-rendered ones

Web scraping started off with making network requests to the server and parsing the returned html markup to get the required data. There are a lot of modules/libraries used in making requests, and we have the ones we use in parsing the returned html markup to get the data we want from a document.

A popular Node.js module for making HTTP requests is request-promise and a common module for parsing HTML markup is cheerio. With cheerio, you are able to use jQuery syntax to extract the data you want from an HTML document.

This style of scraping websites is straight-forward, direct, fast and very performant. But a web scraping professional will not rely on this style of making web scrapers because it is close to being archaic as a lot of websites today return a very minimal HTML markup so that the data you intend to extract are not in the markup.

If you visit the same website with a Javascript-enabled web browser, the data is present. This is due to the fact that the data you want are being rendered with Javascript while the web scraper one makes cannot execute Javascript code. This article guides you through how to start building web crawlers that are able to execute Javascript code and expose an updated Document Object Model which contains data you want.

As a web bot developer, you need to up your game by learning to scrape Javascript-dependent websites effortlessly. Luckily, this guide teaches you all you need to be able to scrape any kind of website.

IV. Getting Started

  • Make sure Node.js and NPM are installed

First check if your Node is installed by running node –version (shortcut: node -v) on the terminal (also called shell or command line but for the purpose of simplicity, let’s stick to terminal throughout this article), and that npm (node package manager) is also installed by running npm –version (shortcut: npm -v). The two should output the current version of Node and npm you are running as shown below:

unless Node is not well installed on your version where you will now have to install it. Make sure you are running Node.js v12 or a later version. It is recommended to be running a stable release.

  • Create a folder and set it up for your web scraping project

A good practice is to have a folder where all your web scraping projects are stored and you should name each folder the name of the project or rather follow a naming convention for your projects. The first step is to create a folder  and navigate to it on the command line (also called terminal or shell). The folder I created for this project is  js-dependent-scrape.

Then run  npm init  in your project directory and you will have an interactive screen waiting for your inputs like below:

As you can see, the package name is having js-dependent-scraping in brackets, which is the name of the folder I ran npm init in. If your project name is the same as the folder name just hit the Enter/Return key and your folder name is used as the package name, otherwise enter the name you would like to name your project before you hit enter.

Provide the necessary inputs for which you are asked, noting that the ones in brackets are the default and hitting the Enter key auto-fills in the default. And if there are any you are unsure of just hit the Enter key to continue.

When done, you will be shown a JSON containing your entries, and you will be asked if the value is okay, with yes in brackets as the default. You can hit Enter if it is okay or type no if otherwise and redo it. You should then have a screen similar to:

You will then have a package.json file created in your directory.

  • Open your project in a IDE
    IDEs come with cool features and add-ons that make writing your codes very easy. Popular IDEs include VS Code, Net Beans, Android Studio, Eclipse, Sublime etc. But I will be using VS Code in this article. Whatever IDE you use, learn how to use it well, and if you do not already have one, download VS Code for your Operating System, install, and start using it.
  • Create necessary files

A good practice for developers to follow is having in their project folder:

  • an  index.js file which is the entry point
  • a  config.json file in which all criteria and configuration settings are saved and can be read from. It is good to have only one source of data so that they can be easily edited
  • a folder named funcs having two files – browser.js  and  scrape.js 

Such that the project directory looks like:

  • Install Puppeteer

Puppeteer is a Node.js library that provides us with an interface to drive a web browser in a way that the browser executes all in-page JS Code, fires necessary JS events and all the other things a browser does. This in turn makes the server believe it is a real user accessing their resources and allows full-access to our script.

Puppeteer is to run Chromium or Firefox in headless and non-headless modes. But as web bot developers, we run in headless mode almost all the time, the only time we run in non-headless mode is when we are debugging or just curious as non-headless makes us have visuals and see what our script is doing. You should always return to headless mode when done.

To install puppeteer version 2.1.1, run  npm install –save puppeteer@2.1.1  (shortcut: npm i -S puppeteer@2.1.1 ) from the terminal in your project directory, and allow it run completely; it will. We have chosen puppeteer 2.1.1 because of its support for keeping application data.

1. Launching a Headless Browser

We will be creating a function in the  funcs/browser.js  file, one launches a browser in headless mode and opens new pages. We will then export these functions so that other files can access it.

A simple function to launch the browser is:

const launchChrome = async () => {
  const puppeteer = require("puppeteer");

  const args = [
    "--disable-dev-shm-usage",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-accelerated-2d-canvas",
    "--disable-gpu",
    "--lang=en-US,en"
  ];  let chrome;
  try {
    chrome = await puppeteer.launch({
      headless: true, // run in headless mode
      devtools: false, // disable dev tools
      ignoreHTTPSErrors: true, // ignore https error
      args,
      ignoreDefaultArgs: ["--disable-extensions"],
    });
    return chrome;
  } catch(e) {
    console.error("Unable to launch chrome", e);
    return false;
  }
};

All the above does is launch Chrome, we need to create 2 functions inside this function such that our function returns an array of the 2 functions. The first function creates a new page and the second function exits the launched browser.

It is better to use the async/await syntax (like you see above) when the next statement or expressions or functions depend on the current one.

The newPage function:

const newPage = async () => {
  try {
    const page = await chrome.newPage();
    const closePage = async () => {
      if (!page) return;
      try {
        await page.close();
      } catch(e) {}
    }
    return [page, closePage];
  } catch(e) {
    console.error("Unable to create a new page");
    return [];
  }
};

exitChrome function:

const exitChrome = async () => {
  if (!chrome) return;
  try {
    await chrome.close();
  } catch(e) {}
}

So, the  funcs/browser.js  file looks like:

const launchChrome = async () => {
  const puppeteer = require("puppeteer");

  const args = [
    "--disable-dev-shm-usage",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-accelerated-2d-canvas",
    "--disable-gpu",
    "--lang=en-US,en"
  ];  let chrome;
  try {
    chrome = await puppeteer.launch({
      headless: true, // run in headless mode
      devtools: false, // disable dev tools
      ignoreHTTPSErrors: true, // ignore https error
      args,
      ignoreDefaultArgs: ["--disable-extensions"],
     });
  } catch(e) {
    console.error("Unable to launch chrome", e);
    // return two functions to silent errors
    return [() => {}, () => {}];
  }

  const exitChrome = async () => {
    if (!chrome) return;
    try {
      await chrome.close();
    } catch(e) {}
  }

  const newPage = async () => {
    try {
      const page = await chrome.newPage();
      const closePage = async () => {
        if (!page) return;
        try {
          await page.close();
        } catch(e) {}
      }
      return [page, closePage];
    } catch(e) {
      console.error("Unable to create a new page");
      return [];
    }
  };

  return [newPage, exitChrome];
};

module.exports = { launchChrome };
Always use try … catch in blocks that are likely to throw errors, most especially in a Promise or an async block, so that you can catch the errors as they get thrown. Not doing so would stop the execution of the script when the errors are thrown. And the thing about building a Javascript-dependent crawler in Puppeteer is that you might meet a lot of surprises and uncertainties.

In the future, not using try … catch will be deprecated in Javascript.

V. Making a TripAdvisor Scraper

A very good way to learn to scrape Javascript-dependent websites is to work on scraping a real Javascript-dependent website. This is why we have decided to walk you through making a web crawler that signs in to TripAdvisor with the login details you will provide and then scrapes the names, prices, and ratings & reviews of Hotels in London.

We will have to think of how we want our crawler to scrape London Hotels after logging in to TripAdvisor with the details we provide it. We will break down our thought process into smaller ones, this is our flow in creating the web bot.

Scraping Javascript-dependent websites becomes effortless when you have a good thought process – a good breakdown of how the scrape should go.

After manually visiting TripAdvisor and trying to look up a few London Hotels, here is an sample workflow one could come up with:

  1. Launch Chrome and open a new tab/page
  2. Visit TripAdvisor’s home page
  3. Locate and click the Sign In button
  4. Locate and click the Continue with Email button
  5. Read the provided authentication details
  6. Fill in the authentication details
  7. Check for Google Recaptcha and solve it if found. Otherwise, click the Log In button
  8. Locate and click the Hotels link
  9. Select London
  10. Extract the names, services, prices, and ratings & reviews of listed hotels
  11. Close Chrome while still holding the extracted data in memory (RAM)
  12. Save the listed hotels from memory to a json file and then a CSV file

With the above flow, I will be able to successfully scrape TripAdvisor. Let’s follow through while you open  https://www.tripadvisor.com  and developer tools in a guest browser window. Google Chrome is recommended but you could as well use Mozilla Firefox and other modern web browsers that have developer tools where you view HTML DOM tree. So you should have 3 things open, this article, your IDE, your web browser with TripAdvisor’s home page opened and its developer tools.

 

The flow above is what will make the scrapeTripAdvisor function in the  funcs/scrape.js  file, but first let’s input the TripAdvisor login credentials in our config.json file. You remember our  config.json  file is for saving info, preferences, and other variables that will be constant throughout the web bot’s running cycle? Excellent. So, we will open this file and write the TripAdvisor login credentials in there as a JSON string such that our config.json file looks like:

{

“tripAdvisorEmail”: “bot-provided-email@mail.com”,

“tripAdvisorPassword”: “trpAdvsrPss1234”

}

Feel free to prettify the JSON if your IDE supports it.

1. Launching Chrome and opening a new tab/page

First, we will be importing  launchChrome  from  funcs/browser.js  into  funcs/scrape.js  and then using them to ease things up. funcs/scrape.js would look like:

const scrapeTripAdvisor = async () => {
  // import launchChrome and newPage from the browser.js file in the same directory
  const { launchChrome } = require("./browser");

  const [newPage, exitChrome] = await launchChrome();
  const [page] = await newPage();
  if (!page) return;

  await exitChrome();
};

module.exports = scrapeTripAdvisor;

This is a really good interface to program our web crawler on – to bring to life our thoughts and imaginations.

2. Visiting TripAdvisor’s home page

Now, we use the page.goto function to visit urls. It takes two parameters, the first is the URL to visit and the second is an object specifying the options to use in determining when the url is considered opened. The object has only two options:

  • timeout – the maximum amount of time (in milliseconds) to wait for the navigation. 0 disables timeout. The default is 30000 (30 seconds) but can be changed by the page.setDefaultTimeout and page.setDefaultNavigationTimeout methods.
  • waitUntil – determines when the request has been processed and the Promise resolves. It must be any of load (which resolves when the load event has been fired), domcontentloaded (when domcontentloaded event has been fired), networkidle2 (when there are 2 less running network connections for at least 0.5 seconds), and networkidle0 (when there are no more running connections for at least 0.5seconds). The default is load.

page.goto  returns a Promise that gets rejected when the navigation to the url is considered successful depending on the waitUntil option passed.

 

Increase the timeout option when you are on a slow network connection or if you have launched Chrome with proxy and you are not sure how fast the proxy server responds. Be well aware that it might slow down the web crawler; you would however make better crawlers.

 

We will be using only the waitUntil option and we will be setting it to networkidle0 so as to make sure the page.goto function resolves when all the network connections have been settled. We will not be using the timeout option right now.

 

So, our  funcs/scrape.js  file looks like:

const scrapeTripAdvisor = async () => {
  // import launchChrome and newPage from the browser.js file in the same directory
  const { launchChrome } = require("./browser");

  // Flow 1 => Launching chrome and opening a new tab/page
  const [newPage, exitChrome] = await launchChrome();
  const [page] = await newPage();

  // exit the function if the tab is not properly opened
  if (!page) return;

  // Flow 2 => Visiting TripAdvisor's home page
  const url = "https://www.tripadvisor.com/";
  console.log("Opening " + url);
  try {
    await page.goto(url, {
      waitUntil: "networkidle0", // wait till all network requests has been processed
    });
  } catch(e) {
    console.error("Unable to visit " + url, e);
    await exitChrome(); // close chrome on error
    return; // exiting the function
  }  await exitChrome(); // close chrome
};

module.exports = scrapeTripAdvisor;

3. Clicking an element in Puppeteer

The first thing when clicking a button is to know the CSS/JS selector of the button and then click the button by its selector.

  • Knowing the selector of an element

To know the selector of an element, you should open the particular page where the element is in your own browser.

I assume you use a modern browser like Google Chrome, Firefox and the likes, if not please find out how to access the developer tools in your browser and how to inspect an HTML element in your browser.

 

So, right click on the element and click on Inspect, this opens developer tools and shows you the tag name and attributes of the HTML element; you can then use this to compute the selector of the element.

 

You might run into some situations where multiple elements are having the same selector where you would have to employ the use pseudo-classes like first-child, last-child, nth-child, first-of-type, last-of-type and nth-of-type and some others to get more specific selectors.

 

To verify you have gotten a valid selector of the element, use the Console tab of the developer tools to run:

document.querySelector("{THE SELECTED YOU CAME UP WITH}");

If it returns the HTML element in question, you have gotten a valid selector. And if otherwise, you need to retry this process.

Some nodes can only be selected by the use of pseudo-elements.

 

  • Clicking an element by its selector

We will be looking at 3 ways by which we can do this

  • Using the  page.click  function

page.click finds the element’s selector and scrolls it into view if out of view before clicking the element. It returns a Promise that either gets resolved as soon as the element has been clicked or throws an error if the element is not found in the DOM.

It accepts two parameters; the first is the element’s selector and the second is an object that specifies the click options:

    • button – specifies the mouse button to use in clicking the element. It must be a string value of one of left, right, or middle. left means left click, middle means to click with the middle button and right means right click. The default is left
    • clickCount – specifies the number of times to click the element. It must be a number. 2 means double click, 3 means triple click. The default is 1.
    • delay – specifies the time (in milliseconds) to wait between mousedown and mouseup. It must be a number. 2000 means hold the mouse button down for 2 seconds. The default is 0.

       

      Using  clickCount: 3  on an input element highlights the value of the input if any value is there such that the highlighted value gets deleted when you start typing into the input field, while on an unclickable element the innerText of the element, if any, gets highlighted/selected.
      const clickBtn = async () => {
        try {
          // the first element with btn in it class name
          const btnSelector = ".btn";
      
          // left clicks once
          await page.click(btnSelector);
          // right clicks once
          await page.click(btnSelector, { button: "right" });
          // double clicks with the left button
          await page.click(btnSelector, { clickCount: 2 });
          // double clicks with the right button
          await page.click(btnSelector, { button: "right", clickCount: 2 });
      
      
          // select the 7th element with btn in its class name
          const btn7Selector = ".btn:nth-of-type(7)";
      
          // left clicks
          await page.click(btn7Selector);
        } catch(e) {
          console.error("Unable to click button", e);
        }
      };
      
      clickBtn();
      • Using the  page.waitForSelector and then click  function

       page.waitForSelector  waits for an element to appear in the DOM.  It returns a Promise that either gets resolved as (a clickable element’s handle) as soon as the element is found in the DOM or rejected if the element is not found.

      It accepts two parameters. The first is the element’s selector and the second is an object that specifies the wait options:

        • timeout – just like the page.goto timeout option. Except that you can change the default with just the  page.setDefaultTimeout  function and not  page.setDefaultNavigationTimeout.
        • visible – make it wait till the element is visible in the DOM i.e. it does not have any of the CSS   display: none  or   visibility: hidden   declaration. It must be boolean (true or false). The default is false.
        • hidden – make it wait till the element is visible in the DOM i.e. it has either the CSS   display: none  or   visibility: hidden   declaration. It must be boolean (true or false). The default is false.

          You then click the element when  page.waitForSelector  resolves. For example:

          const clickBtn = async () => {
            const btnSelector = ".btn";
            try {
              const btn = await page.waitForSelector(btnSelector);
              await btn.click(); // left clicks once
              await btn.click({ button: "right" }); // right clicks once
              await btn.click({ clickCount: 2 }) // double clicks with left
            } catch(e) {
              console.error("Unable to click button", e);
            }
          };
          
          clickBtn();


          Unlike  page.click  that attempts clicking an element right away,  page.waitForSelector and then click  will click the element right away if it is already in the DOM, and if not, it waits for the element to appear in the DOM for the timeout you set or the default timeout.

          • Using the  page.evaluate  function

          With page.evaluate, you can run a function on the page. It is like running a function in the Console.

          It returns a Promise that resolves to what the function you passed returns or throws an error if any error occurs while trying to run the function in the page context.

          const clickBtn = async () => {
            try {
              await page.evaluate(() => {
                const btnSelector = ".btn";
                const btn = document.querySelector(btnSelector);
                btn.focus();
                btn.click();
              });
            } catch(e) {
              console.error("Unable to click button", e);
            }
          };
          clickBtn();

          A downside of page.evaluate() is that clicking is sort of restricted to using the left button, unlike the two other ways of clicking we examined above, so that you cannot double click, right click, and do some other types of clicks. But a good thing about this page.evaluate is you will be sure that the button gets clicked even if a modal is over it. Also, you can use it to easily click the  nth  type of your selector without using pseudo elements.

           

          const clickBtn = async (nth = 0) => {
          try {
          await page.evaluate(() => {
          const btnSelector = ".btn";
          const btn = document.querySelectorAll(btnSelector)[nth];
          btn.focus();
          btn.click();
          });
          } catch(e) {
          console.error("Unable to click button", e);
          }
          };clickBtn(0); // clicks the 1st occurence of the ".btn" selector
          clickBtn(10); // clicks the 11th occurence of the ".btn" selector
          clickBtn(2); // clicks the 3rd occurence of the ".btn" selector

           

          It is better to prefer the second method ( page.waitForSelector and then click ) over the others because element might not be available and this method tries to wait some milliseconds for the element to appear in the DOM before clicking it. Another thing about this method is you can decide if you want to click the element only when it is visible or hidden.

          You should use the first method ( page.click ) only when you are  down-to-earth-sure the element is present on the page with the selector you had specified.

          You should use  page.evaluate  for clicking only when you want to click the  nth  type of an element without having to use complex  pseudo-classes .

          You should prefer the first two methods over page.evaluate, because they are more human-like. page.evaluate on the other hand is fast, performant, and more bot-like.

           

          Now, we can continue with making a TripAdvisor Scraper

          4. Locating and clicking the Sign In button

          Using the guide Knowing the selector of an element, you should come up with  a[href*=”RegistrationController”]  or something similar as the Sign In button selector.

          Back on the Guest Browser, locate the Sign In button and Right click on it (to Open the context menu) and click on Inspect till the developer tools (like highlighted below) is opened.

          From the highlighted html markup like in the developer tools, you would see how the selector  a[href*=”RegistrationController”]  was computed. Note that it was the span element that was highlighted at first. I then moved to the parentElement, an anchor (A) element which looks more stable than the span.

           

          Getting selectors from html elements (markup) that look more stable would reduce your chances of having to fix the codes just because of a slight change in the markup, which is likely to happen.

           

          We should prefer the first method of Clicking an element by its selectorpage.click here because the Sign In button becomes available as soon as the DOMContentLoaded event fires and from flow 2, we have waited for all network requests to be processed using the  waitUntil: “networkidle0”  option of the page.goto function.

           

          So, the code to locate and click the Sign In button looks like the following:

          const clickSignIn = async () => {
          try {
          await page.click('a[href*="RegistrationController"]');
          } catch(e) {
          console.error("Unable to click the sign in button", e);
          }  try {
          await page.waitForNavigation({ waitUntil: "networkidle0" });
          } catch(e) {
          // console.error("could not wait for navigation");
          }
          };

           

          We have used  page.waitForNavigation  in the above code because after clicking the Sign In button on the TripAdvisor’s homepage, a loader appears on the screen. They mostly specify that some HTTP Requests are awaiting response,  page.waitForNavigation  waits for HTTP Requests to be processed. It returns a Promise that gets resolved when the HTTP Requests are processed and this depends on the waitUntil option. It accepts an object as a parameter. The object contains only two options just like the page.goto option.

          5. Locate and click the Continue with Email button

          Using the guide of knowing the selector of an element, a valid selector for the Continue with Email button is  .regEmailContinue .

          And here, let us use the second method of clicking an element by it’s selector page.waitForSelector and then click.

          Easily, our code looks like:

          const clickContinueBtn = async () => {
          try {
          const continueBtn = await page.waitForSelector(".regEmailContinue");
          await continueBtn.click();
          } catch(e) {
          console.error("Unable to click the Continue with Email button", e);
          }
          };clickContinueBtn();

          6. Reading the provided authentication details

          Remember we saved the authentication details in the config.json file? We will be using the Node.js require function to read the authentication details.

          const { tripAdvisorEmail, tripAdvisorPassword } = require(“../config”);

           

          That is how we read tripAdvisorEmail and tripAdvisorPassword into the memory from our config file.

          7. Writing to an Input Field

          There are 3 methods of writing to an input field, similar to I

          • Using the  page.type  function

          page.type accepts three parameters – the element’s selector, the text to be typed into the selected element, and an object of one option

          delay – specifies the time to wait in milliseconds between pressing each keys

          It returns a Promise that either gets resolved when the text has been typed into the element or throws an error if the element is not in the DOM.

          const writeToInput = async () => {
          try {
          const inputSelector = "#input-field";
          // type instantly
          await page.type(inputSelector, "Good");
          // type a bit slowly like a normal user that types fast
          await page.type(inputSelector, "luck", { delay: 50 });
          } catch(e) {
          console.error("Unable to write to the input field", e);
          }
          };
          writeToInput();
          

           

          • Using the  page.waitForSelector and then type  into the element

          page.waitForSelector waits for the element to be in the DOM and then returns a Promise that resolves to the element handle or throws an error if element is not found. You can use the timeout, visible, and hidden options as well

          And when the element’s handle is returned, type into it.

          const writeToInput = async () => {
          try {
          const inputSelector = "#input-field";
          const input = await page.waitForSelector(inputSelector, { visible: true }); // write only if the input element is visible    // type Yes into the input field
          await input.type("Yes");// highlight/select the content of the input field => "Yes"
          await input.click({ clickCount: 3 });
          // type No
          await input.type("No");// "No" overwrites "Yes" because "Yes" was selected/highlighted using { clickCount: 3 } before typing "No"
          } catch(e) {
          console.error("Unable to write to input", e);
          }
          };
          writeToInput();
          

           

          • Using the  page.evaluate  function

          With page.evaluate, you can pass a function that gets the element by its selector using document.getElementById, document.querySelector, and the likes. You then write into element by settings its value property to the text you want to type:

          const writeToInput = async () => {
          try {
          await page.evaluate(() => {
          const inputElement = document.querySelector("#input-field");
          inputElement.value = "Goodluck!!!";
          });
          } catch(e) {
          console.error("Unable to write to input", e);
          }
          };
          writeToInput();
          

           

          If an input element is focused, you can use page.keyboard.type to type into it. It takes only two arguments, the first is the text you want to write and the second is an option to specify the delay in milliseconds.

          8. Filling in the authentication details

          We want to write the tripAdvisorEmail and the tripAdvisorPassword into the email input field and the password input field respectively. I would be using the second method of Writing to an Input Field here.

           

          const fillInAuthDetails = async () => {
          try {
          const emailField = await page.waitForSelector("#regSignIn.email");
          await emailField.click({ clickCount: 3 });
          await emailField.type(tripAdvisorEmail, { delay: 75 });
          } catch(e) {
          console.error("Unable to write to the email field", e);
          }
          try {
          const passField = await page.waitForSelector("#regSignIn.password");
          await passField.click({ clickCount: 3 });
          await passField.type(tripAdvisorPassword, { delay: 75 });
          } catch(e) {
          console.error("Unable to write to the password field", e);
          }
          };
          fillInAuthDetails();
          

           

          Using the delay option on page.type and elementHandle.type (elementHandle returned from page.waitForSelector) simulates a real user typing experience and fires some keydown, keyup, and keypress events. This gives you some edge and makes you look like a real user. Prefer it over not using delay which types instantly which looks more like a bot to a server.

          9. Check for Google Recaptcha and solve it if found

          Trip Advisor uses Google Recaptcha, but Google Recaptcha uses a kind of algorithm to determine whether to show the captcha or not. This is why you have to first check if Google Recaptcha is present

          We use page.waitForSelector to check if an element is present on a page

          const checkForGRecaptcha = async () => {
          try {
          const recaptchaSelector = ".recaptcha-checkbox"
          await page.waitForSelector(recaptchaSelector);
          console.log("Recaptcha found in page");
          await solveCaptcha();
          } catch(e) {
          console.log("Google Recaptcha not found on the DOM");
          // click the Log in button to continue
          }
          };
          checkForGRecaptcha();
          

           

          Feel free to use options on the waitForSelector for best specific finds

           

          • Solving Captchas

          There are third parties like 2Captcha, Anti Captcha, Death by Captcha. These third parties basically provide a way to send them the information of the captcha challenge you are being faced with and get a solution that can be used to bypass the captcha at an affordable rate. Rates vary by the type of captcha challenge you are trying to solve. They support many forms of captchas, not just Google Recaptcha. It seems like Google Recaptcha is the costliest.

           

          All you need to be able to access their service is Sign up with them, get an API key, and fund your accounts so that the API keys are able to solve challenges. Solving challenges is as simple as sending HTTP Requests to certain API endpoints you will be provided with and getting the response.

           

          We have used 2Captcha well and I trust their service, there is a certain endpoint for initiating a captcha solving task, this endpoint returns the task id which you will use in sending another HTTP request to another endpoint that gives you the captcha solution. For Google Recaptcha, you are required to wait for 20 seconds before calling the endpoint that returns the solutions or CAPTCHA_NOT_READY where you will have to wait for another couple of seconds (20) before calling the endpoint again.

          You will need to create an account to get started. 2Captcha API Documentation is easy to understand and integrate.

          You can use axios, request-promise, or some other Javascript module to make the HTTP requests to the API endpoints after installing them into your project.

          The execution of the script needs to be paused till the captcha solution arrives. You can achieve this by creating a function that returns a Promise, such that the Promise resolves to the solution of the captcha when it arrives. Then you use the async/await syntax to pause execution until the function resolves.

          See an example of how we use the axios module to initiate a captcha solving task, and then get the solution with 2Captcha and then wait for the solution below:

          const axios = require("axios");
          
          const { _2CaptchaAPIKey } = require("../config");
          
          const initCaptchaSolvingTask = (siteKey, url) => {
          return new Promise((resolve, reject) => {
          const url = `https://2captcha.com/in.php?key=${_2CaptchaAPIKey}&method=userrecaptcha&googlekey=${siteKey}&pageurl=${url}invisible=1&soft_id=9120555`;
          return axios
          .get(url, { responseType: "text" })
          .then(({ data }) => resolve(data))
          .catch((e) => {
          console.error("Unable to initiate captcha solving task", e);
          reject(e);
          });
          });
          };
          
          let solveRetryTimes = 0;
          const maxRetryTimes = 5;
          
          const getCaptchaSolution = () => {
          const url = `https://2captcha.com/res.php?key=${_2CaptchaAPIKey}&action=get&id=${initCaptcha}`;
          const waitTime = 20000; // 20 seconds
          
          return new Promise((resolve, reject) => {
          setTimeout(() => {
          axios
          .get(url, { responseType: "text" })
          .then(({ data }) => {
          if (!data || /^CAPT?CHA_NOT_READY$/i.test(data)) {
          console.error(`Captcha not yet solved => ${data}`);
          if (solveRetryTimes > maxRetryTimes) {
          reject(data);
          return;
          }
          console.log("Retrying...");
          solveRetryTimes++;
          return getCaptchaSolution()
          .then((solution) => resolve(solution))
          .catch((e) => console.error(e));
          }
          console.log("Captcha solved " + data);
          resolve(data);
          })
          .catch((e) => {
          console.error("Error getting solved captcha", e);
          if (solveRetryTimes > maxRetryTimes) {
          reject(e);
          return;
          }
          solveRetryTimes++;
          return getCaptchaSolution()
          .then((solution) => resolve(solution))
          .catch((e) => console.error(e));
          });
          }, waitTime);
          });
          };
          

           

          All you need to do is obtain the sitekey and other parameters. Refer to the 2Captcha API Documentation for the best approach on how to obtain it. You will be using the page.evaluate function to obtain the needed parameters.

          The code then looks like:

          const isOk = (val) => {
          const isOk = val.indexOf("OK") === 0;
          return isOk;
          };const solveCaptcha = async () => {
          try {
          let initTaskId = await initCaptchaSolvingTask(siteKey, url);
          console.log("2Captcha Task ID " + initTaskId);
          if (!isOk(initTaskId)) return;// remove "OK|" at the start of the initTaskId
          initTaskId = await initTaskId.substr(3);let captchaSolution = await getCaptchaSolution();
          console.log("only a moment to go...");// remove "OK|" at the start of the captchaSolution
          captchaSolution = await captchaSolution.substr(3);// 2Captcha provides you with a submit to use the captchaSolution and submit the form to sign in. Bear in mind that you will be using the page.evaluate function here
          
          } catch(e) {
          console.error("Unable to solve captcha and login", e);
          }
          }; 
          

           

          Sign in is now successful. We could avoid signing in every time and start using the auto login feature by saving cookies, sessions, local and session storage, caches, and other application data to make things faster and feel more like a browser to the server. To do this, we will do the following:

          • create a folder for the application data in our project directory, let’s name it chrome-data.
          • be modifying our funcs/browser.js file by adding –user-data-dir which points to the chrome data directory.
          const chromeDataDir = require("path").join(__dirname, "../chrome-data");
          // `--user-data-dir=${chromeDataDir}`
          // make sure chromeDataDir exists
          (() => {
          const { existsSync, mkdirSync } = require("fs");
          if (!existsSync(chromeDataDir)) mkdirSync(chromeDataDir);
          })();
          

           

          If you are using Git versioning control be sure to put  chrome-data/  on a new line in the  .gitignore  file for git to ignore our chrome-data directory. It’s personal though.

          Our  funcs/browser.js  now looks like:

          const launchChrome = async () => {
          const puppeteer = require("puppeteer");  const chromeDataDir = require("path").join(__dirname, "../chrome-data");// make sure chromeDataDir exists
          (() => {
          const { existsSync, mkdirSync } = require("fs");
          if (!existsSync(chromeDataDir)) mkdirSync(chromeDataDir);
          })();const args = [
          "--disable-dev-shm-usage",
          "--no-sandbox",
          "--disable-setuid-sandbox",
          "--disable-accelerated-2d-canvas",
          "--disable-gpu",
          "--lang=en-US,en",
          `--user-data-dir=${chromeDataDir}`
          ];  let chrome;
          try {
          chrome = await puppeteer.launch({
          headless: true, // run in headless mode
          devtools: false, // disable dev tools
          ignoreHTTPSErrors: true, // ignore https error
          args,
          ignoreDefaultArgs: ["--disable-extensions"],
          });
          } catch(e) {
          console.error("Unable to launch chrome", e);
          // return two functions to silent errors
          return [() => {}, () => {}];
          }const exitChrome = async () => {
          if (!chrome) return;
          try {
          await chrome.close();
          } catch(e) {}
          }const newPage = async () => {
          try {
          const page = await chrome.newPage();
          const closePage = async () => {
          if (!page) return;
          try {
          await page.close();
          } catch(e) {}
          }
          return [page, closePage];
          } catch(e) {
          console.error("Unable to create a new page");
          return [];
          }
          };
          
          return [newPage, exitChrome];
          };
          
          module.exports = { launchChrome };
          

           

          We can then continue to extract London Hotels from TripAdvisor.

          10. Locate and Click the Hotels Link

          From the guide of Knowing the selector of an element, a good selector for the Hotels link would be  a[href^=”/Hotels”] from the below screenshot:

          Another good selector is a[title*=”Hotels”]  but the title attribute is more likely to change. But href would last longer than title attr. in the sense that href leads to a valid resource on the server unless it is a dummy link.

          Going with a[href^=”/Hotels”] , we should still verify that the selector is well computed by typing document.querySelector(‘a[href^=”/Hotels”]’)  in the console followed by the Enter/Return key. We do this by switching to the Console tab in the developer’s tools like below:

          If the selector is not valid, null is returned in the console. Otherwise, an html markup of the element is returned.

           

          Then, we click using  page.evaluate and the click  like. We use this because after carefully examining TripAdvisor, a modal might be on this link.

          const clickingHotelsLink = async () => {
          try {
          await page.click('a[href^="/Hotels"]');
          } catch(e) {
          console.error("Unable to click the hotels link", e);
          }
          };
          clickingHotelsLink();
          

           

          Knowing your target website gives you a reliable and unstoppable way of scraping them effectively.

           

          After clicking on Hotels, an input element with the title Where to? get focused on. So, we want to type “London” into it. It is good to always wait a few seconds when typing into an input element that autofocuses on page load

          const pageType = async (txt) => {
          try {    await page.waitFor(2500); // wait a bit before typing
          await page.keyboard.type(txt);
          } catch(e) {
          console.error(`Unable to type ${txt}`, e);
          }
          };
          pageType("London");
          // we made pageType re-usable by making it type the first argument passed to it
          

           

          After this we want to click on London (England, United Kingdom) the first result whose selector is form[action=”/Search”] a[href*=”London”] . We do that with page.waitForSelector and then click  because it might take a few seconds for the search result to appear.

          const clickSearchResult = async (n) => {
          try {
          const firstResult = await page.waitForSelector(`form[action="/Search"] a[href*="London"]:nth-of-type(${n})`);
          await firstResult.click();
          try {
          await page.waitForNavigation({ waitUntil: "networkidle0" });
          } catch(e) {}
          } catch(e) {    console.error(`Unable to click result ${n}`);
          }
          };
          clickSearchResult(1); 
          

          Now, we can use page.evaluate to extract the names, services, prices, and ratings & reviews of listed hotels.

          page.evaluate is just like the browser’s console. You can run Javascript codes inside it and have it return what you need. We will harness this to get the names, prices, and ratings and reviews of the listed hotels.

           

          There are 31 hotels listings on the page. We will need the most common selector of the listings. Use the mouse icon highlighted below by clicking on it and the element you want to select. Check the parentElement and ancestors till the nextElement points to another listing.

          We will be using  .listItem  as the selector of each listing because it is common and there are 31 of them.

          We then extract the listings like below:

          const extractHotelsInfo = async () => {
          try {
          const hotelsInfo = await page.evaluate(() => {
          let extractedListings = [];
          const listings = document.querySelectorAll(".listItem");
          const listingsLen = listings.length;      for (let i = 0; i < listingsLen; i++) {
          try {
          const listing = listings[i];
          const name = listing.querySelector(".listing_title a").innerText;
          const services = (listing.querySelector('.icons_list').innerText || "").replace("\n", ", ");
          const price = listing.querySelector(".price").innerText;
          const ratings = listing.querySelector('.ui_bubble_rating').getAttribute('alt');
          const reviews = listing.querySelector(".review_count").innerText;extractedListings.push({ name, services, price, ratings, reviews });
          } catch (e) {}
          }return extractedListings;
          });    // do anything with hotelsInfo    console.log("Hotels Info", hotelsInfo);
          } catch(e) {
          console.error("Unable to extract hotels listings", e);
          }
          };
          extractHotelsInfo(); 
          

          11. Exiting Chrome

          We are now done with the scraping. We simply invoke exitChrome to shut the browser down.

          exitChrome();

           

          We need to exit Chrome or the process will not stop unless it crashes.

          12. Saving the listed hotels from memory

          There are different ways and formats to save the extracted data. But we will be exploring 2 popular options. It is good to create a folder for the extracted data. Let’s create extracted-data in our project directory and we shall be saving the extracted data there. This makes our project more structured and organized

           

          • Saving to JSON files

           

          We will be using the inbuilt fs module of Node.js to write to files. And we need to use the JSON.stringify function to convert our Javascript objects into a JSON string. We then write the JSON string to our (.json) file

          const saveAsJson = (data) => {
          const fileName = "extracted-data/hotels.json";
          const data = JSON.stringify(data);  const { writeFile } = require("fs");
          return writeFile(fileName, data, (err) => {
          if (err) return console.error("Unable to save json to file");
          console.log("Data successfully saved to file in json");
          });
          };
          saveAsJson(hotelsInfo);  
          

           

          • Saving to CSV files

           

          CSV is a nice format to present extracted data to users because the files can be opened with Spreadsheet viewing and/or editing softwares like Microsoft Excel, Google Spreadsheet and the likes. Unlike JSON which is more friendly to developers than to common users.

          There are various Javascript modules to use in writing to JSON files. Anyone could come up with a new one at any time and host it on NPM. But we will be using the objects-to-csv module to do this. This module makes writing Javascript objects to CSV a lot easier:

          const saveAsCsv = async (data) => {
          const fileName = "extracted-data/hotels.csv";  const ObjectsToCsv = require("objects-to-csv");
          const csv = new ObjectsToCsv(data);
          try {
          await csv.toDisk(fileName);
          console.log("Data successfully saved to file in CSV");
          } catch(e) {
          console.error("Unable to save csv to file", e);
          }
          };
          saveAsCsv();
          

           

          The objects-to-csv module needs to be installed. You can install it by running  npm install –save objects-to-csv  on the terminal.

          13. Finishing up

          We edit our  index.js  file so that it runs the scrapeTripAdvisor function after importing it from funcs/scrape.js. Then, our index.js file looks like:

          const scrapeTripAdvisor = require("./funcs/scrape");
          scrapeTripAdvisor();
          

           

          We then edit the “script” section of our  package.json  file to have “start” point to  index.js . Remove “test” unless you have set up tests for this project.  package.json  should look like:

          “scripts”: {
          “start”: “node index”
          }

          VI. Running the Scraper

          Running the scraper is as easy as navigating to your project directory from the terminal and running  npm start . npm start runs  node index – the command we specify in the script section of  package.json

          You could as well run  node index  directly in a terminal/console.

          VII. Numerical Results

          Scrape speed comparison table

          Chromium headless instance* HTTP requests
          Setup time, ms 45000 
          Log-in time, ms 105000 13 
          1 page load time, ms 6 10 

          *based on TripAdvisor scrape

          Regular scraping without launching a Chromium instance seems slower to scrape the pages because it mostly involves making HTTP requests that download the whole HTML document. While headless scraping could make just AJAX (XmlHttp) requests to the server to get just the needed data (hotels in this context) without having to download other unnecessary data.

          The scraper ran for approx. 3 mins. To scrape 30 hotel listings from the first page of TripAdvisor London hotels listings, it took approx. 3.55 mins and 193MB of RAM on my local computer with (4.00GB of RAM and 1.30GHz CPU).

          While for scraping the same amount of data, it ran for approx. 2.36 mins and 225MB of RAM on a remote server (4.00GB of RAM and 2.40GHz CPU). Particular results depend on the available CPU, memory, and network speed.

          Avg. time to launch the headless Chrome instance and sign in is 2.50 mins, and the average scraping speed to scrape the 30 hotel listings off each page is approx. 1 page / 6 ms.

          While a regular scraping without launching a headless browser would take an average of 18 ms to sign in, and the average speed would be 1 page / 10 ms.

           

          VIII. Conclusion

          You can always refer to this scraping guide when having any difficulties in a scraping process. If things get too complicated, you can run the browser with  headless: false  to ease things up because you will be able to see what is going on and debugging will be easier. Be prepared to use proxies to hide your public IP, rotating proxies will be needed if the target website bans IP Address. Having mastered these techniques you will  be able to approach any kind of web scraping the modern and smart way.

          You can clone this example TripAdvisor project code from GitHub . And you can download it as zip. Be sure to download the dependencies by running npm install  before running the scraper with  npm start.

          Leave a Reply

          Your email address will not be published. Required fields are marked *

          This site uses Akismet to reduce spam. Learn how your comment data is processed.