
Node.js, MariaDB, save data & bulk save

Install the mariadb package:

npm i mariadb

The code

const config = require("./config");
const db = config.database;
const mariadb = require('mariadb');

const pool = mariadb.createPool({
  host: db.host,
  user: db.user,
  password: db.password,
  database: db.database,
  connectionLimit: 5
});

async function asyncSaveDataDB(data) {
  let conn;
  try {
    conn = await pool.getConnection();
    const rows = await conn.query("SELECT 1 as val");
    console.log(rows); // [ {val: 1}, meta: ... ]
    const res = await conn.query("INSERT INTO test (string1) values (?)", [data]);
    console.log(res); // { affectedRows: 1, insertId: 1, warningStatus: 0 }
  } catch (err) {
    throw err;
  } finally {
    if (conn) await conn.end(); // release the connection back to the pool
  }
}

async function asyncSaveDataBulkDB(arr) {
  let conn;
  try {
    conn = await pool.getConnection();
    // batch() runs the statement once per array element;
    // await it so the connection is not released before it completes
    const res = await conn.batch("INSERT INTO `test` (string1) values (?)", arr);
    console.log(res); // result summary, e.g. affectedRows: 2
  } catch (err) {
    throw err;
  } finally {
    if (conn) await conn.end();
  }
}

if (require.main === module) {
  // run directly: do a quick bulk-insert test, then close the pool
  asyncSaveDataBulkDB(['tt6', 'test 8']).then(() => pool.end());
} else {
  module.exports = { asyncSaveDataDB, asyncSaveDataBulkDB };
}
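
Assuming the module above is saved as save.js (a file name chosen here for illustration), it can be consumed from another script like so:

// hypothetical consumer; './save' is an assumed file name
const { asyncSaveDataDB, asyncSaveDataBulkDB } = require('./save');

asyncSaveDataDB('hello')
  .then(() => asyncSaveDataBulkDB(['a', 'b']))
  .then(() => console.log('done'));

Note that the pool inside save.js keeps the process alive; export and call pool.end() as well if a clean exit is needed.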

config.js might look like the following:

module.exports = {
  database: {
    host: "185.221.154.249",
    user: "xxxxxxxxx",
    password: "xxxxxxxxx",
    database: "xxxxxxxxx"
  }
};

Docs on MariaDB with Node.js


CentOS 7, Node.js, MySQL connect

I have MariaDB installed on my VDS running CentOS 7.
I installed the mysql npm package:

npm i mysql

Yet requiring that package with

var mysql = require('mysql');

did not lead to a successful connection, while

var mysql = require('mariadb');

did.

var mysql = require('mariadb');

// createConnection() returns a promise that resolves to a connection
mysql.createConnection({
  host: "localhost",
  user: "admin_default",
  password: "xxxxxx",
  database: "admin_default"
}).then(function (conn) {
  console.log('connected!');
}, function (err) {
  console.log(err);
});

VPS/VDS firewall settings to let Node.js be accessible externally

Problem

I’ve made a simple Node.js server on my VDS:

var http = require('http');
var port = 9999; // declare the port where both listen() and the log can see it
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(port, '0.0.0.0');
console.log('Server running at port ' + port);

It works, outputting:

Server running at port 9999

yet I can’t reach it at the VPS/VDS IP where the code resides: http://webscraping.pro:9999/. How can this be solved?
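
On CentOS 7 the usual culprit is firewalld blocking the port. A minimal sketch of a fix, assuming the default public zone is active:

sudo firewall-cmd --zone=public --add-port=9999/tcp --permanent
sudo firewall-cmd --reload
sudo firewall-cmd --list-ports   # verify that 9999/tcp is now listed

If the hosting provider runs an external (control-panel) firewall as well, the port must also be opened there.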


Fast scrape of a simple website using Node.js, Apify & Cheerio scraper

We recently composed a scraper that extracts data from a static site. By a static site, we mean one that does not use JS scripting to load or transform on-page data.

If you are interested in scraping a JS-rendered site, please read the following: Scraping a Javascript-dependent website with puppeteer.

Technology stack

  1. Node.js, the server-side JS environment. The main characteristic of Node.js is asynchronous code execution.
  2. Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let’s highlight its excellent characteristics:
  • automatically scales a pool of headless Chrome/Puppeteer instances
  • maintains queues of URLs to crawl (handled, pending) – this makes it possible to recover from crawler failures and resume crawling
  • saves crawl results to a convenient [json] dataset (local or in the cloud)
  • supports proxy rotation

We’ll use Apify’s Cheerio crawler to crawl and extract data from the target site, https://www.ebinger-gmbh.com/.
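
A minimal sketch of such a crawler with the Apify SDK (v1-style API); the selector and the stored fields are illustrative assumptions, not the project’s final code:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.ebinger-gmbh.com/' });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // $ is the Cheerio handle over the downloaded static HTML
        handlePageFunction: async ({ request, $ }) => {
            await Apify.pushData({
                url: request.url,
                title: $('title').text().trim(), // illustrative field
            });
        },
    });

    await crawler.run();
});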

Node.js, Apify, how to fill in RequestQueue from txt file

When working with Apify crawlers, it’s necessary to init a RequestQueue. How do we fill the RequestQueue from a txt file?

Given

A text file with urls to crawl; in our case it’s categories.txt. We’ll use the line-reader Node package to open and iterate the file line by line.
To install line-reader:

npm i --save line-reader  

Since requestQueue methods return a Promise, when iterating over the lines of the file we need to await the addition of each line as a url into the requestQueue before querying the queue state.

The code

const Apify = require('apify');
const lineReader = require('line-reader');

const queue_name = 'ebinger';
const base_url = 'https://www.ebinger.com/';

Apify.main(async () => {

    const requestQueue = await Apify.openRequestQueue(queue_name);

    // wrap line-reader's callback API in a Promise so we only read
    // the queue info after every line has been added
    await new Promise((resolve, reject) => {
        lineReader.eachLine('categories.txt', (line, last, cb) => {
            //console.log('adding ', line);
            let url = base_url + line.trim();
            requestQueue.addRequest({ url: url }).then(() => cb(), reject);
        }, (err) => err ? reject(err) : resolve());
    });

    var { totalRequestCount, handledRequestCount, pendingRequestCount, name } = await requestQueue.getInfo();
    console.log(`RequestQueue "${name}" with requests:`);
    console.log(' handledRequestCount:', handledRequestCount);
    console.log(' pendingRequestCount:', pendingRequestCount);
    console.log(' totalRequestCount:', totalRequestCount);
...

LinkedIn scraping guidelines

The LinkedIn crawl success rate is low: a single bot request might require several retries to succeed. So here we share the crucial LinkedIn scraping guidelines.

  1. Rate limit
    Limit the crawling rate for LinkedIn. An acceptable approximate frequency is 1 request per second, i.e. 60 requests per minute (see the sketch after this list).
  2. Public pages only
    LinkedIn allows bots to access only public pages; private pages cannot be crawled.
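
A minimal rate-limiting sketch in Node.js; fetchPage is a hypothetical placeholder for whatever HTTP client or headless-browser call is actually used:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlWithRateLimit(urls, fetchPage) {
    for (const url of urls) {
        const started = Date.now();
        try {
            await fetchPage(url); // hypothetical: e.g. an HTTP GET or a Puppeteer visit
        } catch (err) {
            console.error('request failed, consider a retry:', url, err.message);
        }
        // keep the pace at no more than 1 request/second (60 per minute)
        const elapsed = Date.now() - started;
        if (elapsed < 1000) await sleep(1000 - elapsed);
    }
}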

Scraping a Javascript-dependent website with puppeteer

Support us by purchasing the book (under $5) on this topic.

In today’s Web 2.0, many business websites utilize JavaScript to protect their content from web scraping and other undesired bot visits. In this article we share the theory and the practical implementation of how to scrape JS-dependent/JS-protected websites.
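
As a taste of the approach, a minimal sketch (the target URL is a placeholder, not the article’s full solution): load the page in headless Chrome, let its scripts run, then read the rendered DOM:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    // wait until the page's own JS has settled before reading the DOM
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });
    const title = await page.evaluate(() => document.title);
    console.log(title);
    await browser.close();
})();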


Node.js, Puppeteer, Apify for Web Scraping (Xing scrape) – part 2

In this post we share the practical implementation (code) of the Xing companies scrape project using Node.js, Puppeteer and the Apify library. The first post, describing the project objectives, algorithm and results, is available here.

You can look at the scrape algorithm here.


Using Modern Tools such as Node.js, Puppeteer, Apify for Web Scraping (Xing scrape)

I want to share with you the practical implementation of modern scraping tools for scraping JS-rendered websites (pages loaded dynamically by JavaScript). You can read more about scraping JS-rendered content here.


New Death By Captcha Node.js and iMacros API examples

The deathbycaptcha.com service, one of the oldest and most consistent services in the captcha-solving market, has recently added new Node.js API instructions and examples for solving reCAPTCHA v2 challenges. Click the link to check the API details!