Today I got acquainted with the Node.js (and Python) bot garden/zoo, which provides modern bots that drive different kinds of browsers (Firefox, Chrome, headless and non-headless) using different automation frameworks (Puppeteer, Selenium, Playwright) in several programming languages.
Tag: node.js
Recently we got a tricky website with dynamic content to scrape. The data are loaded through XHRs into each part of the DOM (HTML markup). So the task was to develop an effective scraper that works asynchronously while using reasonable CPU resources.
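The scraper itself is not shown in this excerpt. As a rough illustration of doing asynchronous work while keeping CPU use reasonable, one common approach is to cap how many requests run at once. The sketch below assumes a hypothetical `runWithLimit` helper and a `fakeFetch` stand-in for a real page request; neither comes from the original post.

```javascript
// Run async tasks with at most `limit` of them in flight at once.
// `tasks` is an array of functions that each return a promise.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps claiming the next unclaimed task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Hypothetical stand-in for a page fetch: resolves after a short delay.
function fakeFetch(url) {
  return new Promise(resolve => setTimeout(() => resolve(`done: ${url}`), 10));
}

const urls = ['/a', '/b', '/c', '/d', '/e'];
runWithLimit(urls.map(u => () => fakeFetch(u)), 2)
  .then(results => console.log(results));
```

Results come back in the same order as the input tasks, regardless of which request finishes first.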

What is MERN?
The MERN stack (MongoDB, Express, React, Node.js) is a set of frameworks and tools used for developing a software product. They are specifically chosen to work together in creating well-functioning software (see the MERN app code at the bottom of the post).
node.exe index.js > scrape.log 2>&1
When executing index.js, we redirect all console.log() output from the console into the file scrape.log; the 2>&1 part sends error output (stderr) to the same file.
let table = $('table');
// .has() returns a collection, which is always truthy, so test its length
if (table.has('br').length) {
  $("br").replaceWith(" ");
}

In the previous post we shared how to disguise Selenium Chrome automation against fingerprint checks. In this post we show how to do the same using puppeteer-extra with the Stealth plugin. The test results are available as HTML files and screenshots.
When performing web scraping, I first need to evaluate a site’s difficulty level. That is, how difficult is it to scrape? Do its pages make extra XHR (Ajax) calls? Based on that, I choose whether to use (1) a request scraper (e.g. Cheerio) or (2) a browser automation scraper (e.g. Puppeteer).
So, I’ve discovered the Apify Web Page Analyzer, a free scraper agent that analyses a target site and returns comprehensive JSON data about the target web page. The presence of XHR (AJAX) calls helps me decide what type of crawler to use for scraping that website.
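The decision rule described above could be sketched roughly as follows. The `xhrRequests` field and the `chooseScraperType` function are hypothetical names used for illustration; the analyzer's real JSON output may be shaped differently.

```javascript
// Hypothetical analysis shape: { xhrRequests: [{ url, method }, ...] }.
// The field name is an assumption, not the analyzer's documented output.
function chooseScraperType(analysis) {
  const xhrCount = Array.isArray(analysis.xhrRequests)
    ? analysis.xhrRequests.length
    : 0;
  // No XHR calls observed: the data is probably in the initial HTML,
  // so a plain request scraper may be enough. Otherwise use a browser.
  return xhrCount === 0 ? 'request' : 'browser';
}

console.log(chooseScraperType({ xhrRequests: [] }));
console.log(chooseScraperType({ xhrRequests: [{ url: '/api/items', method: 'GET' }] }));
```

This is only a first-pass heuristic; a site with XHR calls can sometimes still be scraped by requesting those endpoints directly.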
Install the mariadb package:
npm i mariadb
The code
const config = require("./config");
const db = config.database;
const mariadb = require('mariadb');
const pool = mariadb.createPool({
  host: db.host,
  user: db.user,
  password: db.password,
  database: db.database,
  connectionLimit: 5
});
async function asyncSaveDataDB(data) {
  let conn;
  try {
    conn = await pool.getConnection();
    const rows = await conn.query("SELECT 1 as val");
    console.log(rows); //[ {val: 1}, meta: ... ]
    const res = await conn.query("INSERT INTO test (string1) value (?)", [data]);
    console.log(res); // { affectedRows: 1, insertId: 1, warningStatus: 0 }
  } catch (err) {
    throw err;
  } finally {
    // await instead of return: a return inside finally would swallow the rethrown error
    if (conn) await conn.end();
  }
}
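async function asyncSaveDataBulkDB(arr) {
  let conn;
  try {
    conn = await pool.getConnection();
    // await the batch so the connection is not closed while the insert is still running
    const res = await conn.batch("INSERT INTO `test` (string1) values (?)", arr);
    console.log(res); // 2
  } catch (err) {
    throw err;
  } finally {
    // await instead of return: a return inside finally would swallow the rethrown error
    if (conn) await conn.end();
  }
}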
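// module.parent is deprecated; require.main === module is true when this file is run directly
if (require.main === module) {
  asyncSaveDataBulkDB(['tt6', 'test 8']);
} else {
  module.exports = { asyncSaveDataDB, asyncSaveDataBulkDB };
}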
Config.js might look like the following:
module.exports = {
  database: {
    host: "185.221.154.249",
    user: "xxxxxxxxx",
    password: "xxxxxxxxx",
    database: 'xxxxxxxxx'
  }
}
CentOS 7, Node.js, MySQL connect
I have MariaDB installed on my VDS running CentOS 7.
I’ve installed the mysql npm package:
npm i mysql
Yet requiring that package with
var mysql = require('mysql');
did not lead to a successful connection.
While
var mysql = require('mariadb');
did.
var mysql = require('mariadb');
var con = mysql.createConnection({
  host: "localhost",
  user: "admin_default",
  password: "xxxxxx",
  database: 'admin_default'
}).then(function () {
  console.log('connected!');
}, function (err) {
  console.log(err);
});
//console.log(con);
Problem
I’ve made a simple Node.js server on a VDS:
var http = require('http');
var port = 9999; // declared at the top so both listen() and the log line can use it
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(port, '0.0.0.0');
console.log('Server running at port ' + port);
It works, printing:
Server running at port 9999
yet I can’t reach it at the VPS/VDS address where the code resides: http://webscraping.pro:9999/ How can I solve that?
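The excerpt leaves the question open. One hedged way to narrow it down is to check, on the VDS itself, whether the server answers over loopback. If it does, Node.js is listening correctly and the block is more likely outside the process, for example a firewall rule on port 9999 (CentOS 7 commonly runs firewalld) or a hosting-provider filter. That is an assumption about this setup, not something the post confirms. The helper names `startServer` and `checkLocal` below are hypothetical.

```javascript
const http = require('http');

// Start the same "Hello World" server and resolve once it is listening.
function startServer(port) {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('Hello World\n');
    });
    server.once('error', reject);
    server.listen(port, '0.0.0.0', () => resolve(server));
  });
}

// Request the server over loopback and resolve with the status code and body.
function checkLocal(port) {
  return new Promise((resolve, reject) => {
    const options = { host: '127.0.0.1', port, path: '/', agent: false };
    http.get(options, res => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => resolve({ status: res.statusCode, body }));
    }).on('error', reject);
  });
}

// Usage on the VDS (left commented out so the sketch does not bind a port by itself):
// startServer(9999)
//   .then(() => checkLocal(9999))
//   .then(result => console.log(result));
```

If the loopback check succeeds but the public address still times out, the next place to look is the firewall and the provider's port settings, not the Node.js code.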