Categories
Development

Cheerio scraper escapes special symbols with html entities when performing .html()

When developers scrape data off the web, Cheerio's .html() function returns special symbols as HTML-encoded entities, e.g.:
ä as &#xE4;
ß as &#xDF;


Reply of the Cheerio developer:

(1) It’s not the job of a parser to preserve the original document. 
(2) .html() returns an HTML representation of the parsed document, which doesn’t have to be equal to the original document.
source.

URLs are already encoded (as seen in the Chrome inspector)

Yet some URLs are already HTML-entity encoded in the page itself,
e.g. the image URL of a product: https://assets.einhell.com/im/imf/y400/900_412992/einhell-classic-inverter-schweissgerät-tc-iw-170-detailbild-ohne-untertitel-8.jpg
It is viewed in the code inspector as the following:

Solution

We need to decode the HTML entities back to their character representation.

  1. HTML entities & hex code table
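  2. Regex to find HTML entities: &#?\w+; (the narrower \&\w{4}; matches only four-letter named entities such as &ouml; and misses hex ones such as &#xE4;)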
  3. The replacement object for HTML entities and hex:
replacer = {'&#xF6;': 'ö', '&ouml;': 'ö',
     '&#xE4;': 'ä', '&auml;': 'ä',
     '&#xFC;': 'ü', '&uuml;': 'ü',
     '&#xEB;': 'ë', '&euml;': 'ë',
     '&#xDF;': 'ß', '&szlig;': 'ß'
};

JavaScript code to decode the HTML entities back to characters

var str = '<strong>L&#xE4;nge</strong><strong>H&#xF6;he</strong>... L&#xE4;nge...';
function replace_hex_all(str){
	// map both the hex and the named form of each entity to its character
	const replacer = {'&#xF6;': 'ö', '&ouml;': 'ö',
			  '&#xE4;': 'ä', '&auml;': 'ä',
			  '&#xFC;': 'ü', '&uuml;': 'ü',
			  '&#xEB;': 'ë', '&euml;': 'ë',
			  '&#xDF;': 'ß', '&szlig;': 'ß'
	};
	for (const [key, value] of Object.entries(replacer)) {
	  // split/join replaces every occurrence of the key, not just the first
	  str = str.split(key).join(value);
	}
	return str;
}
console.log('new str: ', replace_hex_all(str));

Code at github: https://github.com/igorsavinkin/einhell-scraper/blob/main/hex_val_to_unicode.js
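
The lookup table above covers only the listed characters. As a minimal sketch of a more general approach, the function below decodes any numeric entity (hex or decimal) in a single regex pass and falls back to a small named-entity table. The names decodeEntities and NAMED are hypothetical, introduced here for illustration; they are not part of Cheerio or of the repository linked above, and the named table is deliberately incomplete.

```javascript
// Hypothetical helper (not a Cheerio API): decode numeric entities generically
// and a few named ones through a lookup table.
const NAMED = { auml: 'ä', ouml: 'ö', uuml: 'ü', euml: 'ë', szlig: 'ß', amp: '&' };

function decodeEntities(str) {
  // One pass over the string, so '&amp;#xE4;' becomes '&#xE4;' and is not decoded twice.
  return str.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z][a-zA-Z0-9]*);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeEntities('<strong>L&#xE4;nge</strong><strong>H&ouml;he</strong>'));
// <strong>Länge</strong><strong>Höhe</strong>
```

Entities that the sketch does not recognize stay in the text as they were, so it can be applied safely to markup that mixes encoded and plain characters.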

Categories
Development

Node.js, mariaDB, save data & bulk save

Install the mariadb package:

npm i mariadb

The code

const config = require("./config");
const db = config.database;
const mariadb = require('mariadb');
const pool = mariadb.createPool({
     host: db.host,
	 user: db.user,
	 password: db.password,
	 database: db.database,
     connectionLimit: 5
});
 
async function asyncSaveDataDB(data) {
  let conn;
  try {
	conn = await pool.getConnection();
	const rows = await conn.query("SELECT 1 as val");
	console.log(rows); //[ {val: 1}, meta: ... ]
	const res = await conn.query("INSERT INTO test (string1) value (?)", [data]);
	console.log(res); // { affectedRows: 1, insertId: 1, warningStatus: 0 }

  } catch (err) {
	throw err;
  } finally {
	if (conn) return conn.end();
  }
}

async function asyncSaveDataBulkDB(arr) {
  let conn;
  try {
	conn = await pool.getConnection();
	// await the batch so the connection is not closed before the insert finishes
	const res = await conn.batch("INSERT INTO `test` (string1) values (?)", arr);
	console.log(res); // 2

  } catch (err) {
	throw err;
  } finally {
	if (conn) return conn.end();
  }
}

// require.main !== module means the file was loaded by another module
if (require.main !== module) {
    module.exports = { asyncSaveDataDB, asyncSaveDataBulkDB }
} else {
    asyncSaveDataBulkDB(['tt6', 'test 8']);
}

The config.js file might look like the following:

module.exports = {
  database:{
    host: "185.221.154.249",
	user: "xxxxxxxxx",
	password: "xxxxxxxxx",
	database: 'xxxxxxxxx'
  }
}
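
For the bulk save above, the connector's documentation describes the values passed to batch() as an array of parameter arrays, one inner array per row. A minimal sketch of preparing scraped values in that shape, and of splitting a long list into smaller batches, could look like this. The helpers toBatchRows and chunk are hypothetical names introduced here, not part of the mariadb package, and the batch size of 2 is only for illustration.

```javascript
// Hypothetical helpers, not part of the mariadb package.

// Wrap each plain value in its own array so every row is a parameter array.
function toBatchRows(values) {
  return values.map(v => (Array.isArray(v) ? v : [v]));
}

// Split rows into consecutive groups of at most `size` rows.
function chunk(rows, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const out = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}

const rows = toBatchRows(['tt6', 'test 8', 'test 9']);
const chunks = chunk(rows, 2);
console.log(JSON.stringify(chunks));
// [[["tt6"],["test 8"]],[["test 9"]]]
```

Each element of chunks could then be passed as the second argument of conn.batch(), one awaited call per chunk.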

Docs on mariaDb with Node.js

Categories
Development

Centos 7, Node.js, MySQL connect

I have MariaDB installed on my VDS running CentOS 7.
I installed the mysql npm package:

npm i mysql

Yet requiring that package with

var mysql = require('mysql');

did not lead to a successful connection,
while

var mysql = require('mariadb');

did.

var mysql = require('mariadb');

// createConnection() returns a promise that resolves to the connection
mysql.createConnection({
  host: "localhost",
  user: "admin_default",
  password: "xxxxxx",
  database: 'admin_default'
}).then(function (con){
   console.log('connected!');
   return con.end(); // close the connection so the script can exit
}, function(err){
   console.log(err);
});
Categories
Development

How to find out that a website is Distil protected?

Given: a webpage to scrape.
If you inspect the DOM tree of that page, you will find that quite a few tags contain the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
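
The manual inspection above can be automated with a few lines. The sketch below is only an illustration of that check: findDistReferences is a hypothetical name, the regex looks at href and src attributes only, and a match is a hint rather than proof, since the same path segment can come from other sources.

```javascript
// Hypothetical helper: collect href/src values that contain a /dist/ path segment.
function findDistReferences(html) {
  const re = /\b(?:href|src)\s*=\s*["']([^"']*\/dist\/[^"']*)["']/gi;
  const hits = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    hits.push(m[1]);
  }
  return hits;
}

const sample =
  '<link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">' +
  '<link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">';

console.log(findDistReferences(sample).length); // 2
```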
Categories
Challenge

How Imperva protects against scraping bots

Imperva (which includes the former Distil anti-bot management) is a service providing many kinds of website protection. Its current services include:

  1. Cloud Web Application Firewall (WAF)
  2. Bot Protection service (formerly Distil Networks)
  3. IP Reputation Intelligence
  4. Content Delivery Network (CDN)
  5. Attack Analytics solution (e.g. DDoS)

As for protection against bot scraping activity, we mention the following.

Categories
Uncategorized

Java, Selenium, headless Chrome, JSoup to scrape data off the web

In this post we share how to scrape a JS-rendered website. The tools, as named in the title, are Java with the Selenium library driving headless Chrome instances (download driver) and JSoup as the parser that extracts data from the acquired HTML.

Categories
Development

Save input value on page refresh using sessionStorage

Categories
Development

Check if cookies are enabled

function areCookiesEnabled() {
	var cookieEnabled = !!navigator.cookieEnabled;
	// fall back to a write/read test when the browser does not expose navigator.cookieEnabled
	if (typeof navigator.cookieEnabled === "undefined") {
		document.cookie = "test";
		cookieEnabled = document.cookie.indexOf("test") !== -1;
	}
	return cookieEnabled;
}

Navigator is the interface that represents the state and identity of the user agent. It allows scripts to query it and to register themselves to carry on some activities.
The Navigator object can be retrieved using the read-only window.navigator property.

Categories
Development

Check if sessionStorage is enabled

function isStorageEnabled() {
	try{
		sessionStorage.setItem("test","value");
		if(sessionStorage.getItem("test") == "value") {
			sessionStorage.removeItem("test");
			return true;
		} else {
			return false;
		}
	} catch(err) {
		return false;
	}
}
Categories
Development

VPS/VDS firewall settings to let Node.js be accessible externally

Problem

I’ve made a simple Node.js server on a VDS:

var http = require('http');
var port = 9999; // declared at the top so both listen() and the log line can use it
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(port, '0.0.0.0');
console.log('Server running at port ' + port);

It runs and outputs:

Server running at port 9999

Yet I can’t reach it at the VPS/VDS address where the code resides: http://webscraping.pro:9999/. How can that be solved?