Categories
Development

How to remove from JS array duplicate elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove duplicate elements from an array. The magic is to make it a Set and then use Spread syntax to turn it back to array.

links = [];
$('div.items').each((index, el) => { 
    let link = $(el).attr('href');						     
    links.push(link);
}); 
// remove repeating links
links = [...new Set(links)]

See also How to remove from an array empty or undefined elements.

Categories
Development

How to remove from JS array empty or `undefined` elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove empty elements from it.

images = [];
$('div.image').each((index, el) => { 
    let url = $(el).attr('src');						     
    images.push(url);
}); 
// remove invalid images
images = images.filter(function(img){
    return img && !img.includes('undefined')
});

See also How to remove from array duplicate elements.

Categories
Development

How to check if a target page loads data thru XHR (Ajax)

When performing web scaping I first need to evaluate a site’s difficulty level. That is how difficult is it for the scrape procedures? Do its pages make extra XHR (Ajax) calls? Based on that I choose whether to use (1) Request scraper (eg. Cheerio) or (2) Browser automation scraper (eg. Puppeteer).

So, I’ve discovered an Apify Web Page Analyzer, a free scraper agent that analyses a target site and returns inclusive JSON data of the target web page. The presence of XHR (AJAX) helps me to decide what type of crawler to use for scraping that website.

Categories
Development

Open CSV file of UTF-8 encoding in Excel

Working with generated UTF-8 files it’s a challenge to view them properly in Excel. Gladly there is a method how to properly open them in Excel.

Categories
Development

Invalid data, what it is?

Often we see “invalid data”, “clean data”, “normalize data”. What does it mean as to practical data extraction and how does one deal with that? One shot is better than 1000 words though:

Categories
Development

Backconnect Proxy Service with authorization in JAVA

Working with a Backconnect proxy service (Oxylab.io) we spent a long time looking for a way to authorize it. Originally we used JSoup to get the web pages’ content. The proxy() method can be used there when setting up the connection, yet it only accepts the host and port, no authentication is possible. One of the options that we found, was the following:

 

Categories
Development

Strip HTML tags with and without inner content in JavaScript

function strip_tags(str){
   const tags = ['a', 'em', 'div', 'span', 'p', 'i', 'button', 'img' ];
   const tagsAndContent = ['picture', 'script', 'noscript', 'source'];  	 
   for(tag of tagsAndContent){ 
      let regex = new RegExp( '<' + tag+ '.*?</' + tag + '>', 'gim');
      str = str.replace( regex ,"");
   }
   for(tag of tags){
      let regex1 = new RegExp( '<' + tag+ '.*?>', 'gim');
      let regex2 = new RegExp( '</' + tag+ '>', 'gim');
      str = str.replace(regex1,"").replace(regex2,""); 
   } 
   return str;
}
Categories
Development

JavaScript, Regex match groups

Often we want only a certain info from the matched content. So, groups help with that.

The following example shows how to fetch the [duplicate] entry index from the error message. For that we take 1st group, index “1”:

const regex = /Duplicate entry\s'([^']+)+'/gm;
const str = `{ Error: (conn=42434, no: 1062, SQLState: 23000) Duplicate entry '135' for key 'PRIMARY'
sql: INSERT INTO \`test\` (id , string1) values (?,?) - parameters:[[135,'string 756']]
    at Object.module.exports.createError (C:\\Users\\User\\Documents\\RnD\\Node.js\\mercateo-`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Categories
Development

Cheerio scraper escapes special symbols with html entities when performing .html()

As developers scrape data off the web, we use Node.js along with handy Cheerio scraper. When fetching .html() Cheerio parser returns the special symbols as HTML encoded entities, eg.:
ä as &auml;
ß as &#xDF;

Cheerio developer vindication of the parser action

(1) It’s not the job of a parser to preserve the original document. 
(2) .html() returns an HTML representation of the parsed document, which doesn’t have to be equal to the original document.
source.

Categories
Development

Node.js, mariaDB, save data & bulk save

Imstall mariadb package:

npm i mariadb

The code

const config = require("./config");
const db = config.database;
const mariadb = require('mariadb');
const pool = mariadb.createPool({
     host: db.host,
	 user: db.user,
	 password: db.password,
	 database: db.database,
     connectionLimit: 5
});
 
async function asyncSaveDataDB(data) {
  let conn;
  try {
	conn = await pool.getConnection();
	const rows = await conn.query("SELECT 1 as val");
	console.log(rows); //[ {val: 1}, meta: ... ]
	const res = await conn.query("INSERT INTO test (string1) value (?)", [data]);
	console.log(res); // { affectedRows: 1, insertId: 1, warningStatus: 0 }

  } catch (err) {
	throw err;
  } finally {
	if (conn) return conn.end();
  }
}

async function asyncSaveDataBulkDB(arr) {
  let conn;
  try {
	conn = await pool.getConnection();
	conn.batch("INSERT INTO `test` (string1) values (?)", arr)
    .then(res => {
         console.log(res); // 2
    });	 

  } catch (err) {
	throw err;
  } finally {
	if (conn) return conn.end();
  }
}

if (module.parent) {
    module.exports = { asyncSaveDataDB, asyncSaveDataBulkDB }
} else {
    asyncSaveDataBulkDB(['tt6', 'test 8']);
}

Config.js might look like the following:

module.exports = {
  database:{
    host: "185.221.154.249",
	user: "xxxxxxxxx",
	password: "xxxxxxxxx",
	database: 'xxxxxxxxx'
  }
}

Docs on mariaDb with Node.js