Author: admin

How to remove from JS array duplicate elements

Post author By admin
Post date January 26, 2021
No Comments on How to remove from JS array duplicate elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove duplicate elements from an array. The magic is to make it a Set and then use Spread syntax to turn it back to array.

links = [];
$('div.items').each((index, el) => { 
    let link = $(el).attr('href');						     
    links.push(link);
}); 
// remove repeating links
links = [...new Set(links)]

Tags HTML, Javascript

Development

How to remove from JS array empty or `undefined` elements

Post author By admin
Post date January 26, 2021
No Comments on How to remove from JS array empty or `undefined` elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove empty elements from it.

images = [];
$('div.image').each((index, el) => { 
    let url = $(el).attr('src');						     
    images.push(url);
}); 
// remove invalid images
images = images.filter(function(img){
    return img && !img.includes('undefined')
});

Tags HTML, Javascript

Development

How to check if a target page loads data thru XHR (Ajax)

Post author By admin
Post date January 22, 2021
No Comments on How to check if a target page loads data thru XHR (Ajax)

When performing web scaping I first need to evaluate a site’s difficulty level. That is how difficult is it for the scrape procedures? Do its pages make extra XHR (Ajax) calls? Based on that I choose whether to use (1) Request scraper (eg. Cheerio) or (2) Browser automation scraper (eg. Puppeteer).

So, I’ve discovered an Apify Web Page Analyzer, a free scraper agent that analyses a target site and returns inclusive JSON data of the target web page. The presence of XHR (AJAX) helps me to decide what type of crawler to use for scraping that website.

Tags Node.js, scraper

Data Science

Classification vs Clustering in Machine Learning

Post author By admin
Post date January 20, 2021
No Comments on Classification vs Clustering in Machine Learning

In the post we share some basics of classification and clustering in Machine learning. We also review some of the cluster analysis methods and algorithms.

Tags clustering, data mining, machine learning

Development

Open CSV file of UTF-8 encoding in Excel

Post author By admin
Post date January 19, 2021
No Comments on Open CSV file of UTF-8 encoding in Excel

Working with generated UTF-8 files it’s a challenge to view them properly in Excel. Gladly there is a method how to properly open them in Excel.

Tags visualization

Data Science

Weibull distribution & sample averages approximation using Python and scipy

Post author By admin
Post date January 12, 2021
No Comments on Weibull distribution & sample averages approximation using Python and scipy

In this post we share how to plot distribution histogram for the Weibull ditribution and the distribution of sample averages as approximated by the Normal (Gaussian) distribution. We’ll show how the approximation accuracy changes with samples volume increase.

One may get the full .ipynb file here.

Tags data mining, Python

Development

Invalid data, what it is?

Post author By admin
Post date January 8, 2021
No Comments on Invalid data, what it is?

Often we see “invalid data”, “clean data”, “normalize data”. What does it mean as to practical data extraction and how does one deal with that? One shot is better than 1000 words though:

Tags data mining

Data Science

Simple text analysis with Python

Post author By admin
Post date January 1, 2021
No Comments on Simple text analysis with Python

Finding the most similar sentence(s) to a given sentence in a text in less than 40 lines of code 🙂

Tags data mining, machine learning, Python

Development

Backconnect Proxy Service with authorization in JAVA

Post author By admin
Post date December 25, 2020
No Comments on Backconnect Proxy Service with authorization in JAVA

Working with a Backconnect proxy service (Oxylab.io) we spent a long time looking for a way to authorize it. Originally we used JSoup to get the web pages’ content. The proxy() method can be used there when setting up the connection, yet it only accepts the host and port, no authentication is possible. One of the options that we found, was the following:

Tags JAVA, proxy, service

Development

Strip HTML tags with and without inner content in JavaScript

Post author By admin
Post date December 15, 2020
No Comments on Strip HTML tags with and without inner content in JavaScript

function strip_tags(str){
   const tags = ['a', 'em', 'div', 'span', 'p', 'i', 'button', 'img' ];
   const tagsAndContent = ['picture', 'script', 'noscript', 'source'];  	 
   for(tag of tagsAndContent){ 
      let regex = new RegExp( '<' + tag+ '.*?</' + tag + '>', 'gim');
      str = str.replace( regex ,"");
   }
   for(tag of tags){
      let regex1 = new RegExp( '<' + tag+ '.*?>', 'gim');
      let regex2 = new RegExp( '</' + tag+ '>', 'gim');
      str = str.replace(regex1,"").replace(regex2,""); 
   } 
   return str;
}

Tags HTML, Javascript