Categories
Development

Headless Chrome detection and anti-detection

In the post we summarize how to detect the headless Chrome browser and how to bypass the detection. The headless browser testing should be a very important part of todays web 2.0. If we look at some of the site’s JS, we find them to checking on many fields of a browser. They are similar to those collected by fingerprintjs2.

So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.

See the test results of disguising the browser automation for both Selenium and Puppeteer extra.

Categories
Development

How to remove from JS array duplicate elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove duplicate elements from an array. The magic is to make it a Set and then use Spread syntax to turn it back to array.

links = [];
$('div.items').each((index, el) => { 
    let link = $(el).attr('href');						     
    links.push(link);
}); 
// remove repeating links
links = [...new Set(links)]

See also How to remove from an array empty or undefined elements.

Categories
Development

How to remove from JS array empty or `undefined` elements

Often at scraping I collect images or other sequential elements into an array. Yet, afterwards I need to remove empty elements from it.

images = [];
$('div.image').each((index, el) => { 
    let url = $(el).attr('src');						     
    images.push(url);
}); 
// remove invalid images
images = images.filter(function(img){
    return img && !img.includes('undefined')
});

See also How to remove from array duplicate elements.

Categories
Development

Strip HTML tags with and without inner content in JavaScript

function strip_tags(str){
   const tags = ['a', 'em', 'div', 'span', 'p', 'i', 'button', 'img' ];
   const tagsAndContent = ['picture', 'script', 'noscript', 'source'];  	 
   for(tag of tagsAndContent){ 
      let regex = new RegExp( '<' + tag+ '.*?</' + tag + '>', 'gim');
      str = str.replace( regex ,"");
   }
   for(tag of tags){
      let regex1 = new RegExp( '<' + tag+ '.*?>', 'gim');
      let regex2 = new RegExp( '</' + tag+ '>', 'gim');
      str = str.replace(regex1,"").replace(regex2,""); 
   } 
   return str;
}
Categories
Development

JavaScript, Regex match groups

Often we want only a certain info from the matched content. So, groups help with that.

The following example shows how to fetch the [duplicate] entry index from the error message. For that we take 1st group, index “1”:

const regex = /Duplicate entry\s'([^']+)+'/gm;
const str = `{ Error: (conn=42434, no: 1062, SQLState: 23000) Duplicate entry '135' for key 'PRIMARY'
sql: INSERT INTO \`test\` (id , string1) values (?,?) - parameters:[[135,'string 756']]
    at Object.module.exports.createError (C:\\Users\\User\\Documents\\RnD\\Node.js\\mercateo-`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Categories
Development

Save input value on page refresh using sessionStorage

Categories
Development

Check if cookies are enabled

function areCookiesEnabled() 
	{
		var cookieEnabled = (navigator.cookieEnabled) ? true : false;
		if (typeof navigator.cookieEnabled == "undefined" && !cookieEnabled)
		{
			document.cookie = "test";
			cookieEnabled = (document.cookie.indexOf("test") != -1) ? true : false;
		}
		return cookieEnabled;
	}

Navigator is the interface represents the state and the identity of the user agent. It allows scripts to query it and to register themselves to carry on some activities.
Navigator object can be retrieved using the read-only window.navigator property.

Categories
Development

Check if sessionStorage is enabled

function isStorageEnabled() {
	try{
		sessionStorage.setItem("test","value");
		if(sessionStorage.getItem("test") == "value") {
			sessionStorage.removeItem("test");
			return true;
		} else {
			return false;
		}
	} catch(err) {
		return false;
	}
}
Categories
Challenge Development

Scraping a Javascript-dependent website with puppeteer

Support us by purchasing the book (under $5) on this topic.

In today’s web 2.0 many business websites utilize JavaScript to protect their content from web scraping or any other undesired bot visits. In this article we share with you the theory and practical fulfillment of how to scrape js-dependent/js-protected websites.

Categories
Development

Problem scraping javascript site – help needed

Problem

I am trying to scrape the page https://tienda.mercadona.es/categories/112 and I have installed the docker and followed all the required steps given in the post. Splash works well, but the spyder does not and I don’t know why. The IP of the splash_url is correct but I can’t see in the response object when I write scrapy shell “webpage” the complete page, ie, the page has not rendered correctly.