Tag: Javascript

How to bypass PerimeterX

You’ve found the website you need to scrape, set up your scraper and launched it, only to realize that PerimeterX has blocked you.

PerimeterX’s bot detection system combines server-side and client-side checks to distinguish humans from bots. It deploys several layers of protection and, for the most part, does its job without interrupting the user experience.

But don’t despair! There are a couple of things you can try to bypass PerimeterX (now called HUMAN) before giving up on your goal of scraping that delicious data.

Tags anti-scrape, Javascript, scrape detection, Selenium

Development

Puppeteer async scraper with browsers number to be tuned based on CPU capacity

Post author By admin
Post date February 9, 2023
1 Comment on Puppeteer async scraper with browsers number to be tuned based on CPU capacity

Recently we got a tricky website with dynamic content to scrape. The data is loaded through XHRs into each part of the DOM (HTML markup). So the task was to develop an efficient scraper that works asynchronously while using reasonable CPU resources.
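
The idea of tuning the number of parallel browsers to the CPU can be sketched as follows. This is only a minimal illustration, not the scraper from the post: `browserCount`, `runPool` and the default cap of 8 browsers are hypothetical names and values, and the `worker` callback stands in for whatever launches a Puppeteer browser and scrapes one URL.

```javascript
const os = require('os');

// Hypothetical helper: decide how many browsers to run in parallel.
// Assumption: leave one core free for the Node.js process and cap the total.
function browserCount(cpuCount = os.cpus().length, maxBrowsers = 8) {
  return Math.max(1, Math.min(maxBrowsers, cpuCount - 1));
}

// Run `worker` over all items with at most `limit` calls in flight at once.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each lane pulls items one by one until none are left.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results; // same order as `items`
}
```

In a real scraper the call might look like `runPool(urls, browserCount(), scrapeOne)`, where `scrapeOne` is your own async function. The right cap depends on memory as much as on CPU, so treat the numbers as a starting point.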

Tags automation, browser-automation, Javascript, Node.js

Development

Simple Apify Puppeteer crawler

Post author By admin
Post date February 21, 2022
No Comments on Simple Apify Puppeteer crawler

const Apify = require('apify');

// Results collected from all crawled pages
const total_data = [];
const regex_name = /[A-Z][a-z]+\s[A-Z][a-z]+(?=\.|,|\s|\!|\?)/gm;
const regex_address = /stand:(<\/strong>)?\s+(\w+\s+\w+),?\s+(\w+\s+\w+)?/gm;
const regex_email = /(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))/i;

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue('123');
    await requestQueue.addRequest(new Apify.Request({ url: 'https://www.freeletics.com/de/pages/imprint/' }));
    await requestQueue.addRequest(new Apify.Request({ url: 'https://di1ara.com/pages/impressum' }));
    console.log('\nStart PuppeteerCrawler\n');
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            const page_content = await page.content();
            console.log(`Page content size:`, page_content.length);
            const obj = { 'url': request.url };
            const names = [];
            let m; // current regex match, reused by the loops below

            console.log('Names:');
            while ((m = regex_name.exec(page_content)) !== null) {
                // This is necessary to avoid infinite loops with zero-width matches
                if (m.index === regex_name.lastIndex) {
                    regex_name.lastIndex++;
                }

                // The result can be accessed through the `m`-variable.
                m.forEach((match, groupIndex) => {
                    console.log(`Found match, group ${groupIndex}: ${match}`);
                    // Skip groups that did not take part in the match
                    if (match !== undefined) {
                        names.push(match);
                    }
                });
            }
            obj['names'] = names.join(', ');

            console.log('\nAddress:');
            while ((m = regex_address.exec(page_content)) !== null) {
                // This is necessary to avoid infinite loops with zero-width matches
                if (m.index === regex_address.lastIndex) {
                    regex_address.lastIndex++;
                }

                m.forEach((match, groupIndex) => {
                    console.log(`Found match, group ${groupIndex}: ${match}`);
                });
                // Drop the leading "stand:</strong>" label and any stray "<"
                let address = m[0].includes('</strong>') ? m[0].split('</strong>')[1] : m[0];
                address = address.replace('<', '');
                obj['address'] = address;
            }

            console.log('\nEmail:');
            // regex_email has no `g` flag, so exec() always returns the first match
            m = regex_email.exec(page_content);
            if (m !== null) {
                m.forEach((match, groupIndex) => {
                    console.log(`Found match, group ${groupIndex}: ${match}`);
                });
                obj['email'] = m[0];
            }

            total_data.push(obj);
            console.log(obj);
        },
        maxRequestsPerCrawl: 2000000,
        maxConcurrency: 20,
    });
    await crawler.run();
    console.log('Total data:');
    console.log(total_data);
});

Tags crawling, Javascript, Puppeteer

Development

Headless Chrome detection and anti-detection

Post author By admin
Post date January 29, 2021
No Comments on Headless Chrome detection and anti-detection

In this post we summarize how to detect the headless Chrome browser and how to bypass that detection. Headless browser testing is an important part of today’s Web 2.0. If we look at the JS of some sites, we find it checking many fields of the browser. These fields are similar to those collected by fingerprintjs2.

So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.

See the test results of disguising the browser automation for both Selenium and Puppeteer extra.
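
To give a feel for what such checks look like, here is a minimal sketch that inspects a few commonly cited `navigator` fields. The function name `headlessSignals` and the choice of signals are assumptions for illustration, not the list from the full post; newer headless Chrome versions may not trip all of them, and real detectors combine many more attributes.

```javascript
// Returns the names of the headless signals found on a navigator-like object.
// `nav` is passed in (instead of using the global `navigator`) so that the
// function can also be tried outside a browser with a plain object.
function headlessSignals(nav) {
  const signals = [];
  // Automation-driven browsers typically expose navigator.webdriver === true
  if (nav.webdriver === true) signals.push('webdriver');
  // Headless Chrome has advertised itself as "HeadlessChrome" in the user agent
  if (/HeadlessChrome/.test(nav.userAgent || '')) signals.push('userAgent');
  // An empty plugin list has been a common headless trait
  if (!nav.plugins || nav.plugins.length === 0) signals.push('plugins');
  // So has an empty languages list
  if (!nav.languages || nav.languages.length === 0) signals.push('languages');
  return signals;
}
```

In a page you would call `headlessSignals(navigator)`. Bypassing the detection then amounts to making each of these fields report the value a regular browser would.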

Tags anti-scrape, headless, Javascript, scrape detection, scrape protection

Development

How to remove from JS array duplicate elements

Post author By admin
Post date January 26, 2021
No Comments on How to remove from JS array duplicate elements

Often when scraping, I collect images or other sequential elements into an array. Afterwards, I need to remove duplicate elements from that array. The trick is to turn it into a Set and then use spread syntax to turn it back into an array.

let links = [];
$('div.items').each((index, el) => {
    const link = $(el).attr('href');
    links.push(link);
});
// remove repeating links
links = [...new Set(links)];

Tags HTML, Javascript

Development

How to remove from JS array empty or `undefined` elements

Post author By admin
Post date January 26, 2021
No Comments on How to remove from JS array empty or `undefined` elements

Often when scraping, I collect images or other sequential elements into an array. Afterwards, I need to remove empty elements from it.

let images = [];
$('div.image').each((index, el) => {
    const url = $(el).attr('src');
    images.push(url);
});
// remove invalid images: empty values and URLs containing the string 'undefined'
images = images.filter(function(img){
    return img && !img.includes('undefined');
});

Tags HTML, Javascript

Development

Strip HTML tags with and without inner content in JavaScript

Post author By admin
Post date December 15, 2020
No Comments on Strip HTML tags with and without inner content in JavaScript

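function strip_tags(str){
   // tags to strip while keeping their inner content
   const tags = ['a', 'em', 'div', 'span', 'p', 'i', 'button', 'img' ];
   // tags to strip together with their inner content
   const tagsAndContent = ['picture', 'script', 'noscript', 'source'];
   for (const tag of tagsAndContent){
      // \b keeps the tag name exact; [\s\S] lets the content span line breaks
      const regex = new RegExp('<' + tag + '\\b[^>]*>[\\s\\S]*?</' + tag + '>', 'gi');
      str = str.replace(regex, "");
   }
   for (const tag of tags){
      // \b prevents 'a' from matching <article>, 'p' from matching <pre>, etc.
      const regex1 = new RegExp('<' + tag + '\\b[^>]*>', 'gi');
      const regex2 = new RegExp('</' + tag + '>', 'gi');
      str = str.replace(regex1, "").replace(regex2, "");
   }
   return str;
}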

Tags HTML, Javascript

Development

JavaScript, Regex match groups

Post author By admin
Post date December 3, 2020
2 Comments on JavaScript, Regex match groups

Often we want only a certain piece of information from the matched content. Capture groups help with that.

The following example shows how to fetch the duplicate entry index from the error message. For that we take the 1st group, at index “1”:

const regex = /Duplicate entry\s'([^']+)'/gm;
const str = `{ Error: (conn=42434, no: 1062, SQLState: 23000) Duplicate entry '135' for key 'PRIMARY'
sql: INSERT INTO \`test\` (id , string1) values (?,?) - parameters:[[135,'string 756']]
    at Object.module.exports.createError (C:\\Users\\User\\Documents\\RnD\\Node.js\\mercateo-`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
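
The loop above prints every group of every match. When only the first match is needed, the same group can be read directly, as in this minimal sketch. The shortened `message` string and the variable names are made up for the illustration.

```javascript
// Shortened sample of the error message (hypothetical variable name)
const message = "Duplicate entry '135' for key 'PRIMARY'";

// Without the `g` flag, match() returns the full match at index 0
// and the capture groups at index 1, 2, ...
const found = message.match(/Duplicate entry\s'([^']+)'/);

// Group 1 holds the text between the quotes, or we fall back to null
const duplicateId = found ? found[1] : null;
console.log(duplicateId); // 135
```

Note that the captured value is a string, so convert it with `Number(duplicateId)` if a numeric id is needed.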

Tags Javascript, Regex

Development

Cheerio scraper escapes special symbols with html entities when performing .html()

Post author By admin
Post date December 1, 2020
1 Comment on Cheerio scraper escapes special symbols with html entities when performing .html()

When scraping data off the web, we developers use Node.js along with the handy Cheerio scraper. When fetching .html(), the Cheerio parser returns special symbols as HTML-encoded entities, e.g.:
ä as &#xE4;
ß as &#xDF;

The Cheerio developer’s justification of the parser behavior

(1) It’s not the job of a parser to preserve the original document.
(2) .html() returns an HTML representation of the parsed document, which doesn’t have to be equal to the original document.
source.
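
If the readable characters are needed back, one option is to post-process the string returned by .html(). The sketch below is an illustration under assumptions, not an official Cheerio recipe: it assumes the entities are numeric ones such as &#xE4;, and `decodeNonAsciiEntities` is a hypothetical helper name. Some Cheerio versions also accept a `decodeEntities: false` option in `cheerio.load()`, so check the documentation for the version you use.

```javascript
// Turn numeric entities for non-ASCII characters back into the characters.
// ASCII entities (e.g. &#x3C; for "<") are left untouched so that the
// markup itself is not changed.
function decodeNonAsciiEntities(html) {
  const decode = (entity, codePoint) =>
    codePoint > 127 && codePoint <= 0x10ffff
      ? String.fromCodePoint(codePoint)
      : entity;
  return html
    // hexadecimal form, e.g. &#xE4;
    .replace(/&#x([0-9a-f]+);/gi, (entity, hex) => decode(entity, parseInt(hex, 16)))
    // decimal form, e.g. &#228;
    .replace(/&#(\d+);/g, (entity, dec) => decode(entity, parseInt(dec, 10)));
}

console.log(decodeNonAsciiEntities('Stra&#xDF;e, M&#xE4;rz')); // Straße, März
```

Named entities such as &amp; are deliberately not handled here; a dedicated entity-decoding library is the safer choice when those matter.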