Today I got acquainted with a Node.js [and Python] bot garden/zoo that provides modern bots running different kinds of browsers (Firefox, Chrome, headless or not) and built on different automation frameworks (Puppeteer, Selenium, Playwright) in several programming languages.
We’ve already shared some Tips and Tricks for scraping business directories and data aggregator sites. Recently, though, someone asked us to scrape aggregators from within Google Sheets and/or MS Excel.
Recently we encountered a website that worked as usual, yet once we composed and ran a scraping script/agent against it, it put up blocking measures. In this post we’ll look at how the scraping process went and the measures we took to overcome the blocking.
Recently we encountered a powerful new scraping service called Web Scraper IDE [of Bright Data]. A live test and a thorough drill-in are coming soon. For now we want to highlight the main features that have strongly impressed us.
How do you handle cookies, the user-agent and headers when scraping with Java? For this we’ll use a static class, ScrapeHelper, that handles all of it. The class uses Jsoup library methods to fetch data from the server and parse the HTML into a DOM document.
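The post itself works in Java with Jsoup; the sketch below only illustrates the same three concerns (cookies, user-agent, extra headers) using Python’s standard library, so it is not the ScrapeHelper class. The names `DEFAULT_HEADERS`, `build_opener_with_cookies` and `build_request`, as well as the header values, are hypothetical.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical defaults; replace them with values suited to the target site.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_opener_with_cookies():
    """Return (opener, jar); the jar keeps cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, extra_headers=None):
    """Build a request carrying the default headers plus any overrides."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

A page would then be fetched with `opener.open(build_request(url)).read()`, and any cookies the server sets are sent back automatically on later calls made through the same opener.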
In a previous post, Headless Chrome detection and anti-detection, we considered ways to disguise an automated Chrome browser by spoofing some of its parameters. Here we’ll share the practical results of testing fingerprints against a benchmark for both human-operated and automated Chrome browsers.
In this post we summarize how to detect the headless Chrome browser and how to bypass the detection. Headless browser testing should be a very important part of today’s web 2.0. If we look at some sites’ JS, we find it checking many fields of the browser. They are similar […]
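As one small illustration of a detection signal: headless Chrome has by default announced itself with a `HeadlessChrome` token in its User-Agent string. The sketch below is a server-side check on that token, which is an assumption of this example and not the in-page JS checks the post refers to; the function name is hypothetical.

```python
def looks_like_headless_chrome(user_agent: str) -> bool:
    """Return True if the User-Agent carries the default headless token."""
    return "headlesschrome" in user_agent.lower()


# Example User-Agent strings (version numbers are illustrative only).
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
)
REGULAR_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
```

This check is trivially bypassed by overriding the User-Agent, which is why detection scripts tend to combine it with other browser fields.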
When performing web scraping I first need to evaluate a site’s difficulty level. That is, how difficult is it to scrape? Do its pages make extra XHR (Ajax) calls? Based on that I choose whether to use (1) a request scraper (e.g. Cheerio) or (2) a browser automation scraper (e.g. Puppeteer). So, I’ve discovered an […]
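One rough way to make that choice, offered here as an assumption and not as the method the post goes on to describe: fetch the raw HTML without running any JS and check whether a value you can see in the browser is already present. If it is missing, the page most likely loads it through extra XHR calls. The function name and return values are hypothetical.

```python
def suggest_scraper(raw_html: str, expected_text: str) -> str:
    """Suggest a scraper type from the raw (non-rendered) HTML.

    Returns "request" when the expected text is already in the HTML,
    otherwise "browser", since the data is probably injected by JS/XHR.
    """
    if expected_text.lower() in raw_html.lower():
        return "request"
    return "browser"


# Illustrative inputs: a static page and an empty JS-rendered shell.
STATIC_PAGE = "<html><body><h1>Acme Ltd</h1><p>Phone: 555-0100</p></body></html>"
JS_SHELL = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"
```

The heuristic is cheap to run but can mislead when the text also appears in an inline script, so it is best treated as a first hint only.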
When we use Selenium or Node.js + Puppeteer to run [headless] Chrome/Chromium, we might need to launch the browser with some extra functionality/conditions. Below you’ll find all kinds of conditions and their explanations. How to use command line switches? The Chromium team has made a page that briefly explains how to use these switches.
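A minimal sketch of assembling such switches, assuming a Chrome binary named `google-chrome` is on the PATH (that name, the function name and the default window size are assumptions of this example). `--headless`, `--window-size` and `--user-agent` are standard Chromium switches.

```python
def build_chrome_command(binary="google-chrome", headless=True,
                         window_size=(1280, 800), user_agent=None,
                         extra_switches=()):
    """Return the argument list for launching Chrome with switches."""
    args = [binary]
    if headless:
        args.append("--headless")
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    if user_agent:
        # Overrides the User-Agent string the browser reports.
        args.append(f"--user-agent={user_agent}")
    args.extend(extra_switches)
    return args
```

The resulting list can be passed to `subprocess.run` with a URL appended. With Selenium, each switch after the binary name can instead be given to `options.add_argument(...)`, and with Puppeteer the same strings go into the `args` launch option.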