Categories
Development SaaS

Sequentum Cloud to bypass strict scrape blocking

In the modern web, the sites that hold valuable data (e.g. business directories, data aggregators, social networks and more) implement aggressive blocking measures, which can cause major extraction difficulties. How can modern scraping tools (e.g. Sequentum Cloud) still fetch data from actively protected sites?

Sequentum is a closed-source, point-and-click scraping platform that integrates everything we need to bypass anti-bot services, including management of browsers, device fingerprints, TLS fingerprints, IP rotation, user agents, and more. Sequentum has maintained its own custom scraping browser for more than a decade, and as one of the most mature solutions on the market, it supports atomic-level customization for each request and workflow step. As such, Sequentum Cloud is an out-of-the-box advanced scraping platform with no upfront requirement to stand up infrastructure, software, or proxies. It also has a very responsive support team, which helps in coming up to speed on one's approach and is quite unusual in the scraping industry. In this test, we configured a site with very aggressive blocking and, with some refinement of error detection and retry logic, were able to get some of the most protected data consistently over time.

For this test, we pointed their tool at a major brand on Zoro.com, a site with aggressive blocking. Initial attempts yielded 32K records, about 94% of the estimated 34K entries. We worked with support to learn how to tune the advanced error detection and retry logic included in the Sequentum platform to the behavior of the Zoro.com site, and were able to get 100% of the data. In this article we share what we have learned.

The overall test results for Sequentum Cloud and the Oxylabs API (shared in a separate post) are summarized in the comparison table below.

|                       | Success rate | Avg. seconds per page | Estimated cost | Rating |
|-----------------------|--------------|-----------------------|----------------|--------|
| Sequentum Cloud Agent | 100%         | 0.4 (with 10 browsers provided) | $12 ($3.75 per 1 GB of residential proxy traffic) | |
| Oxylabs' API          | 90%          | 11                    | ~$60 ($2 per 1000 requests) | |

The preconfigured API [of Oxylabs] is already built [and maintained] for the end user. The Sequentum Cloud Platform is rather an open tool, and agents can be customized in a myriad of ways. Hence it can take longer to build a working agent [compared to a ready API], but for the most part a custom agent is the better option to apply at industrial scale to one's custom use case.

Categories
Challenge Development

Playwright Scraper Undetected: Strategies for Seamless Web Data Extraction

Web scraping has become an essential tool for many businesses seeking to gather data and insights from the web. As companies increasingly rely on this method for analytics and pricing strategies, the techniques used in scraping are evolving. It is crucial for scrapers to simulate human-like behaviors to avoid detection by sophisticated anti-bot measures implemented by various websites.

Understanding the importance of configuring scraping tools effectively can make a significant difference in acquiring the necessary data without interruptions. The growth in demand for such data has led to innovations in strategies and technology that assist scrapers in navigating these challenges. This article will explore recent developments in tools and libraries that help enhance the functionality of web scraping procedures.
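
As a starting point, here is a minimal sketch (not taken from the article) of a Playwright script in Python with a few commonly used tweaks: a headful browser, the --disable-blink-features=AutomationControlled flag, a realistic user agent, viewport and locale, and a short human-like pause. The target URL and user-agent string are placeholders.

```python
# Hedged sketch: basic Playwright (Python) setup with common anti-detection tweaks.
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder target

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,  # headful browsers tend to look less like bots
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),  # placeholder, should match a real, current browser build
        viewport={"width": 1366, "height": 768},
        locale="en-US",
    )
    page = context.new_page()
    page.goto(URL, wait_until="domcontentloaded")
    page.wait_for_timeout(2000)  # brief, human-like pause before interacting
    print(page.title())
    browser.close()
```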

Categories
Development

Amazon scrape tip

Recently we received a requirement to scrape Amazon data in large quantities. So, first of all, I tested the data aggregator for anti-bot protection. For that I used the Scraping Enthusiasts Discord server, namely its Anti-bot channel.

Since Amazon is a huge data aggregator, we recommend readers get acquainted with the post Tips & Tricks for Scraping Business Directories.

Categories
Challenge Development

Node.js & Privacy Pass application for Cloudflare scrape solution

Over 7.59 million websites use Cloudflare protection, and 26% of them are among the top 100K websites worldwide. As Cloudflare establishes itself as the norm for service protection, chances are the site you want to scrape is more likely to use it than not.

When it comes to scraping websites, captchas and other types of protection have always been the main obstacle to providing reliable data collection solutions. And most often this leads one to consider bypass services, which aren't always free.
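
Before reaching for a bypass service, it can help to confirm that Cloudflare is actually in front of the target. Below is a hedged sketch (not from the post) that checks for the well-known `server: cloudflare` and `cf-ray` response headers, which Cloudflare typically sets; the URL is a placeholder.

```python
# Hedged sketch: detect whether a site appears to sit behind Cloudflare
# by inspecting response headers. Absence of these headers is not proof
# of absence, and their presence does not tell you which protections are on.
import requests

def behind_cloudflare(url: str) -> bool:
    resp = requests.get(url, timeout=15)
    headers = {k.lower(): v for k, v in resp.headers.items()}
    return headers.get("server", "").lower() == "cloudflare" or "cf-ray" in headers

if __name__ == "__main__":
    print(behind_cloudflare("https://example.com"))  # placeholder target
```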

Categories
Challenge Development

Undetected ChromeDriver in Python Selenium

Selenium comes with a default WebDriver that often fails to bypass scraping anti-bots. Yet you can complement it with Undetected ChromeDriver, a third-party WebDriver tool that will do a better job.

In this tutorial, you’ll learn how to use Undetected ChromeDriver with Selenium in Python and solve the most common errors.
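
As a preview, here is a minimal sketch of the basic usage, assuming the undetected-chromedriver package is installed (`pip install undetected-chromedriver`); the target URL is a placeholder. In practice you would still pair this with proxies and sensible request pacing.

```python
# Minimal sketch: undetected-chromedriver with Selenium in Python.
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--window-size=1366,768")

# uc.Chrome patches ChromeDriver to hide common automation markers.
driver = uc.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder target
    print(driver.title)
finally:
    driver.quit()
```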

Categories
Challenge Development

How to bypass PerimeterX

You've found the website you need to scrape, set up your scraper and fired it up, only to sadly realize PerimeterX has blocked you.

PerimeterX’s dynamically complex bot detection system relies on server-side and client-side checks to distinguish humans from bots. It deploys several layers of protection and, for the most part, manages to do its job without interrupting the user experience.

But don't fall into despair! There are a couple of things you can try to bypass PerimeterX (now called HUMAN) before giving up on your goal of scraping that delicious data.
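
For illustration only (this is a generic first step, not PerimeterX-specific and not guaranteed to work against it): sending requests with full, realistic browser headers through a rotating residential proxy. The proxy endpoint, credentials, and target URL below are placeholders.

```python
# Hedged sketch: realistic headers plus a rotating residential proxy via requests.
import requests

PROXY = "http://user:pass@residential-proxy.example:8000"  # placeholder endpoint
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)
session.proxies = {"http": PROXY, "https": PROXY}

resp = session.get("https://example.com", timeout=20)  # placeholder target
print(resp.status_code)
```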

Categories
Challenge Development

Discord Bot to detect on-site anti-scrape & scrape-proof tools

Today, I'll share a Discord server 1 and server 2 that host a bot able to detect multiple modern scrape-protection and scrape-detection means. The server channels with the bot are #antibot-test and #antibot-scan, respectively.

Categories
Challenge

Bot protected websites

We share here some bot-protected sites.

Categories
Challenge Development

Bypass GoDaddy Firewall thru VPN & browser automation

Recently we encountered a website that worked as usual, yet when we composed and ran a scraping script/agent, it put up blocking measures.

In this post we’ll take a look at how the scraping process went and the measures we performed to overcome that.

Categories
Development

Headless Chrome detection and anti-detection

In this post we summarize how to detect the headless Chrome browser and how to bypass that detection. Headless browser testing is a very important part of today's Web 2.0. If we look at some sites' JS, we find it checking many fields of the browser, similar to those collected by fingerprintjs2.

So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.

See the test results of disguising browser automation for both Selenium and Puppeteer Extra.
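
As a taste of the spoofing side, here is a hedged sketch (not the post's exact code) that hides one classic giveaway, navigator.webdriver, from Python Selenium by injecting a script before any page script runs via the Chrome DevTools Protocol; the target URL is a placeholder.

```python
# Hedged sketch: spoof navigator.webdriver in headless Chrome via Selenium + CDP.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
options.add_argument("--window-size=1366,768")

driver = webdriver.Chrome(options=options)
# Runs before each page's own JS, so fingerprinting code sees a "normal" value.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"},
)
driver.get("https://example.com")  # placeholder target
print(driver.execute_script("return navigator.webdriver"))  # expect None after spoofing
driver.quit()
```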