Recently we encountered a website that worked as usual in a regular browser, yet put up blocking measures as soon as we composed and ran a scraping script/agent against it.
In this post we'll take a look at how the scraping process went and the measures we took to overcome the blocking.
In this post we summarize how to detect the headless Chrome browser and how to bypass that detection. Headless browser testing should be a very important part of today's Web 2.0. If we look at the JS of some sites, we find that it checks many fields of the browser. These fields are similar to those collected by fingerprintjs2.
So in this post we consider most of them and show both how to detect the headless browser by those attributes and how to bypass that detection by spoofing them.
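As a rough illustration of attribute-based detection, here is a minimal sketch of a server-side check over a fingerprint that a page script has already collected. The function name `headless_signals` and the dictionary keys (`userAgent`, `webdriver`, `pluginCount`, `languages`) are assumptions made for this example; they are not taken from fingerprintjs2 or from any particular site's code. The four checks are commonly cited headless Chrome signals, and they vary between browser versions.

```python
from typing import Any


def headless_signals(fingerprint: dict[str, Any]) -> list[str]:
    """Return the names of attributes that look like headless Chrome.

    `fingerprint` is a hypothetical dict of values collected in the
    browser (e.g. from navigator.userAgent, navigator.webdriver).
    """
    signals = []

    # Headless Chrome has traditionally announced itself in the user agent.
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        signals.append("userAgent")

    # navigator.webdriver is true when the browser is driven by automation.
    if fingerprint.get("webdriver") is True:
        signals.append("webdriver")

    # An explicitly reported plugin count of zero is suspicious.
    if fingerprint.get("pluginCount") == 0:
        signals.append("plugins")

    # An explicitly reported empty language list is suspicious.
    if "languages" in fingerprint and not fingerprint["languages"]:
        signals.append("languages")

    return signals
```

Bypassing such a check amounts to making the scraper's browser report ordinary values for each flagged attribute, so that the returned list is empty.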
Given: a webpage to scrape.
If you inspect the DOM tree of that page, you will find that quite a few tags contain the keyword dist. For example:
<link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
<link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
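To list such tags without clicking through the DOM inspector, a short script can scan the page source. The sketch below uses only Python's standard `html.parser`; the names `DistTagFinder` and `find_dist_tags` are made up for this example, and matching on a `/dist/` path segment in `href` or `src` is an assumption about what "the keyword dist" looks like in practice. A match is only a hint, since many build tools also emit `dist` directories.

```python
from html.parser import HTMLParser


class DistTagFinder(HTMLParser):
    """Collect (tag, url) pairs whose href/src contains a /dist/ segment."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name in ("href", "src") and value and "/dist/" in value:
                self.matches.append((tag, value))
                break


def find_dist_tags(html):
    """Return every (tag, url) pair in `html` that references /dist/."""
    finder = DistTagFinder()
    finder.feed(html)
    finder.close()
    return finder.matches
```

Calling `find_dist_tags(page_source)` on the downloaded HTML returns one entry per matching tag, which makes it easy to see how widespread the keyword is on a given page.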
Imperva (which now includes the former Distil anti-bot management) is a service that provides many kinds of website protection. Its current services include the following:
- Cloud Web Application Firewall (WAF)
- Bot Protection service (formerly Distil Networks)
- IP Reputation Intelligence
- Content Delivery Network (CDN)
- Attack Analytics solution (e.g. DDoS)
As for protection against bot scraping, we note the following.
Distil scrape protection is prominent among modern anti-scraping techniques, so we want to share some tips on how to bypass it. If you are interested, please send an inquiry to the following email:
Cyber-attacks are becoming a real threat to businesses both small and large. The damage they do to people's lives is more severe than most presume. In 2019, hundreds of billions of dollars were lost this way, and the crime has yet to stop. As the threat landscape evolves, attacks are becoming more and more sophisticated. It has also become clear that even big companies cannot be 100% secure from such breaches. The real question is: if hackers manage to attack the big companies, how long would it take them to steal your data? The only way to handle this menace is to understand these basic security strategies and implement them.
For details on how to bypass Distil Networks, the anti-scraper protection, please contact us by email: igor [dot] savinkin [at] gmail [dot] com.
Web scraping is a technique that enables quick, in-depth data retrieval. It can help people in all fields capture massive amounts of data and information from the internet.