Cloudflare protects more than 7.59 million websites, including roughly 26% of the top 100K sites worldwide. As Cloudflare becomes the norm for service protection, the site you want to scrape is more likely to use it than not.
When it comes to scraping websites, captchas and other types of protection have always been the main obstacle to reliable data collection. Most often, this leads developers to consider bypass services, which aren’t always free.
If you are still here, you are in luck: I will share what should by now be an important technique in every scraper developer’s toolkit.
In November 2017, Cloudflare released the first implementation of a protocol called Privacy Pass. Its main purpose, according to Cloudflare, is to balance privacy and security.
The protocol, implemented by their browser extension available on
both Chrome and Firefox, enables the user to generate 30 redeemable tokens per captcha solved. These tokens are automatically used to bypass the challenges whenever you are navigating to a website with Cloudflare protection.
The extension works with two providers, Cloudflare and hCaptcha.
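Since each solved captcha yields 30 tokens, it is easy to estimate how many captchas you need to solve up front. The sketch below is only a back-of-the-envelope helper: the function name is made up for illustration, and it assumes that every bypassed challenge spends exactly one token.

```typescript
// Assumption: each bypassed challenge spends exactly one token.
const TOKENS_PER_CAPTCHA = 30; // figure quoted in this article

// Hypothetical helper: how many captchas must be solved by hand to cover
// the number of challenges we expect to bypass.
function captchasToSolve(expectedBypasses: number): number {
  if (!Number.isInteger(expectedBypasses) || expectedBypasses < 0) {
    throw new RangeError("expectedBypasses must be a non-negative integer");
  }
  return Math.ceil(expectedBypasses / TOKENS_PER_CAPTCHA);
}

console.log(captchasToSolve(100)); // 4, since 3 captchas yield only 90 tokens
```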
While the intent of this protocol is to save users from solving a challenge on every navigation or access, developers can leverage it to bypass the protection altogether. Furthermore, we can do so programmatically in just two steps.
Step 1 – Extract the Tokens from the Extension
So far, the only known way to generate these blinded tokens is through the extension. Once that is done, we can extract them from the browser’s local storage to a file in our application.
See the video below for the full process.
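Once the tokens are saved to a file, your application needs a way to hand them out one at a time and keep track of what is left. Here is a minimal sketch of such a pool. It assumes the exported file contains a JSON array and treats every token as an opaque value; the TokenPool class and its method names are hypothetical and are not part of the extension or of any package.

```typescript
// Tokens are treated as opaque JSON values; their real shape is defined by
// the extension and by the redeemer package, not by this sketch.
type StoredToken = unknown;

class TokenPool {
  private tokens: StoredToken[];

  constructor(tokens: StoredToken[]) {
    this.tokens = [...tokens];
  }

  // Parse the JSON text exported from the browser (assumed: a JSON array).
  static fromJson(text: string): TokenPool {
    const parsed: unknown = JSON.parse(text);
    if (!Array.isArray(parsed)) {
      throw new Error("expected a JSON array of tokens");
    }
    return new TokenPool(parsed);
  }

  // Number of tokens that have not been handed out yet.
  get remaining(): number {
    return this.tokens.length;
  }

  // Remove and return the next token, so the same token is never used twice.
  take(): StoredToken {
    if (this.tokens.length === 0) {
      throw new Error("token pool is empty; solve another captcha to refill");
    }
    return this.tokens.shift();
  }

  // Serialize the unspent tokens so they can be written back to the file.
  toJson(): string {
    return JSON.stringify(this.tokens);
  }
}
```

In a Node application you would read the file contents into a string, pass it to TokenPool.fromJson, and write the result of toJson back to disk after each take so that a restart does not reuse a spent token.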
Step 2 – Redeem the Token in your Application
After you load a token into a variable, install the privacy-pass-redeemer-node NPM package and use its getRedemptionHeader function to redeem the token and generate the headers to inject into your request, so that it passes the captcha.
import axios from 'axios';
import { getRedemptionHeader, PrivacyPassToken } from 'privacy-pass-redeemer';

// Load the token into a variable.
const token: PrivacyPassToken = { /* ... */ };

// Redeem the token while generating the header.
const redemptionTokenHeader = getRedemptionHeader(
  token,
  'https://example.com/some/path',
  'GET'
);

// Inject the header into the request.
const resp = await axios.get('https://example.com/some/path', {
  headers: redemptionTokenHeader
});
Note that the snippet above was taken from the package’s NPM page.
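Because tokens are a limited resource, you may prefer to spend one only when a request is actually challenged. The sketch below shows one possible way to do that. It does not depend on any HTTP client or on the redeemer package: the fetcher and the header supplier are passed in as functions, all names are hypothetical, and the status codes used to recognize a challenge are placeholder assumptions you should adjust to your target.

```typescript
type HeaderMap = Record<string, string>;
type SimpleResponse = { status: number; body: string };
type Fetcher = (url: string, headers: HeaderMap) => Promise<SimpleResponse>;

// Assumption: a challenged request is recognizable by its status code.
// 403 and 503 are placeholders; adjust them to what the target site returns.
function isChallenged(response: SimpleResponse): boolean {
  return response.status === 403 || response.status === 503;
}

// Try the request without a token first; only if it is challenged, ask for
// a redemption header (which spends one token) and retry once.
async function fetchWithRedemption(
  url: string,
  fetcher: Fetcher,
  nextRedemptionHeader: () => HeaderMap | undefined,
): Promise<SimpleResponse> {
  const first = await fetcher(url, {});
  if (!isChallenged(first)) {
    return first;
  }
  const header = nextRedemptionHeader();
  if (header === undefined) {
    return first; // no tokens left, hand back the challenged response
  }
  return fetcher(url, header);
}
```

With axios, the fetcher would wrap axios.get, and nextRedemptionHeader would call getRedemptionHeader with the next unused token, or return undefined when none are left.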
A more practical scenario is a web automation tool like Puppeteer or HeroJS, where the browser performs the requests automatically as we access the website.
For our project, we had to use HeroJS, and here is why.
Hero is a free and open source headless browser that’s designed specifically for scraping, rather than just automated testing.
Hero’s impressive features include a fully compliant DOM recreated in NodeJS, allowing web developers to easily bypass previous scraper tool headaches. Additionally, the powerful Chrome engine provides lightning-fast rendering and emulators allow you to disguise your script as practically any browser.
And the best part? Hero avoids detection along the entire stack, ensuring that you won’t be blocked because of TLS fingerprints in your networking stack.
import Hero from "@ulixee/hero";
import Server from "@ulixee/server";
import { getRedemptionHeader, PrivacyPassToken } from "privacy-pass-redeemer";

(async () => {
  const url = "https://example.com/some/path";

  // Set up a local Hero server and connect a client to it.
  const server = new Server();
  await server.listen({
    port: 8080,
  });
  const client = new Hero({
    connectionToCore: {
      host: `ws://localhost:8080`,
    },
  });

  // Load the token into a variable.
  const token: PrivacyPassToken = { /* ... */ };

  console.log("[+] Attempting to bypass captcha ....");
  await client.goto(url);
  // Fetch from inside the browser context with the redemption header attached.
  await client.fetch(url, {
    headers: getRedemptionHeader(token, url, "GET"),
  });

  // Shut down the browser and the server so the script can exit.
  await client.close();
  await server.close();
})();
Note that the snippet above was taken from the GitHub repository below.
To use Privacy Pass in this case, perform a fetch operation in the browser’s context, attaching the headers obtained through redemption, whenever you detect a captcha. If your application frequently shuts down its automation instances as part of its business logic, you can make the token’s effect persist by saving and reloading the profile’s data.
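Detecting a captcha inside an automated browser usually comes down to inspecting the page you landed on. The helper below is a rough sketch of that idea. The function name is hypothetical, and the default marker strings are illustrative assumptions rather than an official list, so replace them with whatever the challenge page of your target actually displays.

```typescript
// Hypothetical markers; these strings are illustrative assumptions and should
// be replaced with whatever the challenge page of your target actually shows.
const DEFAULT_MARKERS = ["just a moment", "attention required"];

// Return true when the page title or HTML contains any of the markers,
// compared case-insensitively.
function looksLikeChallenge(
  pageTitle: string,
  pageHtml: string,
  markers: string[] = DEFAULT_MARKERS,
): boolean {
  const haystack = `${pageTitle}\n${pageHtml}`.toLowerCase();
  return markers.some((marker) => haystack.includes(marker.toLowerCase()));
}
```

After navigating, read the page title and HTML with your automation tool, and spend a token on the in-browser fetch only when this check returns true.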
Sum up
To sum up, the Privacy Pass protocol offers a reliable way to bypass protection, provided that you generate enough tokens for your application and refill them when necessary. Even though the process is not fully automatic, keep in mind that it is free and quite simple to implement.
If you want a more practical guide on the topic with various scraping tools, leave a comment and let us know what you think.