
Captcha solving with Java and why you should avoid it

In this blog post, we show how you can solve [Re]captcha with Java and a third-party API, and why you should probably avoid captchas in the first place.
For the Python code (+ captcha API) see that post.

The post author is Kevin Sahin from ScrapingNinja.co.

Captcha solving

“Completely Automated Public Turing test to tell Computers and Humans Apart” is what captcha stands for. Captchas are used to prevent bots from accessing and performing actions on websites or applications.

The most widely used captcha mechanism is Google ReCaptcha v2, which is why we are going to see how to “break” these captchas.

The only thing the user has to do is click inside the checkbox. The service then analyzes many factors to determine whether it is a real user or a bot. We don’t know exactly how this is done, since Google didn’t disclose it for obvious reasons, but there has been a lot of speculation:

  • Clicking behavior analysis: “where did the user click?”, cursor acceleration, etc.
  • Browser fingerprinting
  • Click location history (do you always click straight on the center, or is it random, like a normal user?)
  • Browser history and cookies

For older, text-based captchas, Optical Character Recognition and recent machine-learning frameworks offer excellent solving accuracy (sometimes better than humans), but for ReCaptcha v2 the easiest and most accurate way is to use third-party services.

We have tested captcha services to solve ReCaptcha v2. You can see the results here

Many companies offer captcha-solving APIs that use real human operators to solve captchas. I don’t recommend one in particular, but I have found 2captcha.com easy to use and reliable, though relatively expensive ($2.99 for 1000 ReCaptchas).

Under the hood, these APIs need the specific site-key and the target website URL; with this information they are able to get a human operator to solve the captcha.

Technically, the ReCaptcha challenge is an iframe with some magical JavaScript code and a hidden input. When you “solve” the challenge, by clicking or solving an image problem, the hidden input is filled with a valid token.

It is this token that interests us, and the 2captcha API will send it back. We then need to fill the hidden input with this token and submit the form.

The first thing you will need to do is to create an account on 2captcha.com and add some funds. You will then find your API key on the main dashboard.

We have set up an example webpage with a simple form with one input and a Recaptcha to solve:

We are going to use Chrome in headless mode to post this form and HtmlUnit to make the API calls to 2captcha (we could use any other HTTP client for this). Now let’s code.

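// Imports used by the snippets in this post
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import java.util.logging.Level;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

final String API_KEY = "YOUR_API_KEY";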
final String API_BASE_URL = "http://2captcha.com/";
final String BASE_URL = "https://www.javawebscrapingsandbox.com/captcha";
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
// replace with your own chromedriver path
final String chromeDriverPath = "/usr/local/bin/chromedriver";
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200", "--ignore-certificate-errors", "--silent");
options.addArguments("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36");
WebDriver driver = new ChromeDriver(options);
driver.get(BASE_URL);

The code above is boilerplate that instantiates both WebDriver and WebClient, along with the API URL and key.

Then we have to call the 2captcha API with the site-key, your API key, and the website URL, as documented here. The API is supposed to respond with the following format: OK|123456.

String siteId = "";
try {
    // findElement throws if the div is missing, so it belongs inside the try block
    WebElement elem = driver.findElement(By.xpath("//div[@class='g-recaptcha']"));
    siteId = elem.getAttribute("data-sitekey");
} catch (Exception e) {
    System.err.println("Captcha's div cannot be found or missing attribute data-sitekey");
    e.printStackTrace();
}
String QUERY = String.format("%sin.php?key=%s&method=userrecaptcha&googlekey=%s&pageurl=%s&here=now", API_BASE_URL, API_KEY, siteId, BASE_URL);
Page response = client.getPage(QUERY);
String stringResponse = response.getWebResponse().getContentAsString();
String jobId = "";
if (!stringResponse.startsWith("OK|")) {
    throw new Exception("Error with 2captcha.com API, received : " + stringResponse);
} else {
    jobId = stringResponse.split("\\|")[1];
}
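
One detail worth hardening in this request is that the page URL is interpolated into the query string as-is. Below is a minimal sketch, using only the JDK, that URL-encodes each parameter value and parses the OK|ID response strictly. The class and method names (TwoCaptchaRequest, encode, buildSubmitUrl, parseJobId) are hypothetical helpers for illustration, not part of 2captcha, HtmlUnit, or Selenium.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class TwoCaptchaRequest {

    // Percent-encodes a query parameter value so that characters such as ':' '/' '?' '&'
    // inside the page URL cannot be mistaken for parts of the API query string.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every JVM, so this should never happen
            throw new IllegalStateException(e);
        }
    }

    // Builds the in.php request (key, method, googlekey, pageurl) with every value encoded.
    static String buildSubmitUrl(String apiBaseUrl, String apiKey, String siteKey, String pageUrl) {
        return String.format("%sin.php?key=%s&method=userrecaptcha&googlekey=%s&pageurl=%s",
                apiBaseUrl, encode(apiKey), encode(siteKey), encode(pageUrl));
    }

    // Extracts the job ID from a response of the form OK|123456; anything else is treated as an error.
    static String parseJobId(String response) {
        String trimmed = response.trim();
        if (!trimmed.startsWith("OK|") || trimmed.length() == 3) {
            throw new IllegalStateException("Unexpected 2captcha response: " + trimmed);
        }
        return trimmed.substring(3);
    }

    public static void main(String[] args) {
        System.out.println(buildSubmitUrl("http://2captcha.com/", "YOUR_API_KEY", "SITE_KEY",
                "https://www.javawebscrapingsandbox.com/captcha"));
        System.out.println(parseJobId("OK|123456")); // prints 123456
    }
}
```

The string returned by buildSubmitUrl can be passed to client.getPage(...) exactly like QUERY, and parseJobId can replace the split on the pipe character.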

Now that we have the job ID, we have to poll another API route until the ReCaptcha is solved and we get the token, as explained in the documentation. It returns CAPCHA_NOT_READY while the captcha is not yet solved, and the same OK|TOKEN format once it is ready:

boolean captchaSolved = false;
while (!captchaSolved) {
    response = client.getPage(String.format("%sres.php?key=%s&action=get&id=%s", API_BASE_URL, API_KEY, jobId));
    if (response.getWebResponse().getContentAsString().contains("CAPCHA_NOT_READY")) {
        Thread.sleep(3000);
        System.out.println("Waiting for 2Captcha.com ...");
    } else {
        captchaSolved = true;
        System.out.println("Captcha solved !");
    }
}
String captchaToken = response.getWebResponse().getContentAsString().split("\\|")[1];

Note that, in my experience, solving can take up to a minute. It is a good idea to implement a safeguard/timeout in the loop, because on rare occasions the captcha never gets solved.
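
Such a safeguard could look roughly like the sketch below. It is only an illustration under a few assumptions: CaptchaPoller and waitForToken are hypothetical names, the poll supplier stands in for the res.php call, and any response that is neither OK|TOKEN nor CAPCHA_NOT_READY is treated as an error.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class CaptchaPoller {

    // Polls until the response starts with "OK|" or the timeout elapses.
    // 'poll' is a stand-in for the res.php call and returns the raw response body.
    static String waitForToken(Supplier<String> poll, Duration timeout, Duration interval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String body = poll.get().trim();
            if (body.startsWith("OK|")) {
                return body.substring(3); // everything after "OK|" is the token
            }
            if (!body.contains("CAPCHA_NOT_READY")) {
                throw new IllegalStateException("2captcha returned an error: " + body);
            }
            // Give up if another sleep would take us past the deadline
            if (deadline - System.nanoTime() < interval.toNanos()) {
                throw new TimeoutException("Captcha not solved within " + timeout.toMillis() + " ms");
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake API for demonstration: not ready twice, then solved
        String[] replies = {"CAPCHA_NOT_READY", "CAPCHA_NOT_READY", "OK|TOKEN"};
        int[] calls = {0};
        String token = waitForToken(() -> replies[calls[0]++],
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(token); // prints TOKEN
    }
}
```

In the code above, the supplier would wrap the client.getPage(...) call and rethrow its checked IOException as an unchecked exception; a timeout of a couple of minutes with a 3-second interval would match the timings mentioned here.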

Now that we have the magic token, we just have to find the hidden input, fill it with the token, and submit the form.

The Selenium API cannot fill hidden inputs, so we have to manipulate the DOM to make the input visible, fill it, and hide it again before clicking the submit button:

JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'block';");
WebElement textarea = driver.findElement(By.xpath("//textarea[@id='g-recaptcha-response']"));
textarea.sendKeys(captchaToken);
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'none';");
driver.findElement(By.id("name")).sendKeys("Kevin");
driver.findElement(By.id("submit")).click();
if (driver.getPageSource().contains("your captcha was successfully submitted")) {
    System.out.println("Captcha successfully submitted!");
} else {
    System.out.println("Error while submitting captcha");
}

And that’s it :-). You can find the whole Java code here.

Generally, websites don’t use ReCaptcha for each HTTP request, but only for suspicious ones or for specific actions such as account creation. You should always try to figure out whether the website is showing you a [Re]captcha because you made too many requests with the same IP address or the same user-agent, or because you made too many requests per second.
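
If the cause is too many requests per second, spacing requests out is often enough. The sketch below shows one simple way to enforce a minimum delay between requests; RequestThrottle is a hypothetical helper, and the interval you pick is an assumption to tune per website, not a figure from this post.

```java
final class RequestThrottle {
    private final long minIntervalNanos;
    private long nextAllowedNanos; // System.nanoTime() value before which no request should start
    private boolean started = false;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
    }

    // Blocks until at least the minimum interval has passed since the previous call returned.
    synchronized void acquire() throws InterruptedException {
        if (started) {
            long waitNanos;
            // Loop in case the sleep returns slightly early
            while ((waitNanos = nextAllowedNanos - System.nanoTime()) > 0) {
                Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
            }
        }
        started = true;
        nextAllowedNanos = System.nanoTime() + minIntervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        RequestThrottle throttle = new RequestThrottle(1000); // at most one request per second
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("Request " + i + " may be sent now");
        }
    }
}
```

Calling throttle.acquire() right before each driver.get(...) or client.getPage(...) keeps the request rate below the chosen limit.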

As you can see, “ReCaptcha solving” is quite slow, so the best way to “solve” this problem is to avoid captchas in the first place! To do so, we recommend the article How to scrape websites without getting blocked. Check it out!

Reducing the chance of getting a captcha is better than solving it: it is cheaper and much faster. Sometimes that is not possible, because the web page shows a captcha 100% of the time, but in many cases you can bypass it by being smart with your scrapers.
