Solve ReCaptcha with Selenium (python)

I’ve already written about how the new No CAPTCHA ReCaptcha works, and even had some success breaking it with iMacros browser automation. But the latest scraping tools are, for the most part, driven by Python, so now I want to try the same experiment with Selenium + Python.

Disclaimer. After we published the post, Google made the reCaptcha drastically more complex.

  1. Google engineers have removed the iframe’s name attribute that the original code relied on.
  2. They’ve changed the HTML markup of the image puzzle. The table layout now sits inside a block layout, and the table ranges from 3×3 to 6×6 tiles. Random clicking is therefore less likely to succeed, and the scripted solution takes much longer. The probability of solving a 4×4 puzzle (with 2-3 tiles to be checked) in a single attempt is now 2/16 * 1/15 * 100% ≈ 0.8%, roughly a third of the original 2.8%.
  3. Google has also set a session timeout, so after a certain time the reCaptcha solution session expires.
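
The single-attempt figures quoted in item 2 can be reproduced with a few lines of Python. This is only a sketch: it assumes that exactly two tiles must be selected and that both are picked uniformly at random without repetition. The function name single_attempt_probability is made up for this illustration.

```python
from fractions import Fraction


def single_attempt_probability(grid_size, target_tiles=2):
    """Chance that random, non-repeating clicks hit every target tile.

    Assumes `target_tiles` correct tiles on a grid_size x grid_size table
    and exactly `target_tiles` clicks.
    """
    total = grid_size * grid_size
    p = Fraction(1)
    for picked in range(target_tiles):
        # First click: target_tiles good tiles out of total;
        # second click: one fewer good tile out of one fewer tile, etc.
        p *= Fraction(target_tiles - picked, total - picked)
    return p


for size in (3, 4):
    p = single_attempt_probability(size)
    print("{0}x{0}: {1} = {2:.2%}".format(size, p, float(p)))
```

Under these assumptions a 3×3 grid gives 1/36 (about 2.8%) and a 4×4 grid gives 1/120 (about 0.8%).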

In practice we keep improving the code to beat the reCaptcha, and I’ve now updated the post with the new code!

Brute force works

The brute force approach works best for cracking this remotely supplied (third-party) CAPTCHA. In a previous post I mentioned that the client (the website with the CAPTCHA) does not control how many picture puzzle challenges the user has to take before passing the reCaptcha. So if one iterates over the image puzzles, randomly checking pictures and submitting the result to the CAPTCHA provider (Google), the probability of solving it with a single submission is 2.8%. This value was valid for the 2015 reCaptcha; since the picture puzzle was made more complex, the probability has dropped to less than 1%.
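
To get a feel for what these percentages mean in terms of submissions, here is a rough sketch. It treats every submission as an independent trial with the same success probability, which is an assumption (the provider may change puzzle difficulty between attempts). The helper names expected_submissions and success_within are hypothetical.

```python
def expected_submissions(p):
    """Mean number of submissions until the first success (geometric model)."""
    return 1.0 / p


def success_within(p, attempts):
    """Probability of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p) ** attempts


for label, p in (("2015 puzzle, 3x3", 1 / 36), ("newer puzzle, 4x4", 1 / 120)):
    print(label)
    print("  expected submissions:", round(expected_submissions(p)))
    print("  chance within 129 tries: {:.0%}".format(success_within(p, 129)))
```

With a single-attempt probability of 1/36 the model predicts about 36 submissions on average; with 1/120 it predicts about 120.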

Read more of the theoretical part here.

So we need to program Selenium to automate the moves and clicks on the right reCaptcha elements: tiles (pictures), buttons and the checkbox (which is in fact just an HTML block element).

Let’s get started.

Code in pieces

You can jump directly to the whole updated code (incl. all the imports), but here it is broken down into sections.

This first piece launches a basic Firefox browser instance, loads the page from a URL, saves the main window handle mainWin for later use and identifies the main captcha frame. We identify the iframe by its tag name, so the following expression selects the first iframe on the page (the full snippet below then switches the driver into it):

# select the first iFrame on the page
driver.find_elements_by_tag_name("iframe")[0]

If your page contains more than just the reCaptcha frames, you will need to work out the frame indexes yourself. Tip: use iMacros for that.

start = time()	 
url='...'
driver = webdriver.Firefox()
driver.get(url)

mainWin = driver.current_window_handle  

# move the driver to the first iFrame 
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[0])

Note: driver.switch_to_frame() is deprecated. You can replace it with driver.switch_to.frame().

Now we click the checkbox, wait until the picture puzzle is loaded (by reCaptcha’s API) and jump to the second frame, which contains the puzzle itself.

# *************  locate CheckBox  **************
CheckBox = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))
        ) 

# *************  click CheckBox  ***************
wait_between(0.5, 0.7)  
# making click on captcha CheckBox 
CheckBox.click() 
 
#***************** back to main window *********************************
driver.switch_to_window(mainWin)  

wait_between(2.0, 2.5) 

# ******** switch to the second iframe by tag name ************
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])

Next, we keep iterating until the reCaptcha’s picture puzzle is solved. The write_stat procedure writes the info on each successful run (loop count and elapsed seconds) to a CSV file for further statistical analysis.

i=1
while i<130:
	print('\n\r{0}-th loop'.format(i))
	# ******** check if checkbox is checked at the 1st frame ***********
	driver.switch_to_window(mainWin)   
	WebDriverWait(driver, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME , 'iframe'))
        )  
	wait_between(1.0, 2.0)
	if check_exists_by_xpath('//span[@aria-checked="true"]'): 
		import winsound  # Windows-only standard library module
		winsound.Beep(400,1500)
		write_stat(i, round(time()-start) - 1 ) # saving results into stat file
		break 
		
	driver.switch_to_window(mainWin)   
	# ********** To the second frame to solve pictures *************
	wait_between(0.3, 1.5) 
	driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1]) 
	solve_images(driver)
	i=i+1

# ***** main procedure to identify and submit picture solution	
def solve_images(driver):	
	WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"rc-imageselect-target"))
        ) 		
	dim = dimention(driver)	
	# ****************** check if there is a clicked tile ******************
	if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
		rand2 = 0
	else:  
		rand2 = 1 

	# wait before click on tiles 	
	wait_between(0.5, 1.0)		 
	# ****************** click on a tile ****************** 
	tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH ,   '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim )))) 
		)   
	tile1.click() 
	if (rand2):
		try:
			driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim))).click()
		except NoSuchElementException:          		
		    print('\n\r No Such Element Exception for finding 2nd tile')
   
	 
	#****************** click on submit button ****************** 
	driver.find_element_by_id("recaptcha-verify-button").click()
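
The dimention helper used above (defined in the whole code) reads the grid size from the last character of the puzzle table’s class attribute. The sketch below isolates just that parsing step so it can be tried without a browser. The class names in the example (such as "rc-imageselect-table-44") are assumptions about the markup, and parse_grid_size is a hypothetical name. Note that in the original helper a non-digit last character raises ValueError inside int(), so the "3 by default" fallback only applies when the digit is 0; the sketch falls back to the default in both cases.

```python
def parse_grid_size(class_attr, default=3):
    """Return the grid dimension encoded as the last character of a class name.

    Falls back to `default` when the attribute is missing, empty,
    or does not end in a digit from 1 to 9.
    """
    if not class_attr:
        return default
    last = class_attr.strip()[-1:]
    if last and last in "123456789":
        return int(last)
    return default


print(parse_grid_size("rc-imageselect-table-44"))  # assumed class name
print(parse_grid_size(None))
```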

The whole code

import re, csv
from time import sleep, time
from random import uniform, randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException    

def write_stat(loops, time):
	with open('stat.csv', 'a', newline='') as csvfile:
		spamwriter = csv.writer(csvfile, delimiter=',',
								quotechar='"', quoting=csv.QUOTE_MINIMAL)
		spamwriter.writerow([loops, time])  	 
	
def check_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True
	
def wait_between(a,b):
	rand=uniform(a, b) 
	sleep(rand)
 
def dimention(driver): 
	d = int(driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table').get_attribute("class")[-1]);
	return d if d else 3  # dimention is 3 by default
	
# ***** main procedure to identify and submit picture solution	
def solve_images(driver):	
	WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"rc-imageselect-target"))
        ) 		
	dim = dimention(driver)	
	# ****************** check if there is a clicked tile ******************
	if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
		rand2 = 0
	else:  
		rand2 = 1 

	# wait before click on tiles 	
	wait_between(0.5, 1.0)		 
	# ****************** click on a tile ****************** 
	tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH ,   '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim )))) 
		)   
	tile1.click() 
	if (rand2):
		try:
			driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim))).click()
		except NoSuchElementException:          		
		    print('\n\r No Such Element Exception for finding 2nd tile')
   
	 
	#****************** click on submit button ****************** 
	driver.find_element_by_id("recaptcha-verify-button").click()

start = time()	 
url='...'
driver = webdriver.Firefox()
driver.get(url)

mainWin = driver.current_window_handle  

# move the driver to the first iFrame 
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[0])

# *************  locate CheckBox  **************
CheckBox = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))
        ) 

# *************  click CheckBox  ***************
wait_between(0.5, 0.7)  
# making click on captcha CheckBox 
CheckBox.click() 
 
#***************** back to main window **************************************
driver.switch_to_window(mainWin)  

wait_between(2.0, 2.5) 

# ************ switch to the second iframe by tag name ******************
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])  
i=1
while i<130:
	print('\n\r{0}-th loop'.format(i))
	# ******** check if checkbox is checked at the 1st frame ***********
	driver.switch_to_window(mainWin)   
	WebDriverWait(driver, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME , 'iframe'))
        )  
	wait_between(1.0, 2.0)
	if check_exists_by_xpath('//span[@aria-checked="true"]'): 
		import winsound  # Windows-only standard library module
		winsound.Beep(400,1500)
		write_stat(i, round(time()-start) - 1 ) # saving results into stat file
		break 
		
	driver.switch_to_window(mainWin)   
	# ********** To the second frame to solve pictures *************
	wait_between(0.3, 1.5) 
	driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1]) 
	solve_images(driver)
	i=i+1
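
Once a few runs have been logged, the rows that write_stat appends (loop count, elapsed seconds) can be summarised. The following is a minimal sketch, not part of the original script; summarize_stats is a hypothetical helper, and it assumes every row has exactly the two numeric columns written above.

```python
import csv
from statistics import mean


def summarize_stats(lines):
    """Summarise rows of "loops,seconds" as produced by write_stat.

    `lines` can be an open text file or any iterable of CSV lines.
    """
    rows = [(int(r[0]), float(r[1])) for r in csv.reader(lines) if r]
    if not rows:
        return None
    return {
        "runs": len(rows),
        "mean_loops": mean(r[0] for r in rows),
        "mean_seconds": mean(r[1] for r in rows),
    }


# Example with in-memory lines; for the real file use:
#   with open('stat.csv', newline='') as f: summarize_stats(f)
print(summarize_stats(["12,95", "30,160"]))
```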

Timeout limits

reCaptcha is now sensitive to session timeouts. If the brute force fails to solve it within a certain period of time, reCaptcha’s JS algorithm deliberately stops any further interaction:
– it shows the reCaptcha as ticked;
– the Google server returns {"success":"false"} upon siteverify (point 3: ‘decode the response’).
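
On the server side, the siteverify reply is a small JSON document. Below is a hedged sketch of how such a reply could be interpreted. The function name is_verified is hypothetical; the sketch accepts both a JSON boolean and the string form quoted above, and it reads the optional "error-codes" list, which for an expired session is assumed to contain an entry such as "timeout-or-duplicate".

```python
import json


def is_verified(response_text):
    """Return (success, error_codes) parsed from a siteverify JSON reply."""
    data = json.loads(response_text)
    success = data.get("success")
    if isinstance(success, str):
        # Tolerate the string form "true"/"false" as quoted in the post.
        success = success.strip().lower() == "true"
    return bool(success), data.get("error-codes", [])


print(is_verified('{"success": false, "error-codes": ["timeout-or-duplicate"]}'))
```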

I’ve found the reCaptcha timeout to be approx. 2.5-3 min.
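
Given that time limit, it makes little sense to keep looping once the session has expired. The sketch below shows one way to cap a retry loop by both attempt count and elapsed time. It is an illustration only: run_with_deadline and its parameters are hypothetical, the 150-second budget simply mirrors the 2.5 min figure above, and attempt stands for any callable that returns True on success.

```python
from time import monotonic


def run_with_deadline(attempt, budget_seconds=150.0, max_attempts=130,
                      clock=monotonic):
    """Call attempt() until it succeeds, time runs out, or attempts run out.

    Returns a (status, attempts_made) tuple where status is one of
    "solved", "timed_out" or "gave_up".
    """
    deadline = clock() + budget_seconds
    for n in range(1, max_attempts + 1):
        if clock() >= deadline:
            return ("timed_out", n - 1)
        if attempt():
            return ("solved", n)
    return ("gave_up", max_attempts)


# Stand-in attempt that succeeds on the third call.
calls = iter([False, False, True])
print(run_with_deadline(lambda: next(calls)))
```

Passing the clock as a parameter keeps the function easy to test with a fake time source.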

Proxy usage

Some info from Lanre (our reader and contributor):

High-load sites challenge

While working on a site where I was filling in a form for a few thousand people, I also discovered that this new captcha concept is linked to the IP address: as you progress through the iterations, your chances of getting verified decrease, until after 50-80 iterations your session times out and the captcha is no longer valid.

Dynamic IP

I wasn’t able to resolve the captchas successfully, I guess because my ISP assigns me a static IP. However, whenever I used DHCP on another ISP, I was able to resolve the captchas, but only once; then I needed to restart my router before any other successful resolution.

So a question arose: is a proxy the way to resolve this challenge?
I think it’s worth having a new IP each time you reach a site with reCaptcha, so that Google has no negative history of brute force attempts to solve the reCaptcha. Otherwise the Google reCaptcha algorithm accumulates negative solution/timeout info for the particular IP and makes the following picture puzzles more complex. The proxy approach does not eliminate reCaptcha’s bot suspicion, though. Do not forget to spoof the user-agent string when using a proxy.

Conclusion

The remotely managed puzzle CAPTCHA turned out to be vulnerable to brute force, yet since Google enhanced it, it is no longer that simple to crack. There is a poor correlation between the number of attempts the user sends (submitting the form with the CAPTCHA) and the number of picture puzzle challenges set to the user. So browser automation brute force managed to break a seemingly deadlocked reCaptcha. Because of the increased puzzle complexity, however, the brute force might fail to solve it within the reCaptcha session time, so the session timeout limits the usefulness of this kind of solution. The success rate now (Apr. 2016) is ~30%. The average timeout is 3 min.

Comments or algorithm improvement suggestions welcome!

53 replies on “Solve ReCaptcha with Selenium (python)”

This might be caused by a bad network connection to the Google server, or the Google server might be overloaded at times, so Selenium throws a timeout error. Try increasing the waiting times: wait_between(0.3, 0.8)

I get an error thrown on the line “frameName1 = get_captcha_frames(driver.page_source)[0]”

It says list index is out of range. I’m guessing this is because the regex is not working?

I am having an issue with the line: “frameName1 = get_captcha_frames(driver.page_source)[0]”

I guess this is because the regex is not working properly and thus is generating an empty list?

How about we run JS code in Selenium to insert name and id attributes into the iframe tags: document.getElementsByTagName("iframe")[i].setAttribute("name", "name1"); document.getElementsByTagName("iframe")[i].setAttribute("id", "name1");

Then we switch to them as usual through driver.switch_to_frame("name1")

Can you please explain this concept in more detail, especially how I can add a name attribute to an iframe with Selenium using Python? I was able to get the 1st iframe, but switching between the main frame and the second iframe is a challenge for me. I am able to click the checkbox using frameName1 = regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[0]) instead of frameName1 = get_captcha_frames(regDriver.page_source)[0]
Thanks

It seems to me Google has complicated its reCaptcha code by removing the iframe’s name attribute. So fetching the iframe name with a regex is not possible any more. This also makes it hard to switch between frames.

How with Selenium can I add a name attribute to an iframe using python?

As far as I know, Selenium can execute JS on a page, so this Selenium-driven JS might be used to add or remove iframe attributes, including names. After that, one can tell the different in-page iframes apart.

I plan to redo the code in a near future.

Hi,
Thanks, I will be waiting for the new version; however, I finally got it to work. The concept is: any time you want to switch frames, just use xpath (or the tag name) to return all iframes in the page and then use the index to access the one you need. For this, you have to go through the page source to find out how many iframes there are and their positions each time you intend to access one. It’s like accessing the iframes pretty much manually (Google actually tried to keep regexes out in order to make a safer reCaptcha concept).
As in my previous post, regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[0]) will access the first iframe and regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[1]) the next. But of course, as I said, you have to be sure which iframe is which every time your code accesses one.

Hi, while working on a site where I was filling in a form for a few thousand people, I also discovered that this new captcha concept is linked to the IP address: as you progress through the iterations, your chances of getting verified decrease, until after 50-80 iterations your session times out and the captcha is no longer valid. Perhaps you might want to look into this situation in your update.

Thanks

…above 50-80 iterations and by this time; your session is timing out and the captcha is no longer valid

A session timeout did not happen to me, even with a huge number of iterations. Can you share any such cases?

Hi Igor, I have sent you a mail with the code I composed attached, so you can test it and make possible corrections and additions to resolve the issue I am having: the session timeout on the site after the first successful captcha resolution. Thanks

It is quite difficult now. I managed to develop a model giving me 96% accuracy on their audio captcha, but even if I enter the correct solution ten times via Selenium, it continues asking for new trials… If I enter the solution manually it works => mouse detection is at play

The only way now will be to dig deeply in it:
https://github.com/neuroradiology/InsideReCaptcha

Anyone interested?

What makes it 30% efficient? Is the brute force component 30%?

Aaron R. Phalen

Now reCaptcha is session timeout sensitive. So if the brute force fails to solve it within a certain period of time, reCaptcha’s JS algorithm stops any interactions:
– makes reCaptcha ticked up
– google server returns {"success":"false"} upon siteverify (point 3: ‘decode the response’).

I have been interested in this solution as I work as a security consultant for a company. I see from the solution description that the timeout of the brute force is an issue; however, I was wondering about using a service like DeathByCaptcha’s API to receive a solution to the puzzle and apply it to the ReCaptcha iframe DOM. This would significantly reduce the solution time and avoid the timeout issue, as it is superior to the brute force method.

An additional thought is a solution for when ReCaptcha states that it will require multiple submissions. Thoughts?

Thanks!

Sure, you can use a service like DeathByCaptcha’s API to receive a solution to the puzzle, and it is likely to be quicker than the brute force approach. I should mention that the Google reCaptcha algorithm might require several successful puzzle solutions [in a row] before it validates a solver.

Thanks for the reply. Any ideas on how to incorporate the solved coordinates from DeathByCaptcha, i.e. an array of the coordinates that need to be clicked, given relative to the top-left vertex of the captcha image? I am thinking of an algorithm to transform the coordinates into the needed selector path, or perhaps an offset click event in Selenium. Ideally, I am looking to research the latter. Your thoughts?

Best,
Aaron

I have found the above solution does not work, as click() on the thumbnail captcha image is not triggering for me… I tried adding a class to the td to mark it as checked; however, this was not recognized upon clicking the verify button. Thoughts?

Hello, for the past few days, when I try to click the checkbox through webdriver, I keep getting a CAPTCHA with “disappearing” pictures (Click verify once there are none left.) every time!
Does anyone know what to do? Any thoughts? Thanks

Can you tell me what’s wrong here?
When I ran the code in Python, the screen showed the following.
Can you help me?
Traceback (most recent call last):
File “/Users/xxx/Desktop/Test.py”, line 109, in
solve_images(driver)
File “/Users/xxx/Desktop/Test.py”, line 35, in solve_images
EC.presence_of_element_located((By.ID ,”rc-imageselect-target”))
File “/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/support/wait.py”, line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
at FirefoxDriver.prototype.findElementInternal_ (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/driver-component.js:10770)
at FirefoxDriver.prototype.findElement (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/driver-component.js:10779)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
at DelayedCommand.prototype.execute/< (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)

This code doesn’t work again.
I’m getting error:

Traceback (most recent call last):
File “…/recaptcha.py”, line 96, in
driver.switch_to_frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range

as if there weren’t 2 frames after clicking the checkbox.
Any idea how to solve this problem?
I was testing here: http://patrickhlauke.github.io/recaptcha/

Traceback (most recent call last):
File “…./recaptcha.py”, line 87, in
driver.switch_to.frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range

I did change “driver.switch_to_frame” to “driver.switch_to.frame”, but still have the same error.

Me too …

Traceback (most recent call last):
File “D:\Python\shell.py”, line 87, in
driver.switch_to_frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range

Hi,
You can try to use find_element_by_xpath:
driver.switch_to.frame(driver.find_element_by_xpath(“//iframe[@title=’recaptcha challenge’]”))

It would be great if a solution could be found (in connection with the latest changes). I’m also looking for an option to pass the received captcha to specialised sites for solving.

Thanks for sharing the brute force method. I have modified the Python code to test our website (https://www.direct2drive.com/#!/pc).

# reference http://scraping.pro/recaptcha-solve-selenium-python/
# The brute force approach works best for cracking this remotely supplied (by 3rd party) CAPTCHA.
import re, csv
from time import sleep, time
from random import uniform, randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

def write_stat(loops, time):
    with open('stat.csv', 'a', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow([loops, time])

def check_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True

def wait_between(a, b):
    rand = uniform(a, b)
    sleep(rand)

def dimention(driver):
    d = int(driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table').get_attribute("class")[-1]);
    return d if d else 3 # dimention is 3 by default

# ***** main procedure to identify and submit picture solution
def solve_images(driver):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "rc-imageselect-target")))
    dim = dimention(driver)
    # ****************** check if there is a clicked tile ******************
    if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
        rand2 = 0
    else:
        rand2 = 1

    # wait before click on tiles
    wait_between(0.5, 1.0)
    # ****************** click on a tile ******************
    tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(
            randint(1, dim), randint(1, dim))))
    )
    tile1.click()
    if (rand2):
        try:
            driver.find_element_by_xpath(
                '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim),
                                                                                          randint(1, dim))).click()
        except NoSuchElementException:
            print('\n\r No Such Element Exception for finding 2nd tile')

    # ****************** click on submit button ******************
    driver.find_element_by_id("recaptcha-verify-button").click()

print("start...")
start = time()
# go to D2D website
url = 'https://www.direct2drive.com/#!/pc'
driver = webdriver.Firefox()
driver.get(url)

# open login/sign up page
login_btn = driver.find_element_by_xpath("//ul[@id='navMenu']/li[2]/div/div/a")
login_btn.click()
sleep(3)

mainWin = driver.current_window_handle

# move the driver to the first iFrame
element = driver.find_element_by_xpath("//div[@id='captcha']/div/div/iframe")
driver.switch_to.frame(element)

# ************* locate CheckBox **************
btn = driver.find_element_by_xpath("//*[@id='recaptcha-anchor-label']")
btn.click()
sleep(1)

# ***************** back to main window **************************************
# driver.switch_to.window(mainWin)
driver.switch_to.default_content()

wait_between(2.0, 2.5)

# ************ switch to the second iframe by tag name ******************
driver.switch_to.frame(driver.find_element_by_xpath("//iframe[@title='recaptcha challenge']"))

i = 1
while i < 130:
    print('\n\r{0}-th loop'.format(i))
    # # ******** check if checkbox is checked at the 1st frame ***********
    # driver.switch_to.window(mainWin)
    # WebDriverWait(driver, 10).until(
    #     EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, 'iframe'))
    # )
    # wait_between(1.0, 2.0)
    # if check_exists_by_xpath('//span[@aria-checked="true"]'):
    #     import winsound
    #
    #     winsound.Beep(400, 1500)
    #     write_stat(i, round(time() - start) - 1) # saving results into stat file
    #     break
    #
    # driver.switch_to.window(mainWin)
    # # ********** To the second frame to solve pictures *************
    # wait_between(0.3, 1.5)
    # driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[1])
    solve_images(driver)
    i = i + 1
