I’ve already written about how the new No CAPTCHA reCAPTCHA works, and even had some success breaking it with iMacros browser automation. But the latest scraping tools are, for the most part, driven by Python, so now I want to try the same experiment with Selenium + Python.
- Google engineers have removed the iframe’s name attribute that we used to rely on.
- They’ve changed the HTML markup of the image puzzle. The table layout now sits inside a block layout, and the table ranges from 3×3 to 6×6 boxes. Therefore the probability of a random-click solution has decreased, and the scripted solution time has drastically increased. The probability of solving a 4×4 table puzzle (with 2-3 tiles to be checked; 2 are assumed here) in a single attempt is now 2/16 × 1/15 × 100% ≈ 0.8%, roughly 3.5 times lower than the original 2.8%.
- Google has also set a session timeout limit, so after a certain time the reCaptcha solution session times out.
In practice, we are still working to improve the code to beat reCaptcha. I’ve now updated the post with the new code!
Brute force works
The brute force approach works best for cracking this remotely supplied (by a 3rd party) CAPTCHA. In a previous post I mentioned that the client (the website with the CAPTCHA) does not control how many [picture puzzle] challenges the user has to take before passing the reCaptcha. So if one iterates over the image puzzles, randomly checking pictures and submitting the result to the CAPTCHA provider (Google), the probability of solving it with a single submission is 2.8%. This value was valid for the 2015 reCaptcha, but since the picture puzzle became more complex, the probability has dropped to less than 1%!
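The single-submission figures quoted in this post can be checked with a few lines of Python. This is only a sketch: it assumes the puzzle has exactly `targets` correct tiles and that the script clicks exactly that many distinct tiles at random. The helper name `random_guess_probability` is made up for illustration.

```python
from math import comb


def random_guess_probability(grid_size: int, targets: int = 2) -> float:
    """Chance that `targets` random distinct clicks hit exactly the right tiles."""
    tiles = grid_size * grid_size
    # Only one subset of tiles is correct out of all subsets of that size,
    # which equals 2/9 * 1/8 for a 3x3 grid with two correct tiles.
    return 1 / comb(tiles, targets)


if __name__ == "__main__":
    for size in (3, 4):
        p = random_guess_probability(size)
        print(f"{size}x{size}: {p * 100:.1f}% per submission, "
              f"about {round(1 / p)} submissions on average")
```

Under these assumptions a 3×3 grid gives about 2.8% and a 4×4 grid about 0.8%, which matches the numbers used in the post.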
Read more of the theoretical part here.
So we need to program Selenium to automate moves and clicks to fetch the right reCaptcha elements: tiles (pictures), buttons, and the checkbox (which in turn is just an HTML block element).
Let’s get started.
Code in pieces
You can jump directly to the whole updated code (incl. all the imports), but here it is broken down into sections.
This first code piece starts a basic Firefox browser instance, grabs content from a URL, saves the main window handle mainWin for further use, and identifies the main captcha frame. We identify the iframe by tag name, so the following code finds the first iframe:
# move the driver to the first iFrame
driver.find_elements_by_tag_name("iframe")[0]
If your page contains more than just the reCaptcha frames, you should research which indexes the reCaptcha frames have. Tip: use iMacros for that.
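One way to research the frame indexes without extra tooling is to list every iframe in the page source together with its attributes. The sketch below uses only the standard library. `list_iframes` is a hypothetical helper, and it assumes you pass it the HTML obtained from `driver.page_source`.

```python
from html.parser import HTMLParser


class IframeLister(HTMLParser):
    """Collects the attributes of every <iframe> tag in document order."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.frames.append(dict(attrs))


def list_iframes(page_source):
    """Return (index, attributes) pairs for all iframes found in the HTML."""
    parser = IframeLister()
    parser.feed(page_source)
    return list(enumerate(parser.frames))


if __name__ == "__main__":
    sample = '<iframe src="ads.html"></iframe><iframe title="recaptcha widget"></iframe>'
    for index, attrs in list_iframes(sample):
        print(index, attrs)
```

The printed index should correspond to the position used in `find_elements_by_tag_name("iframe")[index]`, provided the page source reflects the current DOM.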
start = time()
url='...'
driver = webdriver.Firefox()
driver.get(url)
mainWin = driver.current_window_handle
# move the driver to the first iFrame
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[0])
Note: driver.switch_to_frame() is deprecated. You might replace it with driver.switch_to.frame()
Now we click the checkbox, wait until the picture puzzle is loaded by reCaptcha’s API, and jump to the second frame, which contains the puzzle itself.
# ************* locate CheckBox **************
CheckBox = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))
    )
# ************* click CheckBox ***************
wait_between(0.5, 0.7)
# making click on captcha CheckBox
CheckBox.click()
#***************** back to main window *********************************
driver.switch_to_window(mainWin)
wait_between(2.0, 2.5)
# ******** switch to the second iframe by tag name ************
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])
Next, we continue iterating until we solve the reCaptcha’s picture puzzle. The write_stat procedure writes each passed attempt info into a CSV file for further statistical analysis.
i=1
while i<130:
    print('\n\r{0}-th loop'.format(i))
    # ******** check if checkbox is checked at the 1st frame ***********
    driver.switch_to_window(mainWin)
    WebDriverWait(driver, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME , 'iframe'))
        )
    wait_between(1.0, 2.0)
    if check_exists_by_xpath('//span[@aria-checked="true"]'):
        import winsound
        winsound.Beep(400,1500)
        write_stat(i, round(time()-start) - 1 ) # saving results into stat file
        break

    driver.switch_to_window(mainWin)
    # ********** To the second frame to solve pictures *************
    wait_between(0.3, 1.5)
    driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])
    solve_images(driver)
    i=i+1

# ***** main procedure to identify and submit picture solution
def solve_images(driver):
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"rc-imageselect-target"))
        )
    dim = dimention(driver)
    # ****************** check if there is a clicked tile ******************
    if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
        rand2 = 0
    else:
        rand2 = 1

    # wait before click on tiles
    wait_between(0.5, 1.0)
    # ****************** click on a tile ******************
    tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH , '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim ))))
        )
    tile1.click()
    if (rand2):
        try:
            driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim))).click()
        except NoSuchElementException:
            print('\n\r No Such Element Exception for finding 2nd tile')

    #****************** click on submit button ******************
    driver.find_element_by_id("recaptcha-verify-button").click()
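For the statistical analysis mentioned above, the rows that write_stat appends (loop count, elapsed seconds) can be summarized with the standard library. This is a minimal sketch that assumes the two-column layout written by write_stat. The function name `summarize_stats` is hypothetical.

```python
import csv
from statistics import mean


def summarize_stats(path="stat.csv"):
    """Return (runs, average loops, average seconds) for the runs logged in the CSV."""
    loops, seconds = [], []
    with open(path, newline="") as csvfile:
        for row in csv.reader(csvfile):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            loops.append(int(row[0]))
            seconds.append(float(row[1]))
    if not loops:
        return 0, 0.0, 0.0
    return len(loops), mean(loops), mean(seconds)
```

Keep in mind that the script only writes a row when the checkbox ends up checked, so these averages describe successful runs only.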
The whole code
import re, csv
from time import sleep, time
from random import uniform, randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

def write_stat(loops, time):
    with open('stat.csv', 'a', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow([loops, time])

def check_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True

def wait_between(a,b):
    rand=uniform(a, b)
    sleep(rand)

def dimention(driver):
    d = int(driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table').get_attribute("class")[-1]);
    return d if d else 3 # dimention is 3 by default

# ***** main procedure to identify and submit picture solution
def solve_images(driver):
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID ,"rc-imageselect-target"))
        )
    dim = dimention(driver)
    # ****************** check if there is a clicked tile ******************
    if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
        rand2 = 0
    else:
        rand2 = 1

    # wait before click on tiles
    wait_between(0.5, 1.0)
    # ****************** click on a tile ******************
    tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH , '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim ))))
        )
    tile1.click()
    if (rand2):
        try:
            driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim), randint(1, dim))).click()
        except NoSuchElementException:
            print('\n\r No Such Element Exception for finding 2nd tile')

    #****************** click on submit button ******************
    driver.find_element_by_id("recaptcha-verify-button").click()

start = time()
url='...'
driver = webdriver.Firefox()
driver.get(url)
mainWin = driver.current_window_handle

# move the driver to the first iFrame
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[0])

# ************* locate CheckBox **************
CheckBox = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))
    )
# ************* click CheckBox ***************
wait_between(0.5, 0.7)
# making click on captcha CheckBox
CheckBox.click()

#***************** back to main window **************************************
driver.switch_to_window(mainWin)
wait_between(2.0, 2.5)

# ************ switch to the second iframe by tag name ******************
driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])
i=1
while i<130:
    print('\n\r{0}-th loop'.format(i))
    # ******** check if checkbox is checked at the 1st frame ***********
    driver.switch_to_window(mainWin)
    WebDriverWait(driver, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME , 'iframe'))
        )
    wait_between(1.0, 2.0)
    if check_exists_by_xpath('//span[@aria-checked="true"]'):
        import winsound
        winsound.Beep(400,1500)
        write_stat(i, round(time()-start) - 1 ) # saving results into stat file
        break

    driver.switch_to_window(mainWin)
    # ********** To the second frame to solve pictures *************
    wait_between(0.3, 1.5)
    driver.switch_to_frame(driver.find_elements_by_tag_name("iframe")[1])
    solve_images(driver)
    i=i+1
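One detail of dimention is worth noting: it calls int() on the last character of the table’s class attribute, which raises ValueError whenever that character is not a digit, so the default of 3 is only reached for a trailing 0. A more defensive variant might look like the sketch below. The helper name `grid_dimension` and the sample class strings are assumptions for illustration, not something taken from reCaptcha’s documentation.

```python
def grid_dimension(class_attr, default=3):
    """Read the grid size from the last character of a table's class attribute."""
    last = (class_attr or "").strip()[-1:]
    # Accept only a plain non-zero ASCII digit; anything else falls back to the default.
    if last and last in "123456789":
        return int(last)
    return default


if __name__ == "__main__":
    # Hypothetical class strings, used only to exercise the helper.
    for sample in ("rc-imageselect-table-44", "some-table", ""):
        print(repr(sample), "->", grid_dimension(sample))
```

In the script it could be called as grid_dimension(table.get_attribute("class")), where table is the element located by the same XPath that dimention uses.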
Timeout limits
Now reCaptcha is sensitive to session timeouts. If the brute force fails to solve it within a certain period of time, reCaptcha’s JS algorithm deliberately stops any interaction:
– it shows the reCaptcha as ticked
– the Google server returns {"success":"false"} upon siteverify (point 3: ‘decode the response’).
I’ve found the reCaptcha timeout to be approx. 2.5-3 min.
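Given a timeout of roughly 2.5-3 minutes, it may make sense to stop the loop on elapsed time rather than on a fixed iteration count. Here is a minimal sketch, assuming a hypothetical `attempt` callable that performs one submission and returns True once the checkbox is confirmed. The 150-second budget is simply the lower bound of the estimate above.

```python
from time import monotonic


def run_until_deadline(attempt, budget_seconds=150.0, clock=monotonic):
    """Call attempt() until it returns True or the time budget is used up.

    Returns a (solved, attempts_made) tuple.
    """
    deadline = clock() + budget_seconds
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if attempt():
            return True, attempts
    return False, attempts
```

The `clock` parameter exists only so the function can be exercised with a fake clock; in the script the default monotonic clock would be used.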
Proxy usage
Some info from Lanre (our reader and contributor):
High-load sites challenge
I also discovered, while working on a site where I was filling in a form for a few thousand people, that this new captcha concept is linked to the IP address: as you progress through the iterations, your chances of getting verified decrease, until after 50-80 iterations your session times out and the captcha is no longer valid.
Dynamic IP
I wasn’t able to resolve the captchas, I guess because my ISP gives me a static IP. However, whenever I used DHCP on another ISP, I was able to resolve the captchas, but only once; then I needed to restart my router before any other successful resolution.
So a question arose: is the proxy concept the way to resolve this challenge?
I think it’s worth having a new IP each time you reach a site with reCaptcha, so that Google has no negative history of brute force attempts to solve reCaptcha. Otherwise the Google reCaptcha algorithm accumulates negative solution/timeout info for a particular IP and makes the following picture puzzles more complex. The proxy approach does not eliminate reCaptcha’s bot suspicion, though. Do not forget to spoof the user-agent string when using a proxy.
Conclusion
The remotely managed puzzle CAPTCHA turned out to be vulnerable to brute force, yet since Google enhanced it, it is not that simple to crack. There is a poor correlation between the number of attempts the user sends (submitting a form with the CAPTCHA) and the number of picture puzzle challenges set to the user. So browser automation brute force managed to break a seemingly deadlocked reCaptcha. Because of the increased puzzle complexity, the brute force might fail to solve it within the reCaptcha session time, so the session timeout limits this kind of solution. The success rate now (Apr. 2016) is ~30%. The average timeout is 3 min.
Comments or algorithm improvement suggestions welcome!
53 replies on “Solve ReCaptcha with Selenium (python)”
Do you have java version of this code.
Can you please send it to my id
Thanks
I also would like to know java version I am stuck at clicking verify button and getting pictures inside recaptcha etc
Do you have a java version for the above code? If yes, then please help me out.
Hello, this code issued timeout without finding ‘recaptcha-anchor’ checkbox. what could it be? Thanks
This might be because of a bad network connection with the Google server, or the Google server might be overloaded at times, so Selenium throws a timeout error. Try increasing the waiting times:
wait_between(0.3, 0.8)
I get an error thrown on the line “frameName1 = get_captcha_frames(driver.page_source)[0]”
It says list index is out of range. I’m guessing this is because the regex is not working?
Adam, thanks for the comment. Seems google has changed the iframe tag attributes. Now it does not contain id nor name attributes.
We need to fetch frame by other hooks, ex. role=”presentation”.
I am having an issue with the line: “frameName1 = get_captcha_frames(driver.page_source)[0]”
I guess this is because the regex is not working properly and thus is generating an empty list?
How about we run JS code in Selenium to insert name and id attributes into iframe tags: document.getElementsByTagName("iframe")[i].setAttribute("name", "name1"); document.getElementsByTagName("iframe")[i].setAttribute("id", "name1");
…
Then we switch to them as usual through driver.switch_to_frame("name1")
Can you please explain more on this concept, especially how I can add a name attribute to an iframe with Selenium using Python? I was able to get the 1st iframe, but switching between the main frame and the second iframe is a challenge for me. I am able to click the checkbox using frameName1 = regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[0]) instead of frameName1 = get_captcha_frames(regDriver.page_source)[0]
Thanks
Seems to me Google has complicated its reCaptcha code by removing the iframe’s name attribute. So fetching the iframe name with a regex is not possible any more. This also makes it hard to switch between frames.
Seems to me Selenium allows executing JS on a page, so this Selenium-driven JS might be used to add/remove iframe attributes, including names. Following this, one may identify the different in-page iframes.
I plan to redo the code in a near future.
Is it possible to translate this to C#?
Hi,
Thanks, I will be waiting for the new version; however, I finally got it to work. The concept is: any time you want to switch frames, just use XPath to return all iframes in the page and then use the index to access the one you need. But for this, you will have to go through the page source to find out how many iframes there are and their positions at each point where you intend to access them. It’s just like accessing the iframes manually (Google actually tried to keep regexes out in order to make a safer reCaptcha concept).
Like in my previous post, regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[0]) will access the first iframe and regDriver.switch_to_frame(regDriver.find_elements_by_tag_name("iframe")[1]) the next. But of course, like I said, you have to be sure which iframe is which every time your code accesses an iframe.
Hi, I also discovered, while working on a site where I was filling in a form for a few thousand people, that this new captcha concept is linked to the IP address: as you progress through the iterations, your chances of getting verified decrease, until after 50-80 iterations your session times out and the captcha is no longer valid. Perhaps you might want to look into this situation in your update.
Thanks
Sure, reCaptcha is IP bound. So, if wrong attempts are persistently done, the reCaptcha algorithm will put forth the harder challenges.
Session time out did not happen with me, even in huge iteration number. Can you expose any such cases?
OK, I will send you a mail to check out the site am working on. That will be a clearer approach.
Hi, have you gone over the code to check the issue I raised. Want to know your possible time line. Thanks
Ventis, sorry. I’m loaded with work to do. So possibly within two weeks I can do it. Yet, you can try to compose the code and submit for my testing…
Hi Igor, I have sent you a mail and attached the code i composed for the issue am having so you can go over for testing and possible corrections and additions to resolve the issue of session timeout on the site after successful captcha resolution the first time. Thanks
Hi Igor Savinkin, does this Selenium reCaptcha solution still work?
I am trying it, and it is not working.
No. google has sophisticated it since then. I think it takes too many tricks to apply to break it now.
It is quite difficult now. I managed to develop a model giving me 96% accuracy on their audio captcha, but even if I enter the correct solution ten times via Selenium, it keeps asking for new trials… If I enter the solution manually it works => mouse detection is at play
The only way now will be to dig deeply in it:
https://github.com/neuroradiology/InsideReCaptcha
Anyone interested?
Aaron R. Phalen
Now reCaptcha is session timeout sensitive. So if the brute force fails to solve it within a certain period of time, reCaptcha’s JS algorithm stops any interactions:
– makes reCaptcha ticked up
– google server returns {"success":"false"} upon siteverify (point 3: ‘decode the response’).
I have been interested in this solution where I work as a security consultant for a company. I see the timeout of the brute force is an issue per the solution description; however, I was wondering about using a service like DeathByCaptcha’s API to receive the solution to the puzzle and apply it to the ReCaptcha iframe DOM. This would significantly reduce solution time and avoid the timeout issue, as it is superior to the brute force method.
An additional thought is a solution for when ReCaptcha states that it will require multiple submissions. Thoughts?
Thanks!
Sure, you can use a service like DeathByCaptcha’s API to receive the solution to the puzzle, and it is likely to be quicker than the brute force approach. I should mention that the Google reCaptcha algorithm might require several successful puzzle solutions [in a row] before it validates a solver.
Thanks for the reply. Any ideas on how to incorporate the solved coordinates from DeathByCaptcha, i.e. an array given back as the coordinates that need to be clicked relative to the top-left vertex of the captcha image? I am thinking of an algorithm to transform the coordinates into the needed selector path, or perhaps an offset click event in Selenium. Ideally, I am looking to research the latter. Your thoughts?
Best,
Aaron
So far no thoughts.
Is there a javascript version of this?
So far not. Selenium works well to drive browser with all its events processing.
I have found the above solution does not work, as click() on the thumbnail captcha image is not triggering for me… I tried adding a class to the td, as to signify being checked, however this was not recognized upon clicking the verify button.. Thoughts?
Let me recheck the code tonight or tomorrow.
Looking forward to this Igor Savinkin.
Also, for your consideration:
http://nsl .cs.columbia. edu/papers/2016/recaptcha.eurosp16.pdf
(remove spaces)
Thank you, Alex.
Aaron, I’ve replaced the code with new one. It works normal at my env. (Win 64 + python). Any particular issues?
Yep, working.. Thanks.
Hello, for the past few days, when I try to click the checkbox through webdriver, I keep getting a CAPTCHA with “disappearing” pictures (Click verify once there are none left.) every time!
Does someone know what to do? Any thoughts? Thanks
Can you tell me what’s wrong?
When I ran the code in Python, the screen showed this.
Can you help me?
Traceback (most recent call last):
File “/Users/xxx/Desktop/Test.py”, line 109, in
solve_images(driver)
File “/Users/xxx/Desktop/Test.py”, line 35, in solve_images
EC.presence_of_element_located((By.ID ,”rc-imageselect-target”))
File “/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/support/wait.py”, line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
at FirefoxDriver.prototype.findElementInternal_ (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/driver-component.js:10770)
at FirefoxDriver.prototype.findElement (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/driver-component.js:10779)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
at DelayedCommand.prototype.execute/< (file:///var/folders/4b/2_csbmp13t70gfms_15q9n3h0000gn/T/tmp0lyqljb6/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)
This code doesn’t work again.
I’m getting error:
Traceback (most recent call last):
File “…/recaptcha.py”, line 96, in
driver.switch_to_frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range
like there wasn’t 2 frames, after clicking checkbox.
Any idea how to solve this problem ?
I was testing here: http://patrickhlauke.github.io/recaptcha/
Hey, I’m having the exact same issue you had. Did you end up solving it?
Try driver.switch_to.frame(driver.find_elements_by_tag_name(“iframe”)[1])
Traceback (most recent call last):
File “…./recaptcha.py”, line 87, in
driver.switch_to.frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range
I did change “driver.switch_to_frame” to “driver.switch_to.frame”, but still have the same error.
The same error
Me too …
Traceback (most recent call last):
File “D:\Python\shell.py”, line 87, in
driver.switch_to_frame(driver.find_elements_by_tag_name(“iframe”)[1])
IndexError: list index out of range
Hi,
You can try to use find_element_by_xpath:
driver.switch_to.frame(driver.find_element_by_xpath(“//iframe[@title=’recaptcha challenge’]”))
Have you found a way to get it working with the new updates?
So far not.
Is this code still working?
I think, it does not work now.
It would be great if they could find a solution (in connection with the latest changes). I’m also looking for an option to transfer the received captcha to special sites for its decryption.
Thanks for sharing the brute force method. I have modified the Python code to test our website (https://www.direct2drive.com/#!/pc).
# reference recaptcha solve selenium python
# The brute force approach works best for cracking this remotely supplied (by 3rd party) CAPTCHA.
import re, csv
from time import sleep, time
from random import uniform, randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

def write_stat(loops, time):
    with open('stat.csv', 'a', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow([loops, time])

def check_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True

def wait_between(a, b):
    rand = uniform(a, b)
    sleep(rand)

def dimention(driver):
    d = int(driver.find_element_by_xpath('//div[@id="rc-imageselect-target"]/table').get_attribute("class")[-1]);
    return d if d else 3 # dimention is 3 by default

# ***** main procedure to identify and submit picture solution
def solve_images(driver):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "rc-imageselect-target")))
    dim = dimention(driver)
    # ****************** check if there is a clicked tile ******************
    if check_exists_by_xpath('//div[@id="rc-imageselect-target"]/table/tbody/tr/td[@class="rc-imageselect-tileselected"]'):
        rand2 = 0
    else:
        rand2 = 1
    # wait before click on tiles
    wait_between(0.5, 1.0)
    # ****************** click on a tile ******************
    tile1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(
            randint(1, dim), randint(1, dim))))
    )
    tile1.click()
    if (rand2):
        try:
            driver.find_element_by_xpath(
                '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(randint(1, dim),
                                                                                          randint(1, dim))).click()
        except NoSuchElementException:
            print('\n\r No Such Element Exception for finding 2nd tile')
    # ****************** click on submit buttion ******************
    driver.find_element_by_id("recaptcha-verify-button").click()

print("start...")
start = time()
# go to D2D website
url = 'https://www.direct2drive.com/#!/pc'
driver = webdriver.Firefox()
driver.get(url)
# open login/sign up page
login_btn = driver.find_element_by_xpath("//ul[@id='navMenu']/li[2]/div/div/a")
login_btn.click()
sleep(3)
mainWin = driver.current_window_handle
# move the driver to the first iFrame
element = driver.find_element_by_xpath("//div[@id='captcha']/div/div/iframe")
driver.switch_to.frame(element)
# ************* locate CheckBox **************
btn = driver.find_element_by_xpath("//*[@id='recaptcha-anchor-label']")
btn.click()
sleep(1)
# ***************** back to main window **************************************
# driver.switch_to.window(mainWin)
driver.switch_to.default_content()
wait_between(2.0, 2.5)
# ************ switch to the second iframe by tag name ******************
driver.switch_to.frame(driver.find_element_by_xpath("//iframe[@title='recaptcha challenge']"))
i = 1
while i < 130:
    print('\n\r{0}-th loop'.format(i))
    # # ******** check if checkbox is checked at the 1st frame ***********
    # driver.switch_to.window(mainWin)
    # WebDriverWait(driver, 10).until(
    #     EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, 'iframe'))
    # )
    # wait_between(1.0, 2.0)
    # if check_exists_by_xpath('//span[@aria-checked="true"]'):
    #     import winsound
    #
    #     winsound.Beep(400, 1500)
    #     write_stat(i, round(time() - start) - 1) # saving results into stat file
    #     break
    #
    # driver.switch_to.window(mainWin)
    # # ********** To the second frame to solve pictures *************
    # wait_between(0.3, 1.5)
    # driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[1])
    solve_images(driver)
    i = i + 1
Not working.
Seems google have fully changed the reCaptcha structure…