Recently, I was challenged to do bulk submits through an authenticated form. The website required a login. While there are plenty of examples of how to use POST and GET in Python, I want to share with you how I handled the session along with a cookie and authenticity token (CSRF-like protection).
In this post, we cover the key techniques needed when scripting this kind of web scraping task:
- persistent session usage
- cookie finding and storing [in session]
- “auth token” finding, retrieving and submitting in a form
Given
A website with an input form that contains an auth token. The auth token (CSRF-like) is different each time the form is loaded. The website requires a login.
What I want:
I want to submit a lot of similar input data, like ‘GE 1’, ‘GE 2’, etc., through that form into my account.
The main steps necessary to achieve the goal
- Get a cookie from a logged-in browser.
- Insert cookie into a session (of Python requests library).
- Fetch the current form hidden “auth token” (using regex) before each submit.
- Use that unique “auth token” for each POST request inside the session.
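The steps above can be sketched as one small function. This is only a minimal sketch under stated assumptions, not the final script: `submit_fragments`, `form_path` and `post_path` are hypothetical names, and `session` is assumed to be any object whose `get` and `post` methods behave like those of `requests.Session` and which already holds the login cookie.

```python
import re
from urllib.parse import urljoin

# Same idea as the regex used later in the post: capture the hidden field's value.
AUTH_RE = re.compile(r'name="authenticity_token"\s+value="(.*?)"')


def submit_fragments(session, base_url, fragments,
                     form_path='/dashboard?prediction=false',
                     post_path='/readings'):
    """Fetch a fresh token before every POST and submit one fragment per request.

    `session` is assumed to behave like requests.Session: get/post return
    objects with .text and .status_code. Returns (fragment, status_code) pairs.
    """
    results = []
    for fragment in fragments:
        # Step 3: load the form page and pull out the current token.
        page = session.get(urljoin(base_url, form_path))
        match = AUTH_RE.search(page.text)
        if match is None:
            raise RuntimeError('No authenticity_token found; is the cookie still valid?')
        # Step 4: send that token back together with the form data.
        response = session.post(
            urljoin(base_url, post_path),
            data={'utf8': '✓',
                  'authenticity_token': match.group(1),
                  'fragment': fragment,
                  'commit': 'Post Reading!'},
        )
        results.append((fragment, response.status_code))
    return results
```

With the real library, a call would look like `submit_fragments(session, URL, ['GE 1', 'GE 2'])`, where `session` is a `requests.Session()` that has the cookie loaded.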
1. Getting a cookie value from the browser
We get the cookie value(s) using the web developer tools (F12 in most browsers). Look at the following picture (a picture is better than 1000 words):
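If more than one cookie is needed, it may be easier to copy the whole `Cookie:` request header from the developer tools and split it in Python. The sketch below uses the standard library's `http.cookies.SimpleCookie` for that; `cookie_header_to_dict` is a hypothetical helper name, and the values in `raw` are placeholders, not real session data.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw_header):
    """Turn 'name1=value1; name2=value2' into {'name1': 'value1', ...}."""
    jar = SimpleCookie()
    jar.load(raw_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder values: paste the real string copied from the browser instead.
raw = '_exbel_session=PASTE_SESSION_VALUE_HERE; other_cookie=abc123'
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dictionary has the same shape as the `cookie1` dictionary that is pickled in the next step.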
2. Adding the cookie into a session object
First, we save the cookie(s) to a *.cookie file on disk using the pickle module. This is done only once.
## One time cookie saving into a file
import pickle
from urllib.parse import urlparse
URL = 'http://www.excellentbeliever.com/'
urlData = urlparse(URL)
cookieFile = urlData.netloc + '.cookie'
cookie1={'_exbel_session':'63b55ca6.............a2a5215e'}
with open(cookieFile, 'wb') as fp:
    pickle.dump(cookie1, fp)
Second, every time we start a session, we load the file's contents into the session object, so that all saved cookies become part of the session.
with open(cookieFile, 'rb') as f:
    print("Loading cookies...")
    session.cookies.update(pickle.load(f))
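The save and load steps can be tried without any network access. The following sketch writes the cookie dictionary into a temporary directory and reads it back; the temporary directory and the placeholder cookie value are assumptions made for the demonstration, while the file naming follows the post.

```python
import os
import pickle
import tempfile
from urllib.parse import urlparse

URL = 'http://www.excellentbeliever.com/'
# The file name is derived from the host part of the URL.
cookie_name = urlparse(URL).netloc + '.cookie'

# Placeholder value: the real one is copied from the browser.
cookie1 = {'_exbel_session': 'PASTE_SESSION_VALUE_HERE'}

with tempfile.TemporaryDirectory() as tmp_dir:
    cookie_path = os.path.join(tmp_dir, cookie_name)
    # Save once...
    with open(cookie_path, 'wb') as fp:
        pickle.dump(cookie1, fp)
    # ...and load again, as done at the start of every run.
    with open(cookie_path, 'rb') as fp:
        restored = pickle.load(fp)

print(cookie_name)          # www.excellentbeliever.com.cookie
print(restored == cookie1)  # True
```

Since unpickling can execute arbitrary code, only load *.cookie files that you created yourself.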
After we have loaded the cookie, we start scripting.
Main operations inside a loop
Inside the loop, over the input values, we do the following:
- Visit the page with the form and fetch the “auth token”
How to identify a form’s hidden field value? See the figure below:
The code to extract the form’s hidden input by regex:
regex_auth = r'(?:name="authenticity_token")\s+value="(.*?)"'
page = session.get( urljoin(URL, '/dashboard?prediction=false'))
matches = re.findall(regex_auth, page.text, re.MULTILINE)
auth_token = matches[0]
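To see what the regex captures, it can be run against a small piece of markup. The HTML below is a hypothetical sample shaped like a typical hidden-token form, not the real page, and the token value is made up.

```python
import re

regex_auth = r'(?:name="authenticity_token")\s+value="(.*?)"'

# Hypothetical form markup, shaped like what the regex expects.
sample_html = '''
<form action="/readings" method="post">
  <input name="utf8" type="hidden" value="&#x2713;" />
  <input type="hidden" name="authenticity_token" value="SAMPLE+TOKEN/123==" />
  <input type="text" name="fragment" />
</form>
'''

matches = re.findall(regex_auth, sample_html)
auth_token = matches[0] if matches else None
print(auth_token)  # SAMPLE+TOKEN/123==

# Caveat: the pattern relies on `name` coming directly before `value`.
reordered = re.findall(regex_auth, '<input value="x" name="authenticity_token" />')
print(reordered)  # []
```

If a site emits the attributes in another order, the regex has to be adapted or replaced with an HTML parser.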
- Make a POST request to submit data
pattern = 'GE '
post_data = {"utf8": "✓", "authenticity_token": auth_token ,
             "fragment": pattern + str(i), 'commit': 'Post Reading!'}
post_URL = urljoin(URL,'/readings')
page = session.post( post_URL , data = post_data)
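It can help to look at what such a request carries before sending it. The sketch below builds the same dictionary for two values of `i` and form-encodes it with the standard library, which should be close to the body that `requests` sends for `data=post_data`; the token here is a placeholder.

```python
from urllib.parse import urljoin, urlencode

URL = 'http://www.excellentbeliever.com/'
pattern = 'GE '
auth_token = 'SAMPLE_TOKEN'  # placeholder; the real one is fetched before each POST

post_URL = urljoin(URL, '/readings')
bodies = []
for i in range(1, 3):
    post_data = {"utf8": "✓", "authenticity_token": auth_token,
                 "fragment": pattern + str(i), 'commit': 'Post Reading!'}
    # Form encoding: spaces become '+', non-ASCII characters are percent-encoded.
    bodies.append(urlencode(post_data))

print(post_URL)
for body in bodies:
    print(body)
```

Only the `fragment` field changes between iterations in this demonstration; in the real loop the `authenticity_token` changes as well.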
The whole code
import os, re
import pickle, requests
from urllib.parse import urljoin, urlparse
# init vars
URL = 'http://www.excellentbeliever.com/'
regex_auth = r'(?:name="authenticity_token")\s+value="(.*?)"'
urlData = urlparse(URL)
cookieFile = urlData.netloc + '.cookie'
## One time cookie saving into a file
##cookie1={'_exbel_session':'63b5568921de51fe67fe847ca2a5215e'}
##with open(cookieFile, 'wb') as fp:
## pickle.dump(cookie1, fp)
##print ('cookieFile:', cookieFile)
login='xxx'
password='xxx'
signinUrl = urljoin(URL, "users/sign_in") # http://www.excellentbeliever.com/users/sign_in
# Names of the cookies worth saving after a fresh login
persistentCookieNames = ['_exbel_session']
with requests.Session() as session:
    try:
        with open(cookieFile, 'rb') as f:
            print("Loading cookies...")
            session.cookies.update(pickle.load(f))
    except Exception:
        # If the cookies could not be loaded from the file, get new ones by logging in
        print("Logging in...")
        post = session.post(
            signinUrl,
            data={
                'email': login,
                'password': password,
            }
        )
        try:
            with open(cookieFile, 'wb') as f:
                jar = requests.cookies.RequestsCookieJar()
                for cookie in session.cookies:
                    if cookie.name in persistentCookieNames:
                        jar.set_cookie(cookie)
                pickle.dump(jar, f)
        except Exception:
            # Do not leave a half-written cookie file behind
            if os.path.exists(cookieFile):
                os.remove(cookieFile)
            raise
    # load headers (update keeps the session's default headers)
    session.headers.update({'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Origin': URL,
        'Upgrade-Insecure-Requests': '1',
        'Content-Type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'})
    page = session.get(URL)
    print ('url:', URL)
    print ('status code:', page.status_code)
    # the account name shown on the page only when logged in
    login_marker = 'Igor Savinkin'
    if login_marker in page.text:
        print (login_marker , 'is logged in.' )
    print ("Session cookies:", session.cookies)
    pattern='GE '
    max_num=26
    for i in range(26, max_num+1):
        # get the auth token from the authenticated form
        print ('Get the token from the authenticated form')
        page = session.get( urljoin(URL, '/dashboard?prediction=false'))
        print ('Page with form status code:', page.status_code)
        matches = re.findall(regex_auth, page.text, re.MULTILINE)
        if matches:
            auth_token = matches[0]
            print ('Form auth token:', auth_token)
            post_data = {"utf8": "✓", "authenticity_token": auth_token ,
                         "fragment": pattern + str(i), 'commit': 'Post Reading!'}
            post_URL = urljoin(URL,'/readings')
        else:
            raise SystemExit('Something went wrong: no auth token found.')
        # send a post
        page = session.post( post_URL , data = post_data)
        print ('POST submit status code:', page.status_code)
        if 'Successfully' in page.text:
            print ('Form with "' + post_data["fragment"] + '" has been successfully submitted.' )