Some of you may be wondering whether it’s possible to extract a web browser’s local storage through web scraping.
In a nutshell, local storage lets a website store data on your machine instead of calling the server for it every time. Unlike cookies, local storage data is not sent to the server with every request, and large amounts of data can be stored locally without affecting website performance. It’s accessible through client-side scripting, namely JavaScript.
Why use JavaScript to fetch local storage? Is there another way to get it?
Here’s part of Wikipedia’s definition of local storage:
So, in my view, local storage is data stored by the web browser (e.g. Opera) somewhere on the hard drive of the machine (local or in the cloud) where the browser runs. To fetch it directly, you would need to dig into Opera’s data files, which is much harder. I think the simplest way is to use client-side scripting, namely JavaScript.
Python and Selenium
Plain HTTP client libraries in high-level programming languages don’t invoke a browser instance; they only request and extract raw HTML. So if we want to access the browser’s local storage when scraping a page, we need to launch a browser instance and use its JavaScript interpreter to read the local storage. For my money, Selenium is the best solution.
Here’s how to run custom scripts against a web browser instance through Selenium.
A possible replacement for Selenium is PhantomJS, which runs a headless browser (note that PhantomJS development has since been suspended).
JavaScript to iterate over the localStorage browser object
for (var i = 0; i < localStorage.length; i++) {
    var key = localStorage.key(i);
    console.log(key + ': ' + localStorage.getItem(key));
}
Advanced script
As mentioned here, an HTML5-capable browser should also implement Array.prototype.map, so the script would be:
Array.apply(0, new Array(localStorage.length)).map(function (o, i) {
    return localStorage.key(i) + ':' + localStorage.getItem(localStorage.key(i));
});
Python with Selenium script for setting up and scraping local storage
from selenium import webdriver

driver = webdriver.Firefox()
url = 'http://www.w3schools.com/'
driver.get(url)

# Set two items, then return every stored value as an array.
scriptArray = """localStorage.setItem("key1", 'new item');
localStorage.setItem("key2", 'second item');
return Array.apply(0, new Array(localStorage.length)).map(function (o, i) {
    return localStorage.getItem(localStorage.key(i));
});"""
result = driver.execute_script(scriptArray)
print(result)
driver.quit()  # close the browser when done
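The script above returns only the stored values. As a minimal sketch of one way to get the keys as well, the helper below builds a plain object in the page and brings it back as a Python dict. The names DUMP_SCRIPT and dump_local_storage are hypothetical, chosen for illustration; the only assumption about driver is that it exposes execute_script, as a Selenium WebDriver does.

```python
import json

# JavaScript run inside the page: copy every localStorage entry into a plain
# object and hand it back as a JSON string.
DUMP_SCRIPT = """
var out = {};
for (var i = 0; i < localStorage.length; i++) {
    var key = localStorage.key(i);
    out[key] = localStorage.getItem(key);
}
return JSON.stringify(out);
"""


def dump_local_storage(driver):
    """Return the current page's localStorage as a dict of str -> str.

    `driver` (hypothetical parameter name) is any object with an
    execute_script(script) method, such as a Selenium WebDriver that
    has already loaded a page with driver.get(url).
    """
    return json.loads(driver.execute_script(DUMP_SCRIPT))
```

With a Firefox driver like the one in the script above, calling dump_local_storage(driver) after driver.get(url) should give a dict holding key1 and key2 alongside whatever the site itself has stored. Going through JSON.stringify is just one design choice; it keeps the returned type predictable regardless of how the driver converts JavaScript objects.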
Python bindings alternative to Python+Selenium
Some might argue that Selenium is overkill when all you need is to extract local storage. If you think Selenium is too bulky, you might want to try a Python binding for a desktop development framework, e.g. PyQt. That’s something I might touch on in a later post.
5 replies on “Extract browser’s Local Storage with Python”
This is an excellent post. I was wondering how to get a dynamic captcha in Selenium other than by taking a screenshot. Now this looks like a real solution. I will test it some day.
OK, Karl. You might also read how to break reCaptcha v2.0 with Selenium by applying brute force.
That’s absolutely a cool solution, and since I subscribe to your blog, I have actually read it. I use the DeathbyCaptcha API to solve captchas, and it’s an easier solution (and cheap too).
Is there a way to get all the local storage files of a browser session? This method only gets the local storage files of the website we are currently on.
You can also just read the Web Storage (a/k/a local storage) files for your browser. Here’s how to do it with Chrome:
http://stackoverflow.com/questions/23454119/how-to-read-modify-a-local-file-of-html5-local-storage-from-python
It makes use of the fact that Chrome and Opera use “SQLite format 3” for Web Storage (a/k/a Local Storage, or DOM Storage). Under Windows 10 Chrome currently keeps its Web Storage files in this folder:
“%LOCALAPPDATA%\Google\Chrome\User Data\Default\Local Storage\”
Opera should be similar. (Old Opera used XML files, but recent versions of Opera are basically forks of Chrome / Chromium.)
Firefox is similar, except that it appears that Firefox uses one great big SQLite database for all Web Storage for all web pages; see Hugh Lee’s and Kevin Hakanson’s answers here:
http://stackoverflow.com/questions/7079075/where-does-firefox-store-javascript-html-localstorage
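As a rough sketch of the file-based approach described in this comment, the function below reads one SQLite Web Storage file with Python's built-in sqlite3 module. It rests on assumptions that are not stated above: that the file follows the older Chrome layout, with a table named ItemTable holding a text key column and a value blob of UTF-16-LE text. The function name read_storage_file is hypothetical. Newer Chrome builds appear to keep local storage in a LevelDB directory instead, and Firefox uses a different schema, so this would not apply there as written.

```python
import sqlite3


def read_storage_file(path):
    """Read one SQLite Web Storage file into a dict.

    Assumption: the file uses the older Chrome layout, i.e. a table named
    ItemTable with a text `key` column and a `value` blob containing
    UTF-16-LE encoded text.
    """
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("SELECT key, value FROM ItemTable").fetchall()
    finally:
        conn.close()

    items = {}
    for key, value in rows:
        # Blobs come back as bytes; decode them to text where possible.
        if isinstance(value, bytes):
            value = value.decode("utf-16-le", errors="replace")
        items[key] = value
    return items
```

It is probably safest to run this on a copy of the file, or with the browser closed, since the browser may hold a lock on the original.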