Since Selenium WebDriver was created for browser automation, it can easily be used to scrape data from the web. In this post we consider some advantages and drawbacks of using WebDriver for web scraping.
Advantages:
1. WebDriver can simulate a real user working with a browser
2. WebDriver can scrape a web site using a specific browser
While many web scraping programs do use a real web browser for data extraction, in most cases the browser they use is the WebBrowser Control, which is Internet Explorer under the hood. WebDriver, however, works not only with Internet Explorer but also with a variety of other browsers such as Google Chrome, Firefox, Opera, HtmlUnit, and even browsers on Android and iOS.
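As a rough illustration of choosing the browser, here is a minimal sketch using the Selenium Python bindings. It assumes Selenium is installed and that Chrome or Firefox (with a matching driver) is available on the machine; `make_driver` is a hypothetical helper name, not part of Selenium.

```python
def make_driver(browser):
    """Start a WebDriver session for the named browser.

    Only Chrome and Firefox are handled in this sketch; other browsers
    would need their own driver classes and setup.
    """
    name = browser.lower()
    if name not in ("chrome", "firefox"):
        raise ValueError(f"unsupported browser: {browser!r}")

    # Imported here so the name check above works even without Selenium.
    from selenium import webdriver

    if name == "chrome":
        return webdriver.Chrome()
    return webdriver.Firefox()


# Usage (starts a real browser, so it is left commented out):
# driver = make_driver("firefox")
# driver.get("https://example.com")
# print(driver.title)
# driver.quit()
```

The rest of the scraping code stays the same whichever browser is started, which is the point of this advantage.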
3. WebDriver can scrape complicated web pages with dynamic content
4. WebDriver is able to take screenshots of the webpage
If you need to see what a web page looks like, you need a browser that can render it. WebDriver is a convenient way to get screenshots when you need them.
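A minimal sketch of taking a screenshot with the Selenium Python bindings might look like this. It assumes Selenium and Firefox are installed; `screenshot_path` and `take_screenshot` are hypothetical helper names, and the file naming scheme is only an illustration.

```python
import re
from pathlib import Path


def screenshot_path(url, out_dir="shots"):
    """Build a file name for the screenshot from the URL (illustrative scheme)."""
    name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return Path(out_dir) / f"{name}.png"


def take_screenshot(url, out_dir="shots"):
    """Open the page in Firefox and save a PNG of the rendered viewport."""
    from selenium import webdriver  # imported lazily; needs Selenium installed

    path = screenshot_path(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.save_screenshot(str(path))  # WebDriver writes the PNG file
    finally:
        driver.quit()  # always close the browser, even on errors
    return path


# Usage (starts a real browser, so it is left commented out):
# take_screenshot("https://example.com/a")
```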
Drawbacks:
1. The program becomes quite large
Even if you only need to scrape a small amount of data, your program must be linked with all the Selenium WebDriver libraries (about 4-5 MB in total), and a driver executable must be installed for each browser you want to use during scraping (about another 6 MB, at least in the case of ChromeDriver). Your program may therefore grow from 10 KB to 10 MB!
2. A browser application needs to be started
When you use WebDriver to scrape web pages, you load a whole web browser into system memory. This not only takes time and consumes system resources, but may also cause your security software to react (and even prevent your program from running).
3. The scraping process is slower
Since a browser waits until the whole web page is loaded, and only then allows you to access its elements, the scraping process may take longer in comparison with making simple HTTP requests to the web server.
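For comparison, the "simple HTTP request" approach can be sketched with only the Python standard library. This is a minimal illustration, not a full scraper: `extract_title` and `fetch_title` are hypothetical names, and the example pulls out only the page title.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Parse an HTML string and return its title text."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_title(url):
    """Download one document over HTTP and extract its title.

    No browser is started and no linked CSS, JS or images are fetched.
    """
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_title(html)


# Usage (needs network access, so it is left commented out):
# print(fetch_title("https://example.com"))
```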
4. The browser generates more network traffic
Web browsers load a lot of supplementary files that may be of no value to you (such as CSS, JS and image files). This may generate far more traffic than requesting only the resources you really need with separate HTTP requests.
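To get a feel for how many extra files a browser would request, one can list the resources a page links to. The sketch below uses only the Python standard library; `ResourceCounter` and `linked_resources` are hypothetical names, and it only looks at stylesheets, external scripts and images, which is a simplification of what a real browser fetches.

```python
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Collect URLs of stylesheets, external scripts and images."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("script", "img") and a.get("src"):
            self.resources.append(a["src"])
        elif tag == "link" and a.get("href"):
            # rel may hold several space-separated values
            rel = (a.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(a["href"])


def linked_resources(html):
    """Return the supplementary files a browser would request for this HTML."""
    counter = ResourceCounter()
    counter.feed(html)
    return counter.resources


sample = (
    '<link rel="stylesheet" href="s.css">'
    '<script src="a.js"></script>'
    '<img src="i.png">'
    "<script>var x = 1;</script>"  # inline script: no extra request
)
print(linked_resources(sample))  # ['s.css', 'a.js', 'i.png']
```

A plain HTTP scraper would download only the HTML document itself and skip everything in that list.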
5. The scraping can be detected by such simple means as Google Analytics
9 replies on “Pros and Cons of using Selenium WebDriver for Website Scraping”
This weblog provides valuable information to us. I enjoy reading your posts.
Yes, Selenium is very powerful!
“The scraping can be detected by such simple means as Google Analytics.”
How is this any different from other methods? Won’t all methods of screen scraping show up in Analytics data? Does it also depend on the implementation (i.e. how often you scrape)?
Kiran, the Selenium WebDriver launches and runs a real instance of a web browser, which executes all the JS in the page. When you scrape with server-side languages (PHP, Python, etc.), the in-page JS is never executed: the fetched HTML (along with the JS in it) is only parsed into a DOM structure and the data is extracted. The JS and CSS files linked in the HTML head section are not requested at all.
True, since analytics tools send XHRs to their home server reporting in-browser user (or bot) activity. The more often you scrape, the more suspicious it looks.
Every request to the server is counted in the website's server statistics. Yet what happens on the user's PC, in the browser, is known only through the in-page (JS) analytics feedback (sent by XHRs).
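The point made in the reply above can be illustrated with a small sketch using only the Python standard library. The page content, `TextById` and `text_by_id` are made up for this example; the parser handles only a simple, non-nested element.

```python
from html.parser import HTMLParser

# A made-up page whose script would fill in the price in a real browser.
PAGE = """<html><body>
<div id="price">loading...</div>
<script>document.getElementById("price").textContent = "42";</script>
</body></html>"""


class TextById(HTMLParser):
    """Collect the text of the element with a given id (no nesting support)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._open_tag = None
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and dict(attrs).get("id") == self.target_id:
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self.text += data


def text_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    return parser.text


# The script is treated as plain text and never runs, so the placeholder stays.
print(text_by_id(PAGE, "price"))  # prints: loading...
```

An analytics script embedded in the page would be skipped in the same way, which is why this kind of scraping does not appear in the in-page analytics data, only in the server's own request logs.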
But you can just use uBlock Origin to cut off Google Analytics.
Nice article! As for being detected by Google Analytics: you can set up a Firefox profile that has an ad blocker enabled with EasyPrivacy, so Google Analytics tracking scripts will be disabled. Also, by blocking ads you might save some bandwidth along the way 😉
Very informative article! Thank you.
I scrape a page that contains real-time data. I load the page once through Selenium and then pull the data every 2 seconds. Do you agree that this solution is better than making 30 requests per minute to the server, or am I missing something?
What do you mean by “pull the data every 2 seconds”? Where do you pull the data from?
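One possible reading of the approach described in the question is "load the page once, then re-read an element from the already open browser at a fixed interval". A minimal sketch of that loop follows; `poll` and `read_value` are hypothetical names, and the Selenium lines in the usage comment are an assumption about the commenter's setup.

```python
import time


def poll(read_value, interval=2.0, count=5, sleep=time.sleep):
    """Call read_value() `count` times, waiting `interval` seconds between calls."""
    samples = []
    for i in range(count):
        samples.append(read_value())
        if i < count - 1:  # no need to wait after the last sample
            sleep(interval)
    return samples


# Usage with Selenium (starts a real browser, so it is left commented out):
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# driver = webdriver.Firefox()
# driver.get("https://example.com/live")  # hypothetical URL
# prices = poll(lambda: driver.find_element(By.ID, "price").text)
# driver.quit()
```

Under this reading, each read goes to the local browser rather than to the website. Whether that is lighter on the server depends on how the page refreshes its own data, since the page may still be making requests in the background.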