Since Selenium WebDriver was created for browser automation, it can easily be used for scraping data from the web. In this post we will consider some advantages and drawbacks of using WebDriver for web scraping.
The Advantages
1. WebDriver can simulate a real user working with a browser
Since WebDriver uses a real web browser to access the website, its activity does not differ from that of an ordinary user surfing the web. When you load a web page with WebDriver, the browser loads all of the site's resources (JavaScript files, images, CSS files and so on) and executes all the JavaScript on the page. At the same time it stores all the cookies created by the website and sends complete HTTP headers, just as any browser does. This makes it very hard to tell whether the site is being accessed by a real person or by a robot. While it is really burdensome to simulate all these actions in a program that sends “handmade” HTTP requests to the server, with WebDriver you can do it in a few simple steps.
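For instance, a minimal sketch in Python might look like this (it assumes Chrome and chromedriver are installed; the URL is just a placeholder):

```python
from selenium import webdriver

driver = webdriver.Chrome()             # starts a real Chrome instance
driver.get("https://example.com/")      # loads the page with all its resources and runs its JavaScript
print(driver.title)                     # the fully rendered page is now available
html = driver.page_source               # rendered HTML, not the raw server response
driver.quit()                           # always close the browser when you are done
```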
2. WebDriver can scrape a web site using a specific browser
While many web scraping programs do use a real web browser for data extraction, in most cases the browser they use is the WebBrowser control, which is essentially embedded Internet Explorer. WebDriver, however, works not only with Internet Explorer but also with a variety of browsers such as Google Chrome, Firefox, Opera and HtmlUnit, and even with Android and iOS browsers.
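Switching browsers is mostly a matter of instantiating a different driver class. A sketch in Python (each line assumes the corresponding browser and driver executable are installed):

```python
from selenium import webdriver

driver = webdriver.Chrome()      # Google Chrome (needs chromedriver)
# driver = webdriver.Firefox()   # Mozilla Firefox (needs geckodriver)
# driver = webdriver.Ie()        # Internet Explorer
# driver = webdriver.Edge()      # Microsoft Edge
driver.get("https://example.com/")
driver.quit()
```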
3. WebDriver can scrape complicated web pages with dynamic content
Sometimes the data you need to extract is not in the raw HTML you get from an HTTP request. It may be generated dynamically (using AJAX and JavaScript, as in our test case). Though it is still possible to get this data with plain HTTP requests (by analyzing the traffic and the JavaScript code that processes the data), it is often much easier to let a web browser do it for you. In this case WebDriver comes to the rescue.
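A minimal sketch of waiting for dynamically generated content in Python (the URL and the "#results" selector are hypothetical placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-page")

# Wait (up to 10 seconds) until the element produced by JavaScript appears in the DOM.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#results"))
)
print(element.text)
driver.quit()
```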
4. WebDriver is able to take screenshots of the webpage
If you need to see what a web page actually looks like, you need a browser that can render it. WebDriver provides a very convenient way to take those screenshots whenever you need them.
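Taking a screenshot is a one-liner once the page is loaded; a sketch in Python (again, the URL and file name are placeholders):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/")
driver.save_screenshot("page.png")   # saves a PNG of the rendered page
driver.quit()
```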
The Drawbacks
1. The program becomes quite large
Even if you need to scrape only a small portion of data, your program has to be linked with all the Selenium WebDriver libraries (about 4-5 MB in total), and a driver executable has to be installed for each browser you want to use during scraping (roughly another 6 MB, at least in the case of ChromeDriver). As a result, your program may grow from 10 KB to 10 MB!
2. A browser application needs to be started
When you use WebDriver to scrape web pages, you load a whole web browser into system memory. This not only takes time and consumes system resources, but may also cause your security software to react (and even prevent your program from running).
3. The scraping process is slower
Since the browser waits until the whole web page is loaded and only then lets you access its elements, the scraping process may take considerably longer than making simple HTTP requests to the web server.
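For comparison, a plain HTTP request (sketched here with the third-party requests library) fetches only the raw HTML of a single resource and is usually much faster, but no JavaScript is executed and no sub-resources are loaded:

```python
import requests

response = requests.get("https://example.com/")   # one request, one resource
raw_html = response.text                          # raw server response; dynamically generated content is missing
```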
4. The browser generates more network traffic
Web browsers load a lot of supplementary files that may be of no value to you (such as CSS, JavaScript and image files). This can generate far more traffic than requesting only the resources you really need (via separate HTTP requests).
5. The scraping can be detected by such simple means as Google Analytics
If you scrape too many pages using WebDriver, you can easily be detected by any JavaScript-based traffic-tracking tool (like Google Analytics). The website owner does not even need to install any sophisticated scrape-bot detection mechanism!
Conclusion
All the drawbacks mentioned above follow from the fact that Selenium WebDriver is not primarily intended for web scraping (its domain is browser automation), but as web scraping specialists we can still benefit greatly from having it in our tool set as a powerful scraping instrument. It is really not hard to integrate it into almost any web scraping solution written in Java, C#, Ruby, Python, JavaScript (Node.js) or even PHP, but in the end it is up to you whether to use it or not. I hope this article will help you make the right decision.
9 replies on “Pros and Cons of using Selenium WebDriver for Website Scraping”
This weblog provides valuable information to us. I enjoy reading your posts.
Yes, Selenium is very powerful!
You state:
“The scraping can be detected by such simple means as Google Analytics.”
How is this any different from other methods? Won’t all methods of screen scraping show up in Analytics data? Does it also depend on the implementation (i.e. how often you scrape)?
Kiran, [Selenium] WebDriver invokes and runs a real instance of a web browser that executes all of the in-page JS. When you use server-side languages (PHP, Python, etc.) [for scraping], the in-page JS has no way to be executed, since the fetched HTML (along with the JS in it) is simply parsed into a DOM structure and the data extracted. Also, the JS and CSS files linked in the HTML head section are not requested at all.
True, since analytics tools send XHRs to their home server with “in-browser user [or bot] activities”. The more often, the more suspicious.
All requests to the server will be counted in the [website] server’s statistics. Yet, what happens on the user’s PC, inside the browser, is known only through the in-page (JS) analytics feedback (via XHRs).
But you can just use uBlock Origin to cut off Google Analytics.
Nice article! In case of being detected by Google Analytics, you can set up a Firefox profile that has an ad blocker enabled with the EasyPrivacy list, so Google Analytics tracking scripts will be disabled. Also, by blocking ads you might save some bandwidth along the way 😉
Cheers!
Rick
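A rough sketch of Rick’s suggestion in Python (the .xpi path and the choice of blocker are placeholders; any content blocker with an EasyPrivacy-style list would do):

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.install_addon("/path/to/adblocker.xpi")   # load a content blocker into this Firefox session
driver.get("https://example.com/")               # tracking scripts such as Google Analytics are now blocked
driver.quit()
```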
Very informative article! Thank you.
I scrape a page which contains real-time data. I load the page once through Selenium and then pull the data every 2 seconds. Do you agree that this solution is better than making 30 requests per minute to the server, or am I missing something?
What do you mean by “pull the data every 2 seconds”? Where do you pull the data from?