Because Selenium WebDriver was created for browser automation, it can easily be used to scrape data from the web. In this post we will consider some advantages and drawbacks of using WebDriver for web scraping.
Advantages
1. WebDriver can simulate a real user working with a browser
2. WebDriver can scrape a web site using a specific browser
While many web scraping programs do use a real web browser for data extraction, in most cases that browser is the WebBrowser control, which is based on Internet Explorer. WebDriver, however, works not only with Internet Explorer but also with a variety of other browsers and platforms, such as Google Chrome, Firefox, Opera, HtmlUnit and even Android and iOS.
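As a minimal sketch of choosing the browser at run time, assuming the Selenium Python bindings and the matching browser are installed locally: the names `BROWSER_CLASSES`, `make_driver` and `page_title` are hypothetical helpers, not part of Selenium, and only three browsers are mapped here for illustration.

```python
# Hypothetical mapping from a short name to a selenium.webdriver class name.
BROWSER_CLASSES = {
    "chrome": "Chrome",
    "firefox": "Firefox",
    "edge": "Edge",
}


def make_driver(browser_name):
    """Start the requested browser and return its WebDriver instance."""
    key = browser_name.strip().lower()
    if key not in BROWSER_CLASSES:
        raise ValueError(f"unsupported browser: {browser_name!r}")
    # Imported lazily so this module loads even where Selenium is missing.
    from selenium import webdriver
    return getattr(webdriver, BROWSER_CLASSES[key])()


def page_title(browser_name, url):
    """Open url in the chosen browser and return the page title."""
    driver = make_driver(browser_name)
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always close the browser process
```

With this shape, the scraping logic stays the same and only the name passed to `make_driver` changes, for example `page_title("firefox", "https://example.com/")`.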
3. WebDriver can scrape complicated web pages with dynamic content
4. WebDriver is able to take screenshots of the webpage
If you need to see what a web page looks like, you need a browser that can render it. WebDriver is a convenient way to get screenshots when you need them.
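Taking a screenshot might look roughly like the sketch below. It assumes the Selenium Python bindings, whose driver objects provide `get` and `save_screenshot`; the helper name `capture_page` and the example URL and file name are made up for illustration. Depending on the browser, the image may cover only the visible part of the page.

```python
def capture_page(driver, url, image_path):
    """Load url in an already started browser and save a PNG screenshot.

    Returns True if the driver reports that the file was written.
    """
    driver.get(url)
    return bool(driver.save_screenshot(image_path))


# Usage sketch (requires Selenium and a local Firefox installation):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       capture_page(driver, "https://example.com/", "example.png")
#   finally:
#       driver.quit()
```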
Drawbacks
1. The program becomes quite large
Even if you only need to scrape a small amount of data, your program must be linked with all the Selenium WebDriver libraries (about 4-5 MB in total), and a driver executable must be installed for each browser you want to use for scraping (about another 6 MB, at least in the case of ChromeDriver). As a result, your program may grow from 10 KB to 10 MB!
2. A browser application needs to be started
When you use WebDriver to scrape web pages, you load a whole web browser into system memory. This not only takes time and consumes system resources, but may also cause your security subsystem to react (and even prevent your program from running).
3. The scraping process is slower
Because the browser waits until the whole web page has loaded before it lets you access its elements, the scraping process may take longer than making simple HTTP requests to the web server.
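For comparison, here is a rough sketch of the plain HTTP approach using only the Python standard library. It fetches a single document and extracts the page title without rendering anything or running scripts, so it only works when the data is present in the raw HTML. The names `TitleParser`, `extract_title` and `fetch_title` are illustrative.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_title(url, timeout=10):
    """Issue one HTTP request and parse the response body."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_title(html)
```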
4. The browser generates more network traffic
Web browsers load a lot of supplementary files that may be of no value to you (such as CSS, JS and image files). This may generate far more traffic than requesting only the resources you really need (using separate HTTP requests).
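To get a feel for how many extra requests a browser would make for a given page, one option is to count the stylesheets, external scripts and images referenced in its HTML. The sketch below uses only the Python standard library; `ResourceCounter` and `supplementary_resources` are hypothetical names, and the count is a lower bound because it ignores fonts, icons and anything loaded later by scripts.

```python
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Collect URLs of stylesheets, external scripts and images."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            # Inline scripts have no src attribute and cost no extra request.
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(attrs["href"])


def supplementary_resources(html):
    """Return the referenced resource URLs in document order."""
    parser = ResourceCounter()
    parser.feed(html)
    return parser.resources
```

Calling `len(supplementary_resources(html))` on a downloaded page gives a rough number of additional requests that a plain HTTP scraper would skip.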
5. Scraping can be detected by means as simple as Google Analytics