I have already written several articles on how to use Selenium WebDriver for web scraping and all those examples were for Windows. But what about if you want to run your WebDriver-based scraper somewhere on a headless Linux server? For example on a Virtual Private Server with SSH-only access. Here I will show you how to do it in several simple steps.Let’s say you already have a virtual or dedicated Debian server with Python installed. The following tutorial will guide you from installing all necessary software to running your first WebDriver-based scraping program in Python. I assume that you are logged in as an administrator.
1. Install Xvfb
Since your server doesn’t have a screen to run FireFox you need to simulate one. Xvfb is a software that simulates a display doing everything in memory and not showing any screen output. You can install it with a simple command:
apt-get install xvfb
2. Install Firefox
If Firefox is not installed on your system you can install it in the following way:
apt-get remove iceweasel echo -e "\ndeb http://downloads.sourceforge.net/project/ubuntuzilla/mozilla/apt all main" | tee -a /etc/apt/sources.list > /dev/null apt-key adv --recv-keys --keyserver keyserver.ubuntu.com C1289A29 apt-get update apt-get install firefox-mozilla-build apt-get install libdbus-glib-1-2 apt-get install libgtk2.0-0 apt-get install libasound2
The first command removes a native Debian browser Iceweasel (if it is installed on your system). Then we add a package repository that contains Firefox, install the corresponding key and update the local package list. After that we install Firefox with some libraries (some of them may probably be already installed on your system).
3. Install PyVirtualDisplay
PyVirtualDisplay is a Python wrapper for Xvfb. It allows you to easily work with a virtual display in Python. Installation is simple:
pip install pyvirtualdisplay
If you don’t have pip on your system you can install it with the following command:
curl --silent --show-error --retry 5 https://raw.github.com/pypa/pip/master/contrib/get-pip.py | sudo python
4. Install Selenium
To install Selenium you can run the following:
pip install selenium
5. Run a simple scraping program
Now we’re ready to run a simple program that uses Firefox for scraping Google’s home page title (I found this code here):
from pyvirtualdisplay import Display from selenium import webdriver display = Display(visible=0, size=(800, 600)) display.start() browser = webdriver.Firefox() browser.get('http://www.google.com') print browser.title browser.quit() display.stop()
15 replies on “Tutorial: How to use Headless Firefox for Scraping in Linux”
Wow finally a tutorial that:
1. Was not long, complicated
Pity you can’t use it then, eh?
“No part of this website or any of its contents may be reproduced, copied, modified or adapted, without the prior written consent of the author”
Jonathan, you MAY really use site’s content. You just do not reproduce, copy, modify or adapt it as your own. If you copy something for use (or reposting), just include a link/reference to the original blog post.
The problem is, if I can’t copy the code, then I can’t use the tutorial, and i cant test if it works.
If I do, and then use it, I am now violating the tos, which is a felony.
PROTIP: By HTTP GETing this web page, it got alread copied and distributed many MANY times. From device to device, from buffer to buffer, from chip to chip, from display to eyes and walls and maybe windows and whatnot.
PROTIP: It is literally physically impossible to control, or even *know* if somebody copied it. You may be physically (and more often also practically) unable to ever tell the difference in your universe! We might all have shared it behind your back, and never told you. This might even habe happened outside of the event horizon of your light cone! (So practically outside of your universe.) Making the term “ownership” lose all meaning and be as undefined for information, as “north of here” is for the north pole.
This is the same insanely clueless nonsense as DRM or the entire imaginary property scam.
How do people even manage to become IT professionals, and not get (or stay impervious to) such basic facts of nature as that information cannot be owned/stolen/sold/rented?!?
Thank you for posting this. I’ve been looking for something as simple as this to get screenshots from Firefox on a headless server. Nice and simple method.
If you read what the poster said, you only need to attribute the source to use it. That is simply comment the code to state where the code came from. Is that too much of a hassle?
Perfect! You saved me so much trouble! Thanks a ton.
I just wanted to thank you, easy, quick and worked.
Worked a treat!
This really helped me. I am working on a problem where the requested URLs cannot be escaped when the request is made to the server due to a limitation of the web application. PhantomJS (as well as urllib, urllib2, and requests) all use percent-escaping on URLs to comply with the RFCs on the subject. However, Firefox, Chrome, and Safari send the URLs intact.
In order to use Firefox, however, I needed a headless option. You have helped explain that option to me with this post, and it has really helped me out. Thanks!
Thanks for blogging this. The pyvirtualdisplay package was the secret ingredient that made all the packages play nicely together!
Here: headless server with Debian Wheezy.
Can anyone help me how can I use it with java?
May be its not a good question but I am new to selenium and I want to run My test case headlessly on jenkins live server. Plz help me I have to live my test and I need it immediatly?I tried with phantomjs but it is not working.
thanks & regards:
Great post, thanks.
I am new to scraping and need help in getting the structure of a page (e.g., the id’s, names or xpaths to the divs, the links, etc.) as aid in creating automation of tests.
Can anyone point to some sample code?
You may want to add installing geckodriver to your blog, without it this fails with the following “selenium.common.exceptions.WebDriverException: Message: ‘geckodriver’ executable needs to be in PATH.”