
Tutorial: How to use Headless Firefox for Scraping in Linux

I have already written several articles on how to use Selenium WebDriver for web scraping, and all of those examples were for Windows. But what if you want to run your WebDriver-based scraper on a headless Linux server, for example a Virtual Private Server with SSH-only access? Here I will show you how to do it in a few simple steps.

Let’s say you already have a virtual or dedicated Debian server with Python installed. The following tutorial will guide you from installing all necessary software to running your first WebDriver-based scraping program in Python. I assume that you are logged in as an administrator.

1. Install Xvfb

Since your server doesn’t have a screen for Firefox to run on, you need to simulate one. Xvfb is a program that simulates a display, doing everything in memory and not showing any screen output. You can install it with a single command:

apt-get install xvfb
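
To make the idea of a simulated display more concrete, here is a minimal sketch of what starting Xvfb by hand could look like. It assumes Python 3 and that the `Xvfb` binary is on your PATH. The display number 99 and the helper names `build_xvfb_command` and `start_xvfb` are my own choices for illustration and are not part of any library. PyVirtualDisplay (step 3) takes care of all of this for you.

```python
import os
import subprocess


def build_xvfb_command(display_num=99, width=800, height=600, depth=24):
    """Build the argument list for an Xvfb server on the given display."""
    # "-screen 0 WxHxD" configures screen 0 with the given size and color depth.
    screen = "{}x{}x{}".format(width, height, depth)
    return ["Xvfb", ":{}".format(display_num), "-screen", "0", screen]


def start_xvfb(display_num=99):
    """Start Xvfb in the background and point DISPLAY at it (hypothetical helper)."""
    process = subprocess.Popen(build_xvfb_command(display_num))
    # X clients such as Firefox read DISPLAY to decide which X server to use.
    os.environ["DISPLAY"] = ":{}".format(display_num)
    return process
```

`start_xvfb()` returns the process object, so you can call its `terminate()` method once you no longer need the display.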

2. Install Firefox

If Firefox is not installed on your system, you can install it as follows:

apt-get remove iceweasel
echo -e "\ndeb http://downloads.sourceforge.net/project/ubuntuzilla/mozilla/apt all main" | tee -a /etc/apt/sources.list > /dev/null
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com C1289A29
apt-get update
apt-get install firefox-mozilla-build
apt-get install libdbus-glib-1-2
apt-get install libgtk2.0-0
apt-get install libasound2

The first command removes Iceweasel, Debian’s native browser (if it is installed on your system). Then we add a package repository that contains Firefox, import the corresponding key and update the local package list. After that we install Firefox along with some libraries it needs (some of them may already be installed on your system).
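
If you want to confirm that the installation worked, a small check like the following may help. It is only a sketch, assuming Python 3 and that the package puts a `firefox` executable on your PATH. The function names and the sample version string in the docstring are made up for illustration.

```python
import re
import subprocess


def parse_firefox_version(output):
    """Extract the version numbers from text like 'Mozilla Firefox 45.0.2'."""
    match = re.search(r"Firefox\s+(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_firefox_version():
    """Run `firefox --version` and return the parsed version, or None."""
    try:
        output = subprocess.check_output(["firefox", "--version"])
    except (OSError, subprocess.CalledProcessError):
        # The binary is missing from PATH, or it exited with an error.
        return None
    return parse_firefox_version(output.decode("utf-8", "replace"))
```

Calling `installed_firefox_version()` should give you a tuple of integers if Firefox is installed, and `None` otherwise.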

3. Install PyVirtualDisplay

PyVirtualDisplay is a Python wrapper for Xvfb. It allows you to easily work with a virtual display in Python. Installation is simple:

pip install pyvirtualdisplay

If you don’t have pip on your system you can install it with the following command:

curl --silent --show-error --retry 5 https://raw.github.com/pypa/pip/master/contrib/get-pip.py | sudo python

4. Install Selenium

To install Selenium you can run the following:

pip install selenium
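
Before moving on, you may want to verify that both Python packages can be found. Below is a minimal sketch, assuming Python 3. The helper names `missing_modules` and `report` are hypothetical. The check only looks the modules up and does not import them.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=("pyvirtualdisplay", "selenium")):
    """Print which of the tutorial's Python packages still need installing."""
    missing = missing_modules(names)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All packages found.")
    return missing
```

Run `report()` from a Python shell. If it lists a package, repeat the corresponding `pip install` step.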

5. Run a simple scraping program

Now we’re ready to run a simple program that uses Firefox to scrape the title of Google’s home page (I found this code here):

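from pyvirtualdisplay import Display
from selenium import webdriver

# Start an invisible virtual display (Xvfb) for Firefox to render into
display = Display(visible=0, size=(800, 600))
display.start()

# Launch Firefox, load the page and print its title
browser = webdriver.Firefox()
browser.get('http://www.google.com')
print(browser.title)  # parentheses work in both Python 2 and 3
browser.quit()

# Shut down the virtual display
display.stop()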
That’s it!

15 replies on “Tutorial: How to use Headless Firefox for Scraping in Linux”

Pity you can’t use it then, eh?

“No part of this website or any of its contents may be reproduced, copied, modified or adapted, without the prior written consent of the author”

Pfffft.

Jonathan, you really MAY use the site’s content. You just may not reproduce, copy, modify or adapt it as your own. If you copy something for use (or reposting), just include a link/reference to the original blog post.

The problem is, if I can’t copy the code, then I can’t use the tutorial, and I can’t test if it works.

If I do, and then use it, I am now violating the ToS, which is a felony.

PROTIP: By HTTP GETting this web page, it already got copied and distributed many MANY times. From device to device, from buffer to buffer, from chip to chip, from display to eyes and walls and maybe windows and whatnot.
PROTIP: It is literally physically impossible to control, or even *know* if somebody copied it. You may be physically (and more often also practically) unable to ever tell the difference in your universe! We might all have shared it behind your back, and never told you. This might even have happened outside of the event horizon of your light cone! (So practically outside of your universe.) Making the term “ownership” lose all meaning and be as undefined for information, as “north of here” is for the north pole.

This is the same insanely clueless nonsense as DRM or the entire imaginary property scam.

How do people even manage to become IT professionals, and not get (or stay impervious to) such basic facts of nature as that information cannot be owned/stolen/sold/rented?!?
Seriously… How?

Thank you for posting this. I’ve been looking for something as simple as this to get screenshots from Firefox on a headless server. Nice and simple method.

If you read what the poster said, you only need to attribute the source to use it. That is, simply comment the code to state where it came from. Is that too much of a hassle?

This really helped me. I am working on a problem where the requested URLs cannot be escaped when the request is made to the server due to a limitation of the web application. PhantomJS (as well as urllib, urllib2, and requests) all use percent-escaping on URLs to comply with the RFCs on the subject. However, Firefox, Chrome, and Safari send the URLs intact.

In order to use Firefox, however, I needed a headless option. You have helped explain that option to me with this post, and it has really helped me out. Thanks!

Thanks for blogging this. The pyvirtualdisplay package was the secret ingredient that made all the packages play nicely together!
Here: headless server with Debian Wheezy.
Just works.

Can anyone help me use it with Java?
Maybe it’s not a good question, but I am new to Selenium and I want to run my test case headlessly on a live Jenkins server. Please help me: I have to take my test live and I need it immediately. I tried PhantomJS but it is not working.

thanks & regards:
Muhammad Balal

Great post, thanks.
I am new to scraping and need help in getting the structure of a page (e.g., the ids, names or XPaths of the divs, the links, etc.) as an aid in automating tests.
Can anyone point to some sample code?
Much appreciated.
