Headless browser Python scraper on PythonAnywhere

Recently I decided to use pythonanywhere.com to run Python scripts against JavaScript-heavy websites.

Originally I tried to use the dryscrape library, but could not get it to work. Their support team kindly explained why: “…unfortunately dryscrape depends on WebKit, and WebKit doesn’t work with our virtualisation system.”

A headless browser is by definition a web browser without a graphical user interface (GUI).

So they pointed me to the Selenium + Firefox combination, as described in this post.
In short, I installed pyvirtualdisplay, a Python wrapper for Xvfb (X virtual framebuffer). Once Xvfb is started, Firefox renders into that in-memory display instead of onto a real screen. This is how a headless browser is simulated.

To install pyvirtualdisplay in a Bash console:

$ pip3.5 install --user pyvirtualdisplay

From Wikipedia: Xvfb or X Virtual FrameBuffer is a display server implementing the X11 display server protocol. In contrast to other display servers, Xvfb performs all graphical operations in memory without showing any screen output. From the point of view of the client, it acts exactly like any other X display server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to have a screen or any input device. Only a network layer is necessary.

“Headless” servers can use a virtual display like Xvfb to spoof apps like Firefox into running if there’s no real screen for them to actually be displayed on.
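To make the mechanism concrete, here is a minimal sketch of roughly what a wrapper like pyvirtualdisplay does for you: launch an Xvfb process on some display number and point the DISPLAY environment variable at it, so that GUI programs started afterwards draw into memory. The helper names `xvfb_command` and `start_xvfb` and the display number 99 are my own assumptions for illustration, not part of pyvirtualdisplay, which handles details such as picking a free display number itself.

```python
import os
import shutil
import subprocess


def xvfb_command(display_num=99, width=800, height=600, depth=24):
    """Build the command line for an Xvfb server on display :<display_num>."""
    screen = "{}x{}x{}".format(width, height, depth)
    return ["Xvfb", ":{}".format(display_num), "-screen", "0", screen]


def start_xvfb(display_num=99, width=800, height=600):
    """Start Xvfb and set DISPLAY; return the process, or None if Xvfb is missing."""
    if shutil.which("Xvfb") is None:
        return None  # Xvfb is not installed on this machine
    proc = subprocess.Popen(xvfb_command(display_num, width, height))
    # GUI programs launched from now on will connect to the virtual display
    os.environ["DISPLAY"] = ":{}".format(display_num)
    return proc


print(" ".join(xvfb_command()))
```

In practice I would still use pyvirtualdisplay rather than this hand-rolled version; the sketch is only meant to show that there is no magic involved.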

Now, with Selenium and Firefox rendering into the X virtual framebuffer, I put together a simple Python scraper based on Corey Goldberg’s code. Selenium and BeautifulSoup come preinstalled on pythonanywhere.

The Python code:
from pyvirtualdisplay import Display
from selenium import webdriver
from bs4 import BeautifulSoup 

# imports to work with loaded page
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

# initiate virtual display
display = Display(visible=0, size=(800, 600))
display.start()

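browser = None
try:
    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox()
    browser.get('https://www.bing.com')
    print(browser.title)  # should print out "Bing"
    # explicit wait: poll for up to 10 seconds until the search box is present
    # (the query field is assumed to be named "q"; "go" was the submit button)
    search = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.NAME, "q")))
    search.send_keys("python headless browser scrape")
    search.send_keys(Keys.RETURN)
    # wait for the results page (assumes its title contains the query)
    WebDriverWait(browser, 10).until(EC.title_contains("python headless"))
    # naming the parser avoids BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for a in soup.find_all('a'):
        print(a.attrs.get('href'))

finally:
    # browser stays None if Firefox failed to start
    if browser is not None:
        browser.quit()
    display.stop()  # ignore any output from this.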
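
The loop in the scraper prints raw href values, which include None for anchors without an href, in-page fragments, javascript: pseudo-links and relative paths. A small post-processing step could tidy that up. The sketch below is only one way to do it; `clean_links` is a hypothetical helper of mine, and it relies on nothing but the standard library, so it works on any list of href strings regardless of how they were extracted.

```python
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, base_url):
    """Return absolute http(s) URLs from raw href values, in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # missing href or in-page fragment
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # drop javascript:, mailto: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


raw = ["/search?q=x", None, "#top", "javascript:void(0)", "https://a.com/", "/search?q=x"]
for link in clean_links(raw, "https://www.bing.com"):
    print(link)
```

In the scraper it could be fed with the values collected in the loop, for example `clean_links([a.attrs.get('href') for a in soup.find_all('a')], browser.current_url)`.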
