Recently I decided to use pythonanywhere.com to run Python scripts against JavaScript-heavy websites.
Originally I tried to use the dryscrape library, but failed; a helpful support engineer explained why: “…unfortunately dryscrape depends on WebKit, and WebKit doesn’t work with our virtualisation system.”
So they pointed me to the Selenium + Firefox setup described in this post.
In short, I installed pyvirtualdisplay, a Python wrapper for Xvfb (X virtual framebuffer). Starting Xvfb creates an in-memory display, and Firefox then renders into that buffer instead of a real screen. This is how a headless browser is simulated.
To install pyvirtualdisplay in a Bash console:
$ pip3.5 install --user pyvirtualdisplay
“Headless” servers can use a virtual display like Xvfb to trick apps like Firefox into running even though there is no real screen for them to be displayed on.
Now, with Selenium, Firefox, and the display all running inside the X virtual framebuffer, I composed a simple Python scraper based on Corey Goldberg’s code; Selenium and BeautifulSoup come preinstalled on pythonanywhere.
The Python code:
from pyvirtualdisplay import Display
from selenium import webdriver
from bs4 import BeautifulSoup

# imports to work with the loaded page
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

# initiate the virtual display
display = Display(visible=0, size=(800, 600))
display.start()

browser = None
try:
    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox()
    browser.get('https://www.bing.com')
    print(browser.title)  # should print out "Bing"

    # explicit wait for the search field to appear
    search = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.NAME, "go")))
    search.send_keys("python headless browser scrape")
    search.send_keys(Keys.RETURN)

    soup = BeautifulSoup(browser.page_source, "html.parser")
    for a in soup.findAll('a'):
        print(a.attrs.get('href'))
finally:
    if browser is not None:
        browser.quit()
    display.stop()  # ignore any output from this
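As an aside, the final link-extraction loop does not strictly need BeautifulSoup: once you have browser.page_source as a string, the standard library’s html.parser (the same backend BeautifulSoup uses above) can do the job. A minimal sketch, with an invented sample page standing in for the real page source:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag,
    mirroring the BeautifulSoup loop in the scraper."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

# in the scraper this would be browser.page_source;
# here a tiny invented page for illustration
page_source = '<html><body><a href="/a">A</a><a href="/b">B</a><a>no href</a></body></html>'
parser = LinkExtractor()
parser.feed(page_source)
print(parser.links)  # → ['/a', '/b']
```

This avoids the bs4 dependency entirely, though BeautifulSoup is the more comfortable choice for anything beyond pulling one attribute.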