Recently I decided to use pythonanywhere.com to run Python scripts against JavaScript-heavy websites.
Originally I tried to leverage the dryscrape library, but I failed to get it working, and a helpful support engineer explained why: “…unfortunately dryscrape depends on WebKit, and WebKit doesn’t work with our virtualisation system.”
So they pointed me to the Selenium + Firefox setup described in this post.
In short, I installed pyvirtualdisplay, a Python wrapper for Xvfb (short for “X virtual framebuffer”), which runs a display inside a virtual framebuffer. Starting Xvfb gives Firefox a display to render into, so all of the browser’s output goes into the buffer instead of a real screen. This is how a headless browser is simulated.
To install pyvirtualdisplay in a Bash console:
$ pip3.5 install --user pyvirtualdisplay
“Headless” servers can use a virtual display like Xvfb to trick apps like Firefox into running even though there is no real screen for them to be displayed on.
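To see that mechanism in isolation, here is a minimal sketch (assuming Xvfb is installed on the host, as it is on PythonAnywhere): pyvirtualdisplay starts an Xvfb process and points the DISPLAY environment variable at it, so any GUI app launched afterwards renders into the virtual screen.
# minimal sketch: start a virtual display and confirm DISPLAY points at it
import os
from pyvirtualdisplay import Display

display = Display(visible=0, size=(800, 600))
display.start()
print(os.environ['DISPLAY'])  # e.g. ':1001' -- the Xvfb screen, not a real one
display.stop()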
Now, with Selenium, Firefox and the display all running inside the X virtual framebuffer, I composed a simple Python scraper based on Corey Goldberg’s code; Selenium and BeautifulSoup come preinstalled on pythonanywhere.
The Python code:
from pyvirtualdisplay import Display
from selenium import webdriver
from bs4 import BeautifulSoup
# imports for waiting on and interacting with the loaded page
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

# initiate the virtual display; Firefox will render into it
display = Display(visible=0, size=(800, 600))
display.start()
try:
    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox()
    browser.get('https://www.bing.com')
    print(browser.title)  # should print out "Bing"
    # explicit wait: block up to 10 s until the search box is present
    # (the Bing search box input is named "q")
    search = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.NAME, 'q')))
    search.send_keys("python headless browser scrape")
    search.send_keys(Keys.RETURN)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for a in soup.find_all('a'):
        print(a.attrs.get('href'))
finally:
    browser.quit()
    display.stop()  # Xvfb may print some output here; ignore it.
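One gap worth noting: after Keys.RETURN, page_source may be grabbed before Bing has rendered the results. A hedged fix is one more explicit wait before parsing; note that "b_results" is my assumption for the id of Bing’s results container and should be verified against the live page markup.
# wait for the results container before parsing; "b_results" is an assumed
# id for Bing's results list -- check it against the current page markup
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'b_results')))
soup = BeautifulSoup(browser.page_source, 'html.parser')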