Python requests vs urllib2 for JS-stuffed website scrape

Question:

The Python requests library is useful and has many advantages over similar libraries. However, when I tried to retrieve a Wikipedia page, requests.get() retrieved it only partially:

import requests

response = requests.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax', verify=False)
html = response.text

I tried it using urllib2, and urllib2.urlopen retrieved the same page completely:

import urllib2  # Python 2 only; the Python 3 equivalent is urllib.request
html = urllib2.urlopen('https://en.wikipedia.org/wiki/Talk:Land_value_tax').read()

Why does this happen, and how can one solve it using requests?

Answer:

It seems to me that the problem lies in the scripting on the target page. Part of the content is rendered by JavaScript (in particular, I found calls to MediaWiki). A web sniffer makes these calls visible.

What to do? If you want to retrieve the whole page content, it is best to plug in a library that evaluates on-page JavaScript. Python libraries for this include selenium, dryscrape, and PyV8.
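As a rough illustration of the selenium option, here is a minimal sketch that loads a page in a headless browser and returns the DOM after scripts have run. It assumes selenium and a Chrome/Chromium installation are available; the function name fetch_rendered_html and its optional driver parameter are hypothetical and not part of the original answer.

```python
def fetch_rendered_html(url, driver=None):
    """Return the page source after the browser has executed on-page JavaScript.

    `driver` is an optional, already-created WebDriver-like object
    (hypothetical parameter, handy for reuse or for testing with a stub).
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so the function can be used with a stub driver
        # even when selenium is not installed (pip install selenium).
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # run without opening a window
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)            # the browser loads the page and runs its scripts
        return driver.page_source  # serialized DOM as the browser currently sees it
    finally:
        if own_driver:
            driver.quit()          # only close a browser this function started
```

Usage would be something like html = fetch_rendered_html('https://en.wikipedia.org/wiki/Talk:Land_value_tax'). Content that scripts load asynchronously may need an explicit wait before page_source is read.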

Later, the asker added a comment:

I am not interested in retrieving the whole page and statistics or JS libraries retrieved from MediaWiki. I only need the whole content of the page (through scraping, not MediaWiki API).

The issue is that those JS calls to other resources (including MediaWiki) are what allow a browser to render the whole page for the client. The requests library does not execute JavaScript, so the scripts never run, the page parts they would fetch from other resources are never loaded, and the target page does not come back as complete as it would appear in a browser.
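To see which external scripts a non-JS client leaves unexecuted, one can list the script URLs found in the HTML it fetched. The following is a minimal sketch using only the standard library; the names ScriptSrcCollector and external_scripts and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


def external_scripts(html):
    """Return the URLs of external scripts referenced by an HTML string."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return collector.sources


# Illustrative markup only, not copied from the real page.
sample = (
    '<html><head><script src="/w/load.php?modules=startup"></script></head>'
    "<body><script>var x = 1;</script><p>Static text</p></body></html>"
)
print(external_scripts(sample))
```

Passing response.text from requests.get() to external_scripts() would show the scripts that a browser would fetch and run but that requests simply ignores.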

One reply on “Python requests vs urllib2 for JS-stuffed website scrape”

With just about all websites now rendered mainly with JavaScript, the future of scraping certainly seems to lie with complete web engines such as QtWebEngine. Chromium itself can now be run in headless mode for such tasks.
