Scraping a JavaScript-dependent website with Puppeteer

Many business websites today rely on JavaScript to protect their content from web scraping and other unwanted bot visits. In this article we cover both the theory and the practice of scraping JS-dependent/JS-protected websites.

Problem scraping JavaScript site – help needed

Problem

I am trying to scrape the page https://tienda.mercadona.es/categories/112. I have installed Docker and followed all the required steps given in the post. Splash works well, but the spider does not, and I don’t know why. The IP in splash_url is correct, but when I run scrapy shell “webpage”, the response object does not contain the complete page, i.e. the page has not rendered correctly.
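
One way to narrow a problem like this down is to ask Splash for the rendered page directly, bypassing the spider, and give the JavaScript some time to run. The sketch below only builds the URL for Splash's render.html endpoint with its url and wait arguments. The base address http://localhost:8050 and the 2-second wait are assumptions for illustration, not values from the question.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base: str, page_url: str, wait: float = 2.0) -> str:
    """Build a Splash render.html URL that waits `wait` seconds before returning HTML."""
    # `url` is the page to render; `wait` gives the page's JavaScript time to run.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # Assumed Splash address; replace it with the address used in splash_url.
    print(splash_render_url("http://localhost:8050",
                            "https://tienda.mercadona.es/categories/112"))
```

Opening the printed URL in a browser shows what Splash itself returns. If the page is complete there but not in the spider's response, the spider configuration is the more likely culprit; if it is incomplete there too, a longer wait may be needed.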

Download a file from a link in Python

I recently got a question that looked like this: how do I download a file from a link in Python?

“I need to go to every link which will open a website and that would have the download file “Export offers to XML”. This link is javascript enabled.”

Let us consider how to get a file from a JS-driven link using Python:
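
A minimal sketch of the download step, assuming the real file URL behind the JavaScript link has already been found (for example in the browser's network tab). The function name download_file is hypothetical; it uses only the standard library and streams the response to disk.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Stream the resource at `url` into the file `dest` and return `dest`."""
    # copyfileobj copies in chunks, so large files are not held in memory at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

If the link only produces the file after a script runs, this plain request will not be enough on its own, and a rendering tool is needed to discover or trigger the actual download URL first.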

How do I get past a dynamic “load more” button?

Recently I got a question:

How do I get past the dynamic “load more” button using a Python web scraper?
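
A common situation, though not a universal one, is that the “load more” button simply requests the next page of results from the server. Under that assumption, a scraper can skip the button and request the pages itself until an empty one comes back. In the sketch below, fetch_page is a hypothetical callable supplied by the caller that returns the items for a given page number.

```python
from typing import Callable, Iterator, List


def iter_all_items(
    fetch_page: Callable[[int], List[dict]],
    start_page: int = 1,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty."""
    page = start_page
    # max_pages is a safety limit so a misbehaving endpoint cannot loop forever.
    while page < start_page + max_pages:
        items = fetch_page(page)
        if not items:
            break
        yield from items
        page += 1
```

Passing the fetch function in keeps the loop independent of any particular HTTP library and makes it easy to try out with fake data.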

JavaScript rendering library for scraping javascript sites

A huge number of scraping tools are at our disposal today. Scraping has a long history, and it has finally become simple and available in many programming languages. Still, some non-trivial tasks cannot be solved in a snap.

One of these tasks is scraping JavaScript sites, that is, sites that output their data using JavaScript. Faced with such a site, most classic scrapers (though not all) ignore the JS-generated data and carry on. When this small defect grew into a real problem, developers around the world responded. Today we look at one of the best tools for scraping JS-generated data: Splash.

Web scraping with JavaScript

Is it possible to scrape an HTML page with JavaScript from inside of a web browser?

To be perfectly honest I wasn’t sure so I decided to try it out.

Full disclosure: I didn’t actually succeed. However, it was a great learning experience for me, and I think you could benefit from seeing what I did and where I went wrong. Who knows, maybe you can take what I’ve done and figure it out for yourself!

DOM elements number counter and sum up

I want to share a handy utility for quickly summing the values of multiple DOM elements. Why? Well, suppose you’re on a page like this and you want to sum up the total number of hotels in all the countries.
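
The same idea can be sketched offline in Python, as an illustration rather than the utility itself: collect the text of every element carrying a given CSS class and add up the numbers found there. The class name count and the sample markup in the usage line are made up for this example.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect text found inside elements that carry a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # > 0 while we are inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data)


def sum_elements(html: str, css_class: str) -> int:
    """Sum all integers (commas allowed as thousands separators) in matching elements."""
    collector = ClassTextCollector(css_class)
    collector.feed(html)
    total = 0
    for text in collector.texts:
        for match in re.findall(r"\d[\d,]*", text):
            total += int(match.replace(",", ""))
    return total
```

For example, sum_elements('<li>Spain <span class="count">1,200</span></li><li>Italy <span class="count">300</span></li>', "count") adds 1200 and 300 and ignores the country names outside the matching elements.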