Scrapinghub is a developer-focused web scraping platform. It provides tools and services to extract structured data from online sources. Scrapinghub has three major tools: Scrapy Cloud, Crawlera, and Splash. We decided to try the service; in this post we'll review its main functionality and share our experience with Scrapinghub.
If you are involved in the web scraping industry (especially on the development side), you have probably heard of Scrapy, the open source data extraction framework. Scrapy makes it easy to create, run, and manage web crawlers. For the heavy lifting around scraping (manual server operations, periodic jobs, maintenance, and so on), Scrapinghub's Scrapy Cloud automates and visualizes the activity of your Scrapy spiders.
That said, Scrapy Cloud on its own limits how you can scrape data from websites: it provides some built-in tools for extracting information, but if you host Scrapy yourself, you can use the Python-based framework to write and run spiders with more flexibility.
Scrapy Cloud Pricing
Scrapy Cloud pricing ranges from free to $350 per month.
- The free plan allows you to run only one concurrent crawl.
- The $25 and $50 plans support 4 concurrent crawls. This scales to 8 and 16 concurrent crawls at $150 and $350 per month respectively, and the higher-priced packages add further benefits.
- CPU and RAM allocations vary from plan to plan. For example, the $25/mo plan only gives you shared access to the server's RAM, while the $50/mo plan provides 1.3 GB of RAM. Each plan gets a different amount of resources.
- The free plan retains your scraped data for 7 days; purchasing any paid plan extends the retention period to 120 days.
Your spiders may get banned by some web servers while crawling, which is frustrating because it halts data extraction. Scrapinghub's Crawlera is a solution to the IP ban problem: the service routes your spiders through thousands of different IP addresses, drawn from more than 50 countries. If a request gets banned on one IP, Crawlera retries it from another, keeping the crawl going. Crawlera can detect 130+ ban types, server responses, and captchas, and takes the appropriate action (changing IPs, slowing down the crawl, etc.), adapting its behavior to minimize IP bans. In the worst case, when the target server continuously rejects crawl requests, the system halts crawling.
We couldn't find out exactly how many failed extraction attempts lead Crawlera to give up; it generally depends on the overall setup (e.g., the Splash browser timeout limit, the Scrapy Cloud package's specifications, etc.).
Crawlera supports both HTTP and HTTPS proxies. The service is available as an add-on in Scrapy Cloud as well as a standalone product. Pricing ranges from $99 per month (Crawlera Basic) up to $500 per month, with negotiable enterprise plans available.
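Since Crawlera behaves like an ordinary HTTP proxy, plugging it into existing code mostly means pointing your client at the proxy endpoint. Here is a sketch using only the Python standard library; the `proxy.crawlera.com:8010` endpoint and the key-as-username scheme follow Crawlera's documented proxy interface, and the API key below is a placeholder:

```python
import urllib.request

CRAWLERA_HOST = "proxy.crawlera.com:8010"  # Crawlera's proxy endpoint


def crawlera_proxies(api_key: str) -> dict:
    """Build a proxy mapping that authenticates with the API key as username."""
    proxy_url = f"http://{api_key}:@{CRAWLERA_HOST}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_crawlera(url: str, api_key: str) -> bytes:
    """Fetch a URL with all traffic routed through Crawlera."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(crawlera_proxies(api_key))
    )
    with opener.open(url, timeout=60) as resp:
        return resp.read()


if __name__ == "__main__":
    # Placeholder key; a real key comes from your Crawlera dashboard.
    print(crawlera_proxies("<API_KEY>"))
```

The same proxy mapping works with any HTTP client that honors proxy settings; Scrapy users would typically use the official middleware instead.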
Splash is Scrapinghub's headless, scriptable browser, exposed as an HTTP API. Using Splash you can:
- Process HTML requests
- Write scripts in the Lua programming language for more customized browsing
- Take screenshots, etc.
Splash also supports ad-blocker rules to speed up rendering. By default, Splash scripts run in a sandboxed environment, but you can lift these restrictions with a simple setting. The default timeout of the Splash browser is 30 seconds, which can cause problems with longer scripts and slower websites; this limit can be changed as well.
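For instance, the timeout can be raised per request via the `timeout` argument of Splash's `render.html` endpoint. A minimal sketch using the standard library, assuming a Splash instance listening on its default port 8050 on localhost:

```python
import urllib.parse
import urllib.request

SPLASH = "http://localhost:8050"  # assumed local Splash instance


def render_url(page_url: str, wait: float = 0.5, timeout: int = 60) -> str:
    """Build a render.html request that waits for JS and raises the timeout."""
    params = urllib.parse.urlencode(
        {"url": page_url, "wait": wait, "timeout": timeout}
    )
    return f"{SPLASH}/render.html?{params}"


def fetch_rendered_html(page_url: str) -> str:
    """Fetch the JavaScript-rendered HTML of a page through Splash."""
    with urllib.request.urlopen(render_url(page_url)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(render_url("http://example.com"))
```

Note that the server caps how high `timeout` may go, so very long timeouts also require starting the Splash instance with a higher cap.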
You can find more details about this headless, scriptable browser with an HTTP API on the official page. A premium subscription is required to use Splash on Scrapy Cloud.
Scrapinghub offers several support channels: email, forums, Twitter, and dashboard-based messaging. I emailed them some questions, but support didn't respond even after two weeks. I also visited their support forums: of the 15 posts on the first page, only 3 had comments, and most questions received replies about a week after being posted. A message I sent to the Scrapinghub team via the dashboard messaging system took a day to get a response. They seem to pay more attention to development than to support. 🙂
As a web service for web developers, Scrapinghub is a good playground for hosting and running custom scrapers. I wish the service had better documentation and quicker forum support. Paid features such as JS rendering (Splash) and IP rotation (Crawlera) let you tap the full strength of Scrapinghub and enrich the web scraping experience.