Scrapinghub review

Scrapinghub is a developer-focused web scraping platform. It provides tools and services to extract structured information from online sources. Its major tools include Scrapy Cloud, Crawlera, and Splash. We decided to try the service. In this post we’ll review its main functionality and share our experience with Scrapinghub.

Scrapy Cloud

If you have worked in the web scraping industry (especially in development), you have probably heard of Scrapy, the open source data extraction framework. With Scrapy you can create, run, and manage web crawlers easily. For the heavy lifting around scraping (e.g. manual server operations, periodic actions, maintenance, etc.), Scrapinghub’s Scrapy Cloud automates and visualizes your Scrapy spiders’ activities.

However, Scrapy Cloud can limit how you scrape data from websites. It has some built-in tools that you can use to extract information. If you host Scrapy on your own, you can use the Python-based framework to write and run spiders more effectively.

Scrapy Cloud Pricing

Price range for Scrapy Cloud goes from free to $300 per month.

  • The free plan allows you to run only one concurrent crawl.
  • The $25 and $50 plans support 4 concurrent crawls. This scales to 8 and 16 concurrent crawls for $150 and $350 respectively. Higher-priced packages include additional benefits.
  • CPU and RAM allocations vary from plan to plan. For example, the $25/mo plan gives you only shared access to the server’s RAM, while the $50/mo plan gives you 1.3GB of RAM.
  • The free plan retains your scraped data for 7 days. You can extend this period to 120 days by purchasing any paid plan.

Crawlera

Your spiders may get banned by some web servers during crawling, which is frustrating because it hampers data extraction. Scrapinghub’s Crawlera is a solution to the IP ban problem. The service routes your spiders’ requests through thousands of different IP addresses from more than 50 countries. If a request gets banned on a specific IP, Crawlera retries it from another IP. Crawlera can detect 130+ ban types, server responses, and captchas, and takes the appropriate action (changing IPs, slowing down the request rate, etc.). It adapts its behavior to minimize IP bans. In the worst case, when the target server continuously rejects crawling requests, the system halts crawling.

We couldn’t find out exactly how many failed extraction attempts lead to Crawlera giving up, but generally it depends on the overall setup (e.g. the Splash browser timeout limit, the Scrapy Cloud package’s specifications, etc.).
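To make the rotate-on-ban idea concrete, here is a minimal sketch in plain Python. It is not Crawlera's implementation: the ban status codes, the `max_attempts` limit, the proxy addresses, and the `fake_fetch` stand-in are all assumptions we made up for illustration.

```python
import itertools
from typing import Callable, Iterable, Optional

# Hypothetical status codes treated as a "ban" in this sketch.
BAN_STATUSES = {403, 429, 503}


def fetch_with_rotation(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], int],
    max_attempts: int = 5,
) -> Optional[str]:
    """Try `url` through successive proxies until one is not banned.

    `fetch(url, proxy)` returns an HTTP status code. The function returns
    the proxy that succeeded, or None once `max_attempts` attempts have
    all been banned.
    """
    pool = itertools.cycle(proxies)
    for _ in range(max_attempts):
        proxy = next(pool)
        status = fetch(url, proxy)
        if status not in BAN_STATUSES:
            return proxy
    return None  # give up: the target keeps rejecting requests


def fake_fetch(url: str, proxy: str) -> int:
    """Stand-in for a real HTTP call: only the third proxy is not banned."""
    return 200 if proxy == "10.0.0.3:8000" else 403


pool = ["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"]
winner = fetch_with_rotation("https://example.com", pool, fake_fetch)
blocked = fetch_with_rotation("https://example.com", pool[:2], fake_fetch)
```

Here `winner` is the third proxy, while `blocked` is `None` because every attempt through the two remaining proxies is rejected. A real service would also slow down and tell ban types apart, which this sketch leaves out.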

Crawlera supports both HTTP and HTTPS proxies. The service is available as an add-on in Scrapy Cloud as well as a standalone product. The cost ranges from $99 per month (Crawlera Basic) up to $500 per month, with negotiable enterprise pricing also available.
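Because the service is exposed as an ordinary HTTP/HTTPS proxy, any HTTP client that supports proxies can use it. The sketch below shows the general shape with Python's standard library only. The host `proxy.example.com`, the port `8010`, and the scheme of sending the API key as the proxy username with an empty password are assumptions for illustration, so check your account's setup page for the real values.

```python
import base64
import urllib.request
from typing import Dict, Tuple


def proxy_settings(api_key: str, host: str, port: int) -> Tuple[Dict[str, str], str]:
    """Return a proxy mapping and a Basic Proxy-Authorization header value."""
    proxy_url = f"http://{host}:{port}"
    # Assumption: the API key is the username and the password is empty.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return {"http": proxy_url, "https": proxy_url}, f"Basic {token}"


def make_opener(
    api_key: str,
    host: str = "proxy.example.com",  # hypothetical proxy host
    port: int = 8010,                 # hypothetical proxy port
) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS requests via the proxy."""
    proxies, auth = proxy_settings(api_key, host, port)
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    opener.addheaders.append(("Proxy-Authorization", auth))
    return opener


opener = make_opener("YOUR_API_KEY")
# opener.open("http://example.com") would now go through the proxy.
```

No request is sent until `opener.open(...)` is called, so the snippet is safe to run without an account.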

Splash

Splash is another Scrapinghub feature. It’s an open source JavaScript rendering service developed by Scrapinghub. Web pages that rely on JS can be scraped more reliably using the Splash browser. It can process multiple pages in parallel.

Using Splash you can:

  • Process HTML requests
  • Write scripts in the Lua programming language for more customized browsing
  • Take screenshots, etc.

Splash also supports ad blocker rules to speed up rendering. Splash scripts run in a sandbox environment by default, but you can disable these restrictions with a simple command. The default timeout of the Splash browser is 30 seconds, which can cause problems with longer scripts and slower websites. However, this limit can be changed as well.

You can find more details about this headless, scriptable browser with an HTTP API in the official Splash documentation. A premium subscription is required to use Splash on Scrapy Cloud.
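As an illustration of changing the timeout per request, the sketch below builds a URL for Splash's `render.html` endpoint with explicit `timeout` and `wait` arguments. It assumes a Splash instance listening on `http://localhost:8050`, and the chosen values are arbitrary. As far as we know, the server also enforces its own maximum timeout, so a request asking for more than that is rejected unless the server is started with a higher limit.

```python
from urllib.parse import urlencode


def splash_render_url(
    page_url: str,
    splash_base: str = "http://localhost:8050",  # assumed local Splash instance
    timeout: float = 60.0,  # seconds allowed for the whole render
    wait: float = 2.0,      # seconds to wait after the page loads
) -> str:
    """Build a render.html request URL for a Splash instance."""
    query = urlencode({"url": page_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


request_url = splash_render_url("https://example.com/slow")
```

Fetching `request_url` with any HTTP client should return the rendered HTML of the target page, provided the Splash instance is running.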

Scrapinghub Support

Scrapinghub offers several support channels: email, forums, Twitter, and dashboard-based messaging. I emailed them some questions, but support hadn’t responded even after two weeks. I also visited their support forums. There were 15 posts on the first page, and only 3 of them had comments. Most questions got replies only a week after they were posted. I also sent a message to the Scrapinghub team via the dashboard messaging system, and it took a day to get a response. They seem to pay more attention to development than to support. 🙂

Conclusion

Scrapinghub, as a web service for web developers, is a good playground to host and run custom scrapers. I wish the service had better documentation and quicker forum support. The paid features, JS rendering (Splash) and IP rotation (Crawlera), can help you assess the full strength of Scrapinghub and enrich the web scraping experience.

6 replies on “Scrapinghub review”

Great in depth review of Scraping Hub!

I have used this service and I’d like to give my two cents:

PROS:
* Free to try with a decent number of features available
You can quickly evaluate the potential of the various tools. You can test them extensively.
* The Portia UI tool is useful, especially for non-Python programmers
I’m a Java developer. I wanted to test this Python framework. Portia saved me here.
* Support
I was on a free plan. The ScrapingHub team has continuously answered all my questions.

CONS:
* Portia tends to lag with real-life projects
Even with Portia 2.0, some projects I tried were difficult to create.
* The Spider options are quite hidden
I spent nearly two hours finding the option for limiting the crawl depth.
* Portia responsiveness
On my laptop I find it sometimes difficult to navigate in the Portia GUI: my screen was simply too small for the GUI.

IMO, use a large and wide screen for your own comfort if you plan to use Portia. Overall, ScrapingHub is a promising platform for developers.

Stephan, thank you for your useful contribution into the review! By the way, do you think you might write a post on the web scraping/{data mining} subjects, posts having tech or review nature?

>do you think you might write a post on the web scraping/{data mining} subjects, posts having tech or review nature?
Sure. Please drop me an email with more details 😉

Something worth mentioning: Splash use is not free with ScrapingHub, it costs at least $20 a month for an instance. So if you need any JS scraping, it’s not for you.

I have spent over a week on this as a novice scraper. My spiders were OK’d, but runs were either terminated because of no data, or simple scrapes went on too long. It was most frustrating because few intelligible hints were issued. In particular, URLs identified on a start page were not linked and subsequent scrapes were empty! Poor documentation too.

This is regarding Crawlera (mostly) and SH as a whole.
1. Unintuitive, user-unfriendly, hard-to-understand billing system.
2. No refunds!!!
3. Slow on popular sites like FB, bing, yelp due to small proxy pool.
4. Poor support.
Right now I’m running 2 spiders on Bing.
Crawlera with 50 threads has a speed of 40 rpm.
Scraper API with 15 threads – 222 rpm.
That’s all you need to know about Crawlera.

P.S. It used to be super cool when the free tier provided 24 hours of run time (I guess it was even 48 once). But now, with a 1-hour free tier, it is good for testing or small projects only.
