Current trends in web scraping tools

 

Recently I got a question from one of the blog readers. After I replied to it, I decided to share it with a wider audience.
Question:

Hi,

I found your [web]scraping.pro site and found it very helpful, then realized the web scraper solutions rating was from 2014.  What is the best solution for today?   I have lots of sites I need to scrape, mainly search then drill-down sites.   I would like to be able to schedule the scraping to run on a daily basis.  Is there a direction you could point me?  I’m a seasoned developer by trade but am seeing all these point and click solutions (e.g. import.io) and am wondering if I should stick with Node.JS or .NET or if I should investigate some of these GUI scrapers of today.

Answer:

You are right, the scrapers’ evaluations were done about 5 years ago. Things have changed since then. Today I observe 2 main trends:
  1. Scrapers are moving to the cloud. They are becoming cloud scraping services that provide multi-threaded crawling, storage options, and more. One can run many scraper instances at once in a cloud infrastructure (e.g. dexi.io, contentgrabber.com).
  2. Scraping frameworks are being wrapped in convenience suites. They are still meant for developers, but are extended with features such as cloud script execution, cloud result storage, and on-demand scaling. For example, the Scrapy framework is extended by the Scrapinghub platform, and webrobots.io provides a scraping IDE for JS scraping robots.

GUI (visual point-and-click) scrapers have, for the most part, hit a ceiling in their functionality: open a page, click an item, find similar ones, build a pattern from them, and so on.

Node.JS, .NET, Python and other tools are still good for regular scraper development. Python, however, has the largest number of web scraping libraries.
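To give a feel for what a hand-written scraper for a "search then drill-down" site might look like, here is a minimal Python sketch that uses only the standard library. It is an illustration under assumptions, not a recommendation of a specific design: the `result` CSS class, the choice of the first `<h1>` as the item title, and the names `LinkCollector`, `TitleExtractor`, `fetch` and `scrape` are all hypothetical and would need to be adapted to the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


class TitleExtractor(HTMLParser):
    """Capture the text of the first <h1> on a detail page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <h1> is recorded; later ones are ignored.
        if tag == "h1" and self.title is None:
            self._in_h1 = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def fetch(url):
    """Download a page as text (assumes UTF-8; real sites may differ)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(search_url, fetch_page=fetch, result_class="result"):
    """Search step: collect result links. Drill-down step: visit each one."""
    collector = LinkCollector(result_class)
    collector.feed(fetch_page(search_url))
    records = []
    for href in collector.links:
        detail_url = urljoin(search_url, href)  # resolve relative links
        extractor = TitleExtractor()
        extractor.feed(fetch_page(detail_url))
        records.append({"url": detail_url, "title": (extractor.title or "").strip()})
    return records
```

The `fetch_page` parameter makes it possible to try the logic on saved pages before pointing it at a live site. For daily runs, a script like this could be started by the operating system's scheduler (cron, for instance), which is roughly the part that the cloud platforms mentioned above take care of for you.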

I recommend the Web Scraping Tools and Services Landscape, where you can filter scraping tools by feature (not all features, but the most relevant ones).
