Recently I received a question from one of my blog readers. After replying, I decided to share it with a wider audience.
Question:
Hi,
I came across your [web]scraping.pro site and found it very helpful, then realized the web scraper solutions rating was from 2014. What is the best solution for today? I have lots of sites I need to scrape, mainly search-then-drill-down sites. I would like to be able to schedule the scraping to run on a daily basis. Is there a direction you could point me in? I’m a seasoned developer by trade but am seeing all these point-and-click solutions (e.g. import.io) and am wondering if I should stick with Node.js or .NET, or if I should investigate some of these GUI scrapers of today.
Answer:
You are right, the scrapers’ evaluations were done about 5 years ago. Things have changed since then. Today I observe two main trends:
- Scrapers are moving to the cloud. They are becoming cloud-scraping services that provide multi-threaded crawling, storage options, and more. You can run many scraper instances at once in a cloud infrastructure (e.g. dexi.io, contentgrabber.com).
- Scraping frameworks are being wrapped into convenience suites. They are still meant for developers but are extended with all kinds of features: cloud script execution, cloud result storage, scaling on demand, and others. For example, the Scrapy framework is extended into the Scrapinghub platform, and webrobots.io provides a scraping IDE for JavaScript scraping robots (a minimal Scrapy sketch follows this list).
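To show what the underlying framework looks like before any cloud wrapping, here is a minimal Scrapy spider sketch. It targets quotes.toscrape.com (Scrapinghub's public practice site); the CSS selectors are illustrative and specific to that site's markup, so treat them as placeholders for your own targets.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal Scrapy spider: crawls quotes.toscrape.com and yields items."""

    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

You can run it locally with `scrapy runspider quotes_spider.py -o quotes.json`; the same spider can then be deployed to the Scrapinghub platform for cloud execution, result storage, and scheduled (e.g. daily) runs.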
For the most part, GUI or visual point-and-click scrapers have hit a ceiling in their functionality (open a page, click an item, find similar ones, turn that into a pattern, etc.).
Node.js, .NET, Python, and other stacks are still good for regular scraper development. That said, Python has the largest number of web scraping libraries.
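For a plain library-based scraper without a framework, a sketch using the widely used requests and BeautifulSoup libraries might look like the following. The URL and selectors are placeholders chosen for illustration; a production scraper would add retries, request headers, throttling, and error handling.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/"  # placeholder target site


def scrape(url):
    """Fetch a page and yield extracted records as dicts."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The selectors below depend entirely on the target site's markup
    for block in soup.select("div.quote"):
        yield {
            "text": block.select_one("span.text").get_text(strip=True),
            "author": block.select_one("small.author").get_text(strip=True),
        }


if __name__ == "__main__":
    for item in scrape(URL):
        print(item)
```

A script like this can then be scheduled to run daily with cron, Windows Task Scheduler, or a cloud job runner, which covers the scheduling requirement from the question.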
I recommend the WEB SCRAPING TOOLS AND SERVICES LANDSCAPE, where you can filter scraping tools by features (not all of them, but the most relevant ones).