Mozenda is a cloud web scraping service (SaaS) that we have already reviewed. Since our last review, Mozenda has added more utility features for data extraction. Besides multi-threaded extraction and smart data aggregation, Mozenda lets users publish extracted data to cloud storage such as Dropbox, Amazon, and Microsoft Azure. In this post we will explain the new Mozenda extraction and integration capabilities.
In this post we want to show you the code for an automatic connection to the 2captcha service for solving Google reCaptcha v2.0. Not long ago, Google made its user-behavior reCaptcha (v2.0) drastically more complicated. The 2captcha online service provides a method for solving it.
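As a rough illustration of what such a connection involves, here is a minimal sketch that only builds the request URLs and interprets the replies, without sending anything over the network. The endpoint paths (`in.php`, `res.php`), the parameter names (`key`, `method=userrecaptcha`, `googlekey`, `pageurl`, `json`) and the `CAPCHA_NOT_READY` reply reflect our understanding of the service's HTTP interface and should be checked against its current documentation. The helper names `build_submit_url`, `build_poll_url` and `parse_reply` are hypothetical, chosen for this example only.

```python
from urllib.parse import urlencode

# Assumed endpoints of the 2captcha HTTP interface; verify against the service docs.
SUBMIT_ENDPOINT = "https://2captcha.com/in.php"
RESULT_ENDPOINT = "https://2captcha.com/res.php"


def build_submit_url(api_key, site_key, page_url):
    """Build the URL that submits a reCaptcha v2 task (hypothetical helper)."""
    params = {
        "key": api_key,            # your account API key
        "method": "userrecaptcha",
        "googlekey": site_key,     # the site key found on the target page
        "pageurl": page_url,       # the page where the reCaptcha appears
        "json": 1,                 # ask for a JSON reply
    }
    return SUBMIT_ENDPOINT + "?" + urlencode(params)


def build_poll_url(api_key, task_id):
    """Build the URL that asks whether a submitted task is solved yet."""
    params = {"key": api_key, "action": "get", "id": task_id, "json": 1}
    return RESULT_ENDPOINT + "?" + urlencode(params)


def parse_reply(reply):
    """Interpret a decoded JSON reply as (done, value).

    Assumes replies shaped like {"status": 1, "request": "..."} on success and
    {"status": 0, "request": "CAPCHA_NOT_READY"} while the task is pending.
    Any other reply is treated as an error.
    """
    if reply.get("status") == 1:
        return True, reply["request"]
    if reply.get("request") == "CAPCHA_NOT_READY":
        return False, None
    raise RuntimeError("service reported an error: %r" % reply.get("request"))
```

In a real scraper, the submit URL would be requested once, the returned task id would be polled every few seconds with the poll URL, and the final token would be placed into the form's `g-recaptcha-response` field before submitting the page.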
Octoparse is a modern visual web data extraction tool. It gives users a point-and-click UI for developing extraction patterns, so that scrapers can apply these patterns to structured websites. Both experienced and inexperienced users find it easy to use Octoparse to bulk extract information from websites. For most scraping tasks, no coding is needed!
We’ve already written about suitable proxy servers for web scraping. Now we want to focus on proxies for scraping huge quantities of data records, particularly from business directories. When you scrape a business directory, its web server can spot the repetitive requests and block or throttle you based on the IP address that is sending the frequent HTTP requests. A proxy rotation web service is a means of repeatedly changing that IP address. Thus, at each request the target web server sees only a random IP address from the rotating proxy pool.
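The rotation idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular rotation service works: it shuffles a pool once and then cycles through it, so consecutive requests never reuse the same address until the pool is exhausted. The `ProxyRotator` class and the addresses in `PROXY_POOL` are hypothetical placeholders (the addresses come from a range reserved for documentation).

```python
import itertools
import random


class ProxyRotator:
    """Hands out a different proxy for each request from a fixed pool."""

    def __init__(self, proxies, seed=None):
        pool = list(proxies)
        if not pool:
            raise ValueError("proxy pool must not be empty")
        # Shuffle once so the order is not predictable, then cycle forever.
        random.Random(seed).shuffle(pool)
        self.size = len(pool)
        self._cycle = itertools.cycle(pool)

    def next_proxy(self):
        """Return the next proxy URL in the rotation."""
        return next(self._cycle)

    def proxies_for_request(self):
        """Return a scheme-to-proxy mapping for the next request."""
        proxy = self.next_proxy()
        return {"http": proxy, "https": proxy}


# Placeholder pool; a real one would come from your proxy provider.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

rotator = ProxyRotator(PROXY_POOL, seed=42)
```

The mapping returned by `proxies_for_request()` has the shape accepted by `urllib.request.ProxyHandler` and by the `proxies=` argument of the `requests` library, so each outgoing request can be given a fresh one. A hosted rotation service does the same thing on its side, behind a single gateway address.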
Recently I was notified that the Kimono service is shutting down because the Kimono team is joining another project. Many data hunters who were using this prominent free API service are now in search of a good alternative.
Dexi.io is a powerful scraping suite. This cloud scraping service provides development, hosting and scheduling tools. The suite can be compared with Mozenda for building web scraping projects and running them in the cloud for user convenience. It also includes an API, with each scraper being a JSON definition, similar to other services like import.io, Kimono Labs and ParseHub.
From time to time, web users struggle with CAPTCHA solving services such as DeCaptcher and DBC. Although those services are reliable, they are often overloaded, meaning the images to be solved get rejected or take a long time to be decoded (some services might even take 50 seconds to solve a single image!).
But I recently came across a new service that hopes to fill this gap in fast CAPTCHA solving. EndCaptcha.com is a new image digitization service built to satisfy the needs of the most demanding consumers. It uses a dedicated team of operators assisted by a smart OCR system, which is why it is being considered a premium CAPTCHA service.
Most scraping solutions fall into two categories: visual scraping platforms targeted at non-programmers (Content Grabber, Dexi.io, Import.io, etc.), and scraping code libraries like Scrapy or PhantomJS, which require at least some knowledge of how to code.
Web Robots builds a scraping IDE that fills the gap in between. Code is not hidden but is instead made simple to create, run and debug.
Recently, import.io (a free online scraping tool) announced that they are adding another way to get data from the web: they’ll build it for you. This new “Data as a Service” program is targeted at businesses and organizations that need data but don’t have the time or resources to build datasets themselves with the import.io tool. For these clients, import.io will curate custom datasets based on their specific requirements, as well as develop custom data implementation solutions based on the organization’s in-house software.