Categories
Web Scraping Software

Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites which require running JavaScript code in order to generate a user-facing content. It will also be harder for anti-bot mechanisms to detect and block it.

Categories
Guest posting Web Scraping Software

UiPath PDF Data Extraction

UiPath, one of the big providers of robotic process automation software, has some very interesting positioning. Unlike the other players on the market, they provide a free and fully featured community edition of their product for anybody to test and develop. The tool automates any application and is packed with all the web scraping and screen scraping capabilities for both desktop and web.  The platform also has a lively community forum featuring jobs, automation contests and knowledge-sharing between UiPath users: www.forum.uipath.com.

Categories
Web Scraping Software

Mozenda web scraping and publishing of data to cloud storage

Mozenda is a cloud web scraping service (SaaS), and we’ve already reviewed it. Since our last review, Mozenda has provided more useful utility features for data extraction. Besides multi-threaded extraction & smart data aggregation, Mozenda allows users to publish extracted data to cloud storage such as Dropbox, Amazon, and Microsoft Azure. In this post we will try to explain the new Mozenda extraction and integration capabilities.

Categories
Development Web Scraping Software

The worthy alternative to dissolving scraping Kimono API

Recently I got notified of Kimono service finishing its work due to kimono team being joining another project. So many data hunters who were using this prominent free API service are now in search for a good alternative. 

Categories
Development Web Scraping Software

Dexi.io REST API in php (example)

In this post, I’d like to demonstrate how to leverage the Dexi.io (CloudScrape) API along with its PHP Client library (also avail in Ruby and C#).

Categories
Web Scraping Software

A tool to extract phone numbers from a list of URLs

Today I got a question from one of my readers asking if there is a good out-of-the-box solution for crawling multiple websites for contact information. 

Categories
Development Web Scraping Software

Content Grabber with free proxy account integration for business directories scrape

Professional data extraction requires adequate proxying to keep anonymity of scraping robots. When attempting to extract large data sets (over 1M records, ex. business directories) reliable and fast proxy service is needed.

Sequentum has released the Nohodo proxy service integration for Content Grabber. Nohodo provides a free account for Content Grabber users (up to 5000 requests monthly for free). The feature is available for both trial users and regular customers. Here’s how it works…

Categories
Featured Web Scraping Software

Dexi.io Review

dexi-medium-height-130pxDexi.io is a powerful scraping suite. This cloud scraping service provides development, hosting and scheduling tools. The suite might be compared with Mozenda for making web scraping projects and runnig them in clouds for user convenience. Yet it includes the API, each scraper being a json definition similar to other services like import.io, kimono lab and parseHub.

Categories
Development Miscellaneous Web Scraping Software

Import.Io Magic Method API

Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works be attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year. When we covered it back then, we noted a few issues:

  • The scraper only works on pages with more than one row of data like a search results page, category pages and etc.
  • It seems to have trouble with some javascript pages.

But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes.

Categories
Guest posting Web Scraping Software

Turn any interactive website into an API with ParseHub

parsehubAnyone should be able to pull data from the web and access it in the format they want. If a website does not have an API available, scraping is one of the only options to get the data you need. But figuring out how to scrape data in the complicated HTML is a pain.

ParseHub is a new web browser extension that you can use to turn any dynamic and poorly structured website into an API, without writing code. ParseHub is a scraping tool that is designed to work on websites with JavaScript and Ajax; it is similar to web scraping tools such as Import.io and Kimono Labs.