Import•io: the First Impression

There are two extreme approaches to building a web scraper: make it highly flexible and customizable but usable only by IT gurus, or make it nice, simple and handy but limited in what it can do. Scraping software developers usually try to find a golden mean between the two. In this article I want to introduce you to a relatively new startup, import•io, which claims that anyone can scrape any data regardless of his or her IT skills.

Introduction

Here I will give you an overview of how to scrape our testing price list using import•io. The goal of this post is not to test the service against all the extreme cases we use in our test drive, but simply to get you acquainted with it. It’s not a detailed tutorial, but I hope that after going through it you will have a working knowledge of the service.

1. Downloading the client

Apart from the informational pages, their web site has only one page where you can do anything useful for your purpose: the page for downloading a client.
Import.io Start Page
It’s nice to see that the client is available for all the major operating systems: Windows, MacOS and Linux.

2. Creating an Extractor

The client you downloaded looks and works much like an ordinary web browser. To use it, you need to be logged in; if you don’t have an account, you can sign up right from the program. The only thing I don’t love here is that it doesn’t remember your user name and password (as ordinary browsers do), so you need to type them in each time.

Import.io Client View
To extract data from a single page, you need to create a so-called Extractor, which converts a web page into a data set. The system guides you through the whole process of creating the extractor, so all you need to do is follow the steps shown in the bottom right corner. It’s also worth noting that there are quite a few tooltips and even short videos explaining the steps you’re guided through:

Import.io Building an Extractor

They call this process “training”: you train the scraper to recognize the information you want scraped by pointing it out on the web page. In this case you first point to the rows, and then to each type of information in a row: product title, description and price. After you train the scraper, you will see the scraped data in the table at the bottom:

Import.io Training

3. Accessing your data

Once you have finished the training, your extractor is created and ready to be uploaded to import•io. As soon as you upload it, you can access it from your client and extract data from any page similar to the one you trained your extractor on:

Import.io Data Set

Here you can also download the data, put it into a dashboard to have it visualized on a separate page, or even integrate with import•io by means of JavaScript, Java or a simple HTTP request.

For example, here is the curl command that retrieves the data from the extractor (this command was generated automatically by import•io itself):

curl -XPOST -H 'Content-Type: application/json' -d '{"input":{"webpage/url":"http://testing-ground.scraping.pro/blocks"}}' "https://api.import.io/store/connector/35038e11-96d0-40e3-9121-f6d7fe2b41f5/_query?_user=14afc61b-e3c7-4380-b484-19d8058dec82&_apikey=???"
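If you would rather issue the same request from a script, it can be put together with Python’s standard library. What follows is only a minimal sketch of the curl command above, not an official import•io client: the endpoint and the two IDs are copied from that command, while the helper name build_query_request and the "YOUR_API_KEY" placeholder are my own assumptions. The sketch builds the request without sending it, because the format of the response is not covered in this post.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and user ID copied verbatim from the curl command above.
BASE_URL = (
    "https://api.import.io/store/connector/"
    "35038e11-96d0-40e3-9121-f6d7fe2b41f5/_query"
)
USER_ID = "14afc61b-e3c7-4380-b484-19d8058dec82"


def build_query_request(page_url, api_key):
    """Build (but do not send) the POST request that the curl command makes.

    build_query_request is a hypothetical helper name used for illustration.
    """
    # Query-string parameters, in the same order as in the curl command.
    query = urllib.parse.urlencode({"_user": USER_ID, "_apikey": api_key})
    # JSON body naming the page the extractor should be applied to.
    body = json.dumps({"input": {"webpage/url": page_url}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; use the key from your own account.
    req = build_query_request(
        "http://testing-ground.scraping.pro/blocks", "YOUR_API_KEY"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually run the query you would pass the request to urllib.request.urlopen() and read the response. Keeping the building and sending steps apart makes it easy to check the URL and body first.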

Conclusion

It’s obvious that the import•io team did a good job of making this service usable for non-programmers. Even if you know nothing about HTML, HTTP, regular expressions or XPath, you can still build crawlers and extractors with it. The other important thing is that the service is totally free (at least for now).

Of course, it would be interesting to test import•io on more difficult web scraping cases, but that is probably a topic for another article… stay tuned!
