There are two extreme approaches to building a web scraper: make it highly flexible and customizable but understandable only to IT gurus, or make it nice, simple and handy but limited in what it can do. Scraping software developers usually try to find a golden mean between the two. In this article I want to introduce you to a relatively new startup, import•io, which claims that anyone can scrape any data regardless of his or her IT skills.
Introduction
Here I will give you an overview of how to scrape our testing price list using import•io. The goal of this post is not to test the service against all the extreme cases we use in our test drive, but simply to get you acquainted with it. It’s not a detailed tutorial, but I hope that after going through it you will have a working knowledge of the service.
1. Downloading the client
Apart from the informational pages, their web site has only one page where you can do anything useful for your purpose: the page where you download the client.
It’s nice to see that the client is available for all the major operating systems: Windows, MacOS and Linux.
2. Creating an Extractor
The client you downloaded looks very much like an ordinary web browser. To use it, you need to be logged in; if you don’t have an account, you can sign up right from the program. The only thing I don’t love here is that it doesn’t remember your name and password (as ordinary browsers do), so you need to type them in each time.
To extract data from a single page, you need to create a so-called Extractor that converts a web page into a data set. The system guides you through the whole process of creating the extractor, so you literally only need to follow the steps shown in the bottom right corner. There are also quite a few tooltips and even tiny videos explaining the steps you’re guided through.
They call this process “training”: you train the scraper to recognize the information you want scraped by pointing it out on the web page. In this case you first point to the rows, and then to each type of information in a row: product title, description and price. After you train the scraper, you will see the scraped data in the table at the bottom.
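To make the idea of training more concrete, here is a minimal Python sketch of what "first the rows, then the columns" could mean in code. It is only a conceptual illustration, not how import•io works internally: the markup in SAMPLE_PAGE, the paths in ROW_PATH and COLUMN_PATHS, and the function extract_rows are all hypothetical names and placeholder data invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a price list page.
# The product names and prices are placeholders, not real scraped data.
SAMPLE_PAGE = """
<div id="catalog">
  <div class="product">
    <span class="title">Example product A</span>
    <span class="description">Placeholder description A</span>
    <span class="price">10.00</span>
  </div>
  <div class="product">
    <span class="title">Example product B</span>
    <span class="description">Placeholder description B</span>
    <span class="price">20.50</span>
  </div>
</div>
"""

# "Training" step 1: say where the rows are (assumed path).
ROW_PATH = ".//div[@class='product']"

# "Training" step 2: say where each column lives inside a row (assumed paths).
COLUMN_PATHS = {
    "title": "./span[@class='title']",
    "description": "./span[@class='description']",
    "price": "./span[@class='price']",
}


def extract_rows(page, row_path, column_paths):
    """Turn a page into a list of records, one dict per matched row."""
    root = ET.fromstring(page)
    records = []
    for row in root.findall(row_path):
        record = {}
        for name, path in column_paths.items():
            cell = row.find(path)
            # A missing or empty cell becomes None instead of raising.
            record[name] = cell.text.strip() if cell is not None and cell.text else None
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_rows(SAMPLE_PAGE, ROW_PATH, COLUMN_PATHS):
        print(record)
```

The point of the sketch is the division of labor: one rule finds the repeating rows, and a small set of per-column rules is applied inside each row. With import•io you supply the same information by clicking on the page instead of writing paths.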
3. Accessing your data
After you have finished the training, your extractor is created and ready to be uploaded to import•io. As soon as you upload it, you can access it from your client and extract data from any page similar to the one you trained the extractor on.
Here you can also download the data, put it into a dashboard to visualize it on a separate page, or even integrate with import•io by means of JavaScript, Java or a simple HTTP request.
For example, here is the curl command that retrieves the data from the extractor (this command was generated automatically by import•io itself):
curl -XPOST -H 'Content-Type: application/json' -d '{"input":{"webpage/url":"http://testing-ground.scraping.pro/blocks"}}' "https://api.import.io/store/connector/35038e11-96d0-40e3-9121-f6d7fe2b41f5/_query?_user=14afc61b-e3c7-4380-b484-19d8058dec82&_apikey=???"
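If you would rather make the same call from a script, the curl command above translates roughly into the Python sketch below, using only the standard library. The endpoint, query parameters and JSON body are copied from the generated command; the function names build_query_request and run_query are my own, the API key is a placeholder you have to fill in, and I am assuming (not confirming) that the service answers with JSON.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the curl command generated by import.io.
API_BASE = "https://api.import.io/store/connector"


def build_query_request(connector_id, user_id, api_key, page_url):
    """Build (but do not send) the POST request for an extractor query."""
    query = urllib.parse.urlencode({"_user": user_id, "_apikey": api_key})
    url = f"{API_BASE}/{connector_id}/_query?{query}"
    body = json.dumps({"input": {"webpage/url": page_url}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request):
    """Send the request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_query_request(
        connector_id="35038e11-96d0-40e3-9121-f6d7fe2b41f5",
        user_id="14afc61b-e3c7-4380-b484-19d8058dec82",
        api_key="YOUR_API_KEY",  # placeholder: use your own key
        page_url="http://testing-ground.scraping.pro/blocks",
    )
    print(request.get_method(), request.full_url)
    # Uncomment to actually send the request once a real key is in place:
    # print(run_query(request))
```

Building the request separately from sending it lets you inspect the URL and body before anything goes over the network.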
Conclusion
It’s obvious that the import•io team did a good job of making this service usable for non-programmers. Even if you know nothing about HTML, HTTP, regular expressions or XPath, you can still build crawlers and extractors with it. The other important thing is that the service is totally free (at least for now).
Of course, it would be interesting to test import•io with the more difficult cases of web scraping, but probably that is a topic for another article… stay tuned!
3 replies on “Import•io: the First Impression”
I have tried import.io and it’s awesome, but the CRAWLER feature needs more improvement. I stopped at step 6 of this tutorial (http://support.import.io/knowledgebase/articles/247570-create-a-crawler). The browser can’t load the next page to scrape the data. I don’t know why. I tried to contact the support team, but they suggested that I try the CONNECTOR feature. Overall, though, I like this software; it is very user-friendly for non-IT people like me.
Hi Jono.
Then you should try out dexi.io, the leading (by far) web data solution that can do just that AND sooo much more. Try it out for free and use the live support for any questions!
Best, Jacob
Aaaand the link 🙂 https://dexi.io/