Import.Io Magic Method API

Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works by attempting to scrape all the information off the page automatically, in one shot. We covered it in another post early last year, and at the time we noted a few issues:

  • The scraper only works on pages with more than one row of data, such as search results pages and category pages.
  • It seems to have trouble with some JavaScript pages.

But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes.

How to form the magic request based on the api_key

After a few attempts and some help from Import.io’s support, I figured out how to form the request to the Magic API. I recommend using the PHP URL signing code; it helped me quite a bit.
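The signing function later in this post treats the api_key as a single string of the form userGuid:signingKey, with the signing key base64-encoded. As a minimal sketch of that first step, the helper below splits the key and rejects malformed input before any request is built. The name splitApiKey and the dummy credentials are hypothetical and are not part of Import.io's code.

```php
<?php
// Hypothetical helper (not from Import.io): splits "userGuid:signingKey"
// and decodes the base64 signing key, failing loudly on malformed input.
function splitApiKey(string $apiKey): array
{
    // Base64 never contains ":", so a limit of 2 is purely defensive.
    $parts = explode(':', $apiKey, 2);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        throw new InvalidArgumentException('Expected "userGuid:signingKey"');
    }

    // Strict mode makes base64_decode() return false on invalid characters.
    $rawKey = base64_decode($parts[1], true);
    if ($rawKey === false) {
        throw new InvalidArgumentException('Signing key is not valid base64');
    }

    return ['userGuid' => $parts[0], 'rawKey' => $rawKey];
}

// Dummy credentials, for illustration only.
$creds = splitApiKey('00000000-0000-0000-0000-000000000000:' . base64_encode('secret'));
echo $creds['userGuid'], PHP_EOL;
```

Checking the key up front means a copy-paste mistake surfaces as a clear exception rather than as a rejected signature from the API.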

Use regionText to select the region (table) to process by a keyword it contains

The basic Magic (the one most people use) only extracts data from what it thinks is the most important list of data on the page. But suppose I want data from a table that isn’t in the main part of the page. Let’s take the London Stock Exchange web page. To locate a table, I use the regionText parameter, which tells Magic to process the area containing a certain phrase. For more on how to use regionText, see the docs here. In this example, the regionText phrase would be “ftse 350 share”:

[Screenshot: magic_api_table]

Here is the final code for the Magic method:

function signMagic($urlParameter, $apiKey, $method, $expiry, $regionText, $js)
{
    // extract the userGuid and signing key
    $parts = explode(":", $apiKey);
    $userGuid = $parts[0];
    $key = $parts[1];
    $url = "https://api.import.io/store/data/_magic?regionText=" . urlencode($regionText) . "&_mine=true";

    // form the URL to be signed
    $newUrl = $url . '&_user=' . $userGuid . "&_expiry=" . $expiry . "&url=" . $urlParameter . '&js=' . $js;

    // we sign the http verb + the url with the user and expiry information
    $check = $method . " " . $newUrl;

    // use a sha1 hmac algorithm for the digest & base 64 encode it
    $digest = base64_encode(hash_hmac("sha1", $check, base64_decode($key), true));

    // add the digest onto the end of the URL
    return $newUrl . "&_digest=" . urlencode($digest);
}

$uid_api_key = '44444444-cccc-bbbb-aaaa-d38786f5e9a3:u6TweWX2kvm8YhnFa4BwwY+ARYiPZGI/GzznDJkJxn9YaLp82bswcO6Mil2+ttKA==';
$regionText = 'ftse 350 share';
$js = 'true';
// the expiry is passed in milliseconds: 24 hours from now
$expiry = (time() + (60*60*24)) * 1000;
$signedMagicUrl = signMagic('http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs-multi-currency.html', $uid_api_key, "GET", $expiry, $regionText, $js);

Do not forget to urlencode() the regionText parameter and other request parameters.
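To illustrate the encoding advice, here is a minimal sketch that percent-encodes every parameter value when building the query string. The helper name buildMagicQuery and the example.com URL are hypothetical. Whether the API expects the url parameter to be encoded before signing is an assumption I have not confirmed. Either way, the digest presumably has to be computed over exactly the string that is sent.

```php
<?php
// Hypothetical helper (not part of Import.io's code): builds a query
// string in which every value is passed through urlencode().
function buildMagicQuery(array $params): string
{
    $pairs = [];
    foreach ($params as $name => $value) {
        // urlencode() turns spaces into "+" and escapes ":" and "/".
        $pairs[] = $name . '=' . urlencode((string) $value);
    }
    return implode('&', $pairs);
}

$query = buildMagicQuery([
    'regionText' => 'ftse 350 share',
    'url'        => 'http://www.example.com/a page.html',
    'js'         => 'true',
]);
echo $query, PHP_EOL;
// prints: regionText=ftse+350+share&url=http%3A%2F%2Fwww.example.com%2Fa+page.html&js=true
```

Encoding all values in one place keeps a space or an ampersand inside a parameter from silently splitting the query.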
The signMagic procedure returned this API URL: https://api.import.io/store/data/_magic?regionText=ftse+350+share&_mine=true&_user=4650b976-9796-4830-8778-4822feaf1a28&_expiry=1435075955000&url=http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs-multi-currency.html&js=true&_digest=0f7Z9SNgfJZdLk2Ptm3qFSKR%2Bmc%3D
And for this API URL I was able to get this result online:

[Screenshot: magic_api_result]

JavaScript and infinite scroll

The other extras in Import.io Magic are JavaScript and infinite scroll support, both in beta. I’ll be covering more about these features in a later post.

Bonus

If you want to use Magic without going through the API docs, you can install their Magic bookmark. It works as a piece of JavaScript that fetches the content of the current URL and applies Magic extraction to it.

Stay in touch for new magic.
