Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works by attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year, and back then we noted a few issues:
- The scraper only works on pages with more than one row of data, such as search results or category pages.
- It seems to have trouble with some JavaScript pages.
But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes.
How to form the magic request based on the api_key
After a few attempts and some help from Import.io’s support, I figured out how to form the request to the Magic API. I recommend using the PHP URL signing code; it helped me quite a bit.
Leverage regionText for selecting the region (table) to process using a keyword contained within it
The basic Magic (that most people use) only extracts data from what it thinks is the most important list of data on the page. But suppose I want data from a table that isn’t in the main part of the site. Let’s take the London Stock Exchange web page. To locate a table, I use the regionText parameter so that Magic processes the area containing a certain phrase. For more on how to use regionText see the docs here. In this example, the regionText phrase would be “ftse 350 share”.
Here is the final code for the Magic request:
function signMagic($urlParameter, $apiKey, $method, $expiry, $regionText, $js) {
    // extract the userGuid and signing key
    $parts = explode(":", $apiKey);
    $userGuid = $parts[0];
    $key = $parts[1];
    $url = "https://api.import.io/store/data/_magic?regionText=" . urlencode($regionText) . "&_mine=true";
    // form the URL to be signed
    $newUrl = $url . '&_user=' . $userGuid . "&_expiry=" . $expiry . "&url=" . $urlParameter . '&js=' . $js;
    // we sign the http verb + the url with the user and expiry information
    $check = $method . " " . $newUrl;
    // use a sha1 hmac algorithm for the digest & base 64 encode it
    $digest = base64_encode(hash_hmac("sha1", $check, base64_decode($key), true));
    // add the digest onto the end of the URL
    return $newUrl . "&_digest=" . urlencode($digest);
}

$uid_api_key = '44444444-cccc-bbbb-aaaa-d38786f5e9a3:u6TweWX2kvm8YhnFa4BwwY+ARYiPZGI/GzznDJkJxn9YaLp82bswcO6Mil2+ttKA==';
$regionText = 'ftse 350 share';
$js = 'true';
// the expiry is one day from now, in milliseconds
$signedMagicUrl = signMagic('http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs-multi-currency.html', $uid_api_key, "GET", (time() + (60*60*24)) * 1000, $regionText, $js);
Do not forget to urlencode() the regionText parameter and other request parameters.
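To see what that encoding actually does, here is a minimal sketch using PHP's built-in urlencode() on the values from this post. The variable names are mine, chosen for illustration. Whether the Magic endpoint needs the target page URL itself to be encoded is an assumption I have not verified; the signMagic call in this post passes it through unencoded.

```php
<?php
// Illustrative only: shows what urlencode() produces for values used in this post.
$regionText = 'ftse 350 share';
$digest     = '0f7Z9SNgfJZdLk2Ptm3qFSKR+mc='; // base64 digest taken from the signed URL in this post
$targetUrl  = 'http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs-multi-currency.html';

// Spaces become '+', so the phrase is safe inside a query string.
$encodedRegion = urlencode($regionText);

// Base64 output can contain '+', '/' and '=', which must be percent-escaped.
$encodedDigest = urlencode($digest);

// Assumption: encoding the target URL may or may not be required by the API.
$encodedUrl = urlencode($targetUrl);

echo $encodedRegion, PHP_EOL;
echo $encodedDigest, PHP_EOL;
echo $encodedUrl, PHP_EOL;
```

The first two lines of output match the regionText and _digest values that appear in the signed URL in this post.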
The signMagic function returned this API URL: https://api.import.io/store/data/_magic?regionText=ftse+350+share&_mine=true&_user=4650b976-9796-4830-8778-4822feaf1a28&_expiry=1435075955000&url=http://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs-multi-currency.html&js=true&_digest=0f7Z9SNgfJZdLk2Ptm3qFSKR%2Bmc%3D
With this API URL I was able to get the result online.
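If you would rather fetch the result from PHP than open the URL in a browser, something like the following sketch could work. fetchMagicResult is a hypothetical helper of mine, not part of Import.io's code, and it assumes the endpoint answers with a JSON body; it makes no assumption about the fields inside that JSON.

```php
<?php
// Hypothetical helper (not from Import.io): requests a signed Magic URL and
// decodes the body as JSON. $fetch lets you swap in another HTTP client.
function fetchMagicResult(string $signedUrl, ?callable $fetch = null): array
{
    // Default transport: a plain GET via file_get_contents (needs allow_url_fopen).
    $fetch = $fetch ?? function (string $url) {
        return file_get_contents($url);
    };

    $body = $fetch($signedUrl);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . $signedUrl);
    }

    // Decode to an associative array; the response layout is not assumed here.
    $data = json_decode($body, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response was not a JSON object or array');
    }
    return $data;
}
```

Calling fetchMagicResult($signedMagicUrl) and printing array_keys() of the result is a quick way to see what the API sends back before writing any parsing code.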
Javascript and infinite scroll
The other extras in Import.io Magic are JavaScript and infinite scroll support, both in beta. I’ll be covering more about these features in a later post.
Bonus
If you want to use Magic without digging into the API docs, you can install their Magic bookmark. It works as a JavaScript snippet that fetches the current URL’s content and applies Magic extraction to it.
Stay in touch for new magic.