Make crawling easy with the Real-Time Crawler by Oxylabs.io

Nowadays, it’s hard to imagine life without search engines. “If you don’t know something, google it!” is one of the most popular maxims of our time. But how many people use Google in an optimal way? Many developers use Google search operators to get the answers they need as fast as possible.

Even this is not enough today! Large and small companies need terabytes of data to keep their business profitable. The search process has to be automated and reliable in order to provide users with fresh news, updates, or posts. In today’s article we will look at a very helpful tool for collecting fresh data: Real-Time Crawler (RTC). Let’s start!

How does Real-Time Crawler work?

The work model is very simple. The user sends a request for the needed data to the crawler. The crawler receives this request and fetches the data itself. After a successful read, the crawler returns the data to the user. The RTC workflow is illustrated here:

Let’s see how it works in practice! On the Oxylabs website there is a demo app for playing with the crawler. We can use it to extract data about Amazon products. All we need to do is enter the ASIN (the product’s unique identifier) and click the button. The product info is returned in JSON format.

Now let’s make our crawler search for hotels in Paris.

Searching with RTC

To reach today’s goal we will use the CLI and PHP. The first thing we need to do is send a request. To do that, we will create a small PHP app. Here is the code:

<?php 
$params = array(
 'source' => 'google_hotels',
 'domain' => 'com',
 'query' => 'hotels in Paris',
 'pages' => 3,
 'context' => [
   ['key' => 'hotel_occupancy',
    'value' => 1,
   ],
   ['key' => 'hotel_dates',
    'value'=> '2019-02-05,2019-02-05',
   ]
 ]
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "username" . ":" . "password");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);

// Report a transport error if the request failed; otherwise print the response
if (curl_errno($ch)) {
  echo 'Error:' . curl_error($ch);
} else {
  echo $result;
}
curl_close($ch);

We describe what we want to search for in a payload structure ($params). This is the main part of our little app. The full list of payload parameters is available in the Oxylabs documentation. In the code above, we send the payload to the crawler and save the response in the $result variable.
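To make the payload easier to vary, it can be built by a small function. This is only a sketch: buildHotelPayload and its argument names are hypothetical, and it assumes that the two comma-separated dates in hotel_dates are the check-in and check-out dates, which the example above does not state explicitly.

```php
<?php
// Hypothetical helper (not part of any Oxylabs SDK): builds the same payload
// structure as the example above from a few arguments.
function buildHotelPayload(string $query, int $pages, int $occupancy, string $dateFrom, string $dateTo): array
{
    return [
        'source'  => 'google_hotels',
        'domain'  => 'com',
        'query'   => $query,
        'pages'   => $pages,
        'context' => [
            ['key' => 'hotel_occupancy', 'value' => $occupancy],
            // ASSUMPTION: the value is "check-in,check-out"
            ['key' => 'hotel_dates', 'value' => $dateFrom . ',' . $dateTo],
        ],
    ];
}

// Same values as in the request above
$payload = buildHotelPayload('hotels in Paris', 3, 1, '2019-02-05', '2019-02-05');
echo json_encode($payload, JSON_PRETTY_PRINT), PHP_EOL;
```

The returned array can be passed to json_encode() and CURLOPT_POSTFIELDS exactly like $params.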

Before we continue, let’s check how fast RTC is. We will make the task harder and request ten pages of results instead of three.

$start = microtime(true); // current time in seconds, as a float
//our code…
$time = microtime(true) - $start; // elapsed time in seconds

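And we see that $time equals 0.677. Since microtime(true) returns seconds, that is about 0.68 seconds (677 ms). Keep in mind that this is the time it takes the API to accept the query: the response in the next section still has the status 'pending', so the pages themselves are collected afterwards.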
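The measurement can be wrapped in a small reusable function. This is a minimal sketch: timeIt is a hypothetical name, and usleep() stands in for the cURL request so that the example runs without network access or credentials.

```php
<?php
// Hypothetical helper: runs $task once and returns [result, elapsed seconds].
function timeIt(callable $task): array
{
    $start = microtime(true);            // seconds, as a float
    $result = $task();
    $elapsed = microtime(true) - $start; // still in seconds
    return [$result, $elapsed];
}

[$value, $seconds] = timeIt(function () {
    usleep(50000); // stand-in for the real request: sleeps for 50 ms
    return 'done';
});

// Print both units to avoid mixing up seconds and milliseconds
printf("Took %.3f s (%.1f ms)\n", $seconds, $seconds * 1000);
```

Printing both units makes it harder to misread a value in seconds as milliseconds.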

RTC response
Let’s go on. What do we have in the response?

{'_links': 
   [{'href': 'https://data.oxylabs.io/v1/queries/6481944577447037953',
      'method': 'GET',
      'rel': 'self'},
   {'href': 'https://data.oxylabs.io/v1/queries/6481944577447037953/results',
      'method': 'GET',
      'rel': 'results'}],
'client_id': 385,
'context': [{'key': 'results_language', 'value': None},
            {'key': 'safe_search', 'value': None},
            {'key': 'tbm', 'value': None},
            {'key': 'cr', 'value': None}],
'created_at': '2018-12-21 18:13:35',
'domain': 'com',
'geo_location': None,
'id': '6481944577447037953',
'limit': 10,
'locale': None,
'pages': 3,
'parse': False,
'query': 'hotels in Paris',
'render': None,
'source': 'google_search',
'start_page': 1,
'status': 'pending',
'subdomain': 'www',
'updated_at': '2018-12-21 18:13:35',
'user_agent_type': 'desktop'
}

On the 5th line we can find the results URL. This is where the hotel search pages (HTML) are stored by the oxylabs.io data service. The HTML pages are delivered to the end user wrapped in JSON. All we need to do is fetch the data from the results URL. This can be done from the CLI with cURL, or with Postman.
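Instead of counting lines, the results URL can be looked up by its rel value. The following is a sketch under the assumption that the response is valid JSON with the same _links structure as shown above; findLink is a hypothetical helper name, and the sample string is a shortened copy of that response.

```php
<?php
// Hypothetical helper: returns the href of the link with the given "rel",
// or null if the response has no such link.
function findLink(array $response, string $rel): ?string
{
    foreach ($response['_links'] ?? [] as $link) {
        if (($link['rel'] ?? null) === $rel) {
            return $link['href'] ?? null;
        }
    }
    return null;
}

// Shortened copy of the response above, written as JSON
$raw = '{"_links":['
     . '{"href":"https://data.oxylabs.io/v1/queries/6481944577447037953","method":"GET","rel":"self"},'
     . '{"href":"https://data.oxylabs.io/v1/queries/6481944577447037953/results","method":"GET","rel":"results"}'
     . '],"status":"pending"}';

$response = json_decode($raw, true); // true => associative arrays
echo findLink($response, 'results'), PHP_EOL;
```

In the first script, the same lookup could be applied to json_decode($result, true).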

Let’s do it with cURL. This simple command fetches the result data from our crawler.

curl --user username:password https://data.oxylabs.io/v1/queries/6481466889326300161/results

The result is a SERP (search engine results page) in JSON format: the HTML code is embedded inside the JSON. So, we need to parse the JSON to see the result page.

To do this, let’s download the JSON to a separate file and move it to a PHP project.

curl --user username:password -o source.txt https://data.oxylabs.io/v1/queries/6484117435057179649/results

Note: You may get an SSL certificate error. To work around it, add the -k parameter after curl. Keep in mind that -k disables certificate verification.
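Because a freshly created query has the status 'pending', it may be worth waiting for the job to finish before downloading the results. The sketch below shows one way to poll for that. It rests on assumptions not confirmed by this article: that requesting the 'self' URL returns the job with an up-to-date status field, and that a finished job reports the status 'done'. waitForJob is a hypothetical helper, and the job is simulated here so the example runs without credentials.

```php
<?php
// Hypothetical helper: calls $fetchJob until the job is finished.
// $fetchJob is any callable that returns the decoded job as an array.
function waitForJob(callable $fetchJob, int $maxAttempts = 10, int $delaySeconds = 2): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $job = $fetchJob();
        $status = $job['status'] ?? 'unknown';

        if ($status === 'done') {      // ASSUMPTION: 'done' marks a finished job
            return true;
        }
        if ($status !== 'pending') {   // any other status is treated as a failure
            return false;
        }
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);      // wait before asking again
        }
    }
    return false;                      // still pending after all attempts
}

// Simulated job: pending twice, then done
$replies = ['pending', 'pending', 'done'];
$fakeFetch = function () use (&$replies) {
    return ['status' => array_shift($replies)];
};

var_dump(waitForJob($fakeFetch, 5, 0));
```

In a real script, $fetchJob would send a GET request to the 'self' URL with the same CURLOPT_USERPWD credentials as before and return json_decode() of the body as an array.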

Parse JSON in result

We are almost there! Now we will create a PHP project to extract the needed data from the txt file. As you remember, the result data comes in JSON format. All we need to do is move the txt file to the PHP project folder and decode it:

<?php
$json = json_decode(file_get_contents('source.txt'));
$results = $json->results;
print_r($results[0]);

And here is the SERP:
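The decoding step can be made a little more defensive. The sketch below checks for JSON errors and collects the HTML of every page, not only the first one. It assumes that each entry in results keeps its HTML in a field named content; that field name is an assumption and is not confirmed by this article. extractPages is a hypothetical helper, and the sample string stands in for the contents of source.txt.

```php
<?php
// Hypothetical helper: decodes the results JSON and returns the HTML of each page.
function extractPages(string $rawJson): array
{
    $data = json_decode($rawJson, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }

    $pages = [];
    foreach ($data['results'] ?? [] as $result) {
        if (isset($result['content'])) {   // ASSUMPTION: HTML is stored in "content"
            $pages[] = $result['content'];
        }
    }
    return $pages;
}

// Stand-in for file_get_contents('source.txt')
$sample = '{"results":['
        . '{"content":"<html><title>Page 1</title></html>"},'
        . '{"content":"<html><title>Page 2</title></html>"}'
        . ']}';

foreach (extractPages($sample) as $i => $html) {
    printf("page %d: %d bytes of HTML\n", $i + 1, strlen($html));
}
```

Each returned string could then be written to an .html file with file_put_contents() and opened in a browser.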

Great, we did it! As you can see, the Real-Time Crawler is a very convenient tool that can help make your business more profitable. All you need is a few lines of code to get the information you are after, which can keep you one step ahead of your rivals.
