
PHP tools for web parsing

Almost every developer has faced a data parsing task at some point. The needs vary – from collecting a product catalog to scraping stock prices. Parsing is a popular area of back-end development, and there are specialists who do nothing but build quality parsers and scrapers. Besides, the topic is genuinely interesting and appeals to anyone who enjoys working with the web. Today we review the PHP tools used for parsing web content.

Goutte

Goutte is a convenient screen-scraping and web-crawling library for PHP, built on top of Symfony components (BrowserKit and DomCrawler) and Guzzle. Goutte provides an API for crawling websites and extracting data from HTML and XML responses. Cross-functional, reliable, and easy to use – that makes Goutte one of the best scraping libraries.

The usage is quite simple:

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Cookie\CookieJar;
use GuzzleHttp\Cookie\SetCookie;

// Setting cookies is pretty easy: prepare a Guzzle cookie jar and add a cookie to it
$cookieJar = new CookieJar(true);
$cookieJar->setCookie(new SetCookie([
    'Name'   => 'csrf',
    'Value'  => '12345',
    'Domain' => 'demo.com',   // the domain should match the target site
]));

// Creating an instance of the Client class which will get data from the site.
// Additionally, we attach a configured Guzzle client carrying our cookies.
$client = new Client();
$client->setClient(new GuzzleClient([
    'timeout' => 900,
    'verify'  => false,
    'cookies' => $cookieJar,
]));

// Grabbing the web page content. Instead of $uri, enter the URL of the target page.
$crawler = $client->request('GET', $uri);

Congratulations – you have just fetched data from a web page! Here are some other examples of what the library can do:

// Get the cookie jar back from the client
$cookie = $client->getCookieJar();

// Get a DOM element: paragraphs whose text contains "Hi!"
$crawler->filter('p:contains("Hi!")');

// As Goutte is based on the DomCrawler component, elements are selected with
// CSS selectors, much like in jQuery.
// This is how we can get the text of every element with the "example" class:
$elements = $crawler->filter('.example')->each(function ($res) {
    return $res->text();
});
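
Since Goutte exposes the full DomCrawler API, you can also read attributes and follow links. A brief sketch – the selector and the link text below are illustrative:

// Get the href attribute of the first link inside an element with class "example"
$href = $crawler->filter('.example a')->first()->attr('href');

// Follow a link by its visible text; the client returns a crawler for the new page
$link = $crawler->selectLink('Read more')->link();
$crawler = $client->click($link);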

To use the library, you add it as a dependency in your project's composer.json file. That is why Goutte works best in Composer-based projects and frameworks such as Laravel or Yii, rather than in simple single-page scripts.
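
For reference, pulling Goutte in through Composer can look like this (the package name is fabpot/goutte; the version constraint below is only illustrative):

{
    "require": {
        "fabpot/goutte": "^3.2"
    }
}

Alternatively, you can simply run composer require fabpot/goutte from the project root.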

htmlSQL

htmlSQL is an engaging library that lets you access HTML elements with an SQL-like syntax. If you love SQL, this experimental library may be the right choice.

// For example, a web page contains something like
// <p class="pcls" color="red"></p>

// A simple example of how to extract the needed info: 'color' is an attribute, 'p' is a tag
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();
$wsql->connect('url', 'http://demo.com');

// htmlSQL fetches pages through the Snoopy class, so cookies are set on a Snoopy instance
$snoopy = new snoopy();
$snoopy->cookies["name"] = "John";

// The query is passed as a string; attribute names are prefixed with $
$wsql->query('SELECT * FROM p WHERE $class == "pcls"');

foreach ($wsql->fetch_array() as $row) {
    echo $row['class'];
    echo $row['color'];
}

In some cases it is more convenient to use SQL instead of CSS selectors. The library is fast but has limited functionality, which makes it an ideal match for trivial tasks and quick page parsing.
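
As an illustration, here is the kind of query htmlSQL was designed for, next to the CSS selector it roughly corresponds to (the class name is illustrative):

// htmlSQL: take the href and title attributes of all links with class "list"
$wsql->query('SELECT href, title FROM a WHERE $class == "list"');
foreach ($wsql->fetch_array() as $row) {
    echo $row['href'] . ' - ' . $row['title'] . "\n";
}

// A roughly equivalent CSS selector in a DomCrawler-style library would be: a.list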

Unfortunately, the project was abandoned by its creators in 2006, but htmlSQL is still a reliable helper in parsing and scraping.

Simple HTML DOM

Simple HTML DOM is a PHP library that parses HTML using selectors. It copes with invalid HTML and lets the user query a page with jQuery-like selectors. Let's look at an example:

include_once('simple_html_dom.php');

// Create a DOM object from an HTML string
$div = str_get_html('<div id="get">Get</div><div id="me">Me</div>');

// Add class="added_class" to the second div (indexing starts at 0)
$div->find('div', 1)->class = 'added_class';

// Change the text of the div with id="get" to "simple change"
$div->find('div[id=get]', 0)->innertext = 'simple change';

echo $div;
// As a result, the script echoes the modified HTML: the second div now has the added class
// and the div with id="get" contains the new text 'simple change'
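
Beyond inline strings, the library can also load a page straight from a URL. A minimal sketch – the URL below is just a placeholder:

include_once('simple_html_dom.php');

// Load a remote page directly (the URL is a placeholder)
$html = file_get_html('https://example.com');

// Print the href attribute of every link on the page
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}

// Free the object to avoid memory leaks in long-running scripts
$html->clear();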

Simple HTML DOM is not as fast as other libraries.

cURL

cURL is one of the most popular tools for fetching web pages and ships as a built-in PHP extension. Since it is part of a standard PHP installation, there is no need to include third-party files and classes, though that does not make cURL any more convenient than Goutte. The above-mentioned Simple HTML DOM and cURL form a tandem that is the most popular approach to small parsing tasks.

Let's look at a typical usage of the library:

function curl_get($url, $referer = "https://www.google.com/") {
 $ch = curl_init();  // Initialising cURL
 curl_setopt($ch, CURLOPT_URL, $url);  // Setting the URL cURL will fetch, passed into the function
 curl_setopt($ch, CURLOPT_REFERER, $referer); // The Referer header: where we supposedly came from
 curl_setopt($ch, CURLOPT_HEADER, 0);  // Do not include response headers in the output
 curl_setopt($ch, CURLOPT_COOKIE, "login=User;password=123test");  // Setting cookies as a name=value string
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // Return the page content instead of printing it
 $data = curl_exec($ch);  // Executing the request and storing the response in $data
 curl_close($ch);    // Don't forget to close the cURL handle
 return $data;   // Returning the data from the function
}

As the example shows, a cookie can be passed directly with the CURLOPT_COOKIE option.
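
If cookies need to persist between requests (for example, to stay logged in while scraping several pages), cURL can also write them to and read them from a file. A minimal sketch – the file path is illustrative:

// Save cookies received in responses to a file when the handle is closed
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
// Send cookies from the same file with subsequent requests
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');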

Usage

And this is how it can be used:

// Fetch a web page from the web
$html = curl_get("https://somesite.com");

// Parse the HTML into a DOM structure stored in $dom
$dom = str_get_html($html); // here we use the Simple HTML DOM library to build the DOM

// Get all elements with the class "test"
$elements = $dom->find('.test');

// Print the text content of every element with the class "test"
foreach ($elements as $element) {
   echo $element->plaintext;
}
Conclusion

To sum up, the table below will help you choose the right library for your needs. For the speed comparison, each library was run against http://symfony.com/blog/ .

Library                | Average speed (seconds) | Usability | Special aspects
Goutte                 | 14.3                    | High      | Works well in big projects, OOP style, medium parsing speed.
htmlSQL                | 11.3                    | Medium    | Fast parsing, but limited functionality.
cURL + Simple HTML DOM | 18.6                    | Medium    | Handles invalid HTML, but parsing is slow.

So, for large sites it is better to use Goutte or htmlSQL. For simple tasks there is no need to pull in a framework-level library – built-in cURL together with the single-file Simple HTML DOM is enough for parsing.
