Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular way to parse HTML content with XPath.
After I wrote a simple PHP/cURL scraper using regular expressions, some readers reasonably asked for a more efficient scraper built on XPath. So, instead of parsing the content with regexes, I used the DOMXPath class methods.

Parsing content with XPath takes more preparation of the content, I think. In return, XPath's approach to parsing HTML/XML structures is much less time- and resource-consuming than regex parsing.

If you have a small set of HTML pages that you want to scrape data from and then to stuff into a database, Regexes might work fine… this works well for a limited, one-time job (from community Wiki).

If we are going to apply XPath methods, then after we download the content we had better clean it up to prepare it for loading into DOMDocument and DOMXPath objects.

Here are the basic steps for using the DOMXPath class:

1. Initialize a DOMDocument class instance from the page content (working with HTML as if it were XML).
2. Initialize a DOMXPath class instance from the DOMDocument class instance.
3. Parse the DOMXPath object.
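
The three steps can be sketched end to end on a tiny inline page. This is only a minimal illustration: the markup in $page and the id "list" are made up for the example and are not part of the scraper described in this post.

```php
<?php
// Hypothetical markup used only for this illustration.
$page = '<html><body><ul id="list"><li>First</li><li>Second</li></ul></body></html>';

// Step 1: build a DOMDocument from the page content.
libxml_use_internal_errors(true);   // buffer parse warnings instead of printing them
$DOM = new DOMDocument;
$DOM->loadHTML($page);
libxml_clear_errors();              // empty the error buffer once we are done with it

// Step 2: build a DOMXPath object on top of the document.
$xpath = new DOMXPath($DOM);

// Step 3: run an XPath query and read the matched nodes.
$items = [];
foreach ($xpath->query('//ul[@id="list"]/li') as $li) {
    $items[] = $li->nodeValue;
}
echo implode(", ", $items), "\n";   // prints "First, Second"
```

Each step is covered in more detail in the sections that follow.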

1. Initializing a DOMDocument class instance from page content

create a new DOMDocument class instance

$DOM = new DOMDocument;

disable standard libxml errors

libxml_use_internal_errors(true);

When using this function, be sure to clear the internal error buffer with libxml_clear_errors(). If you don't, and you use this in a long-running process, the buffered errors may eventually use up all your memory. The error-handling code that follows the loadHTML() call below does exactly that.
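
As a rough sketch of what this means for a long-running job, the loop below parses several pages and empties the buffer after each one. The $pages array is a hypothetical stand-in for content fetched elsewhere, and how many errors each fragment produces depends on the libxml version.

```php
<?php
// Hypothetical inputs standing in for pages fetched in a long-running job.
$pages = [
    '<div><p>one</p></div>',
    '<div><span>two</div>',          // unclosed <span>
    '<div><foo>three</foo></div>',   // non-standard tag
];

libxml_use_internal_errors(true);    // errors are buffered from now on

$texts = [];
foreach ($pages as $page) {
    $DOM = new DOMDocument;
    $DOM->loadHTML($page);

    // Read whatever was buffered for this page...
    $count = count(libxml_get_errors());
    echo "buffered errors for this page: $count\n";

    // ...then clear the buffer so it does not grow from page to page.
    libxml_clear_errors();

    $texts[] = trim($DOM->documentElement->textContent);
}

// After the loop the buffer is empty again.
$remaining = count(libxml_get_errors());
```

Clearing inside the loop, rather than once at the end, keeps memory use flat no matter how many pages are processed.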

load the HTML text into the DOMDocument object

if (!$DOM->loadHTML($page)) {
    // enable user error handling: read the buffered libxml errors, then clear the buffer
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br />";
    }
    libxml_clear_errors();
    print "libxml errors:<br />$errors";
    return;
}

Now the DOMDocument object (named '$DOM') contains all the target text as an HTML DOM structure. Its methods and properties are ready to be used.
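
For instance, even before XPath comes in, the DOMDocument object can be walked with its own methods. The sketch below uses made-up markup and collects link targets with getElementsByTagName() and getAttribute().

```php
<?php
// Hypothetical markup for illustration only.
$page = '<html><body><a href="/a">Alpha</a><a href="/b">Beta</a></body></html>';

libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($page);
libxml_clear_errors();

// Map each link's href attribute to its text.
$links = [];
foreach ($DOM->getElementsByTagName('a') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
print_r($links);
```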

2. Initializing a DOMXPath object from the DOMDocument object

Initialize a DOMXPath object for further parsing

$xpath = new DOMXPath($DOM);

Now XPath methods can be applied to the content.

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote some code that uses XPath to retrieve data.

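// find the container element with id="case1"
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
// relative query: the first <span> of each child <div> that is not an ad block
$query = 'div[not (@class="ads")]/span[1]';
// the second argument makes $case1 the context node for the relative query
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry){
    echo $entry->firstChild->nodeValue;
}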
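
To see how the context node affects a relative query, here is a self-contained sketch. The markup is my own guess at a structure of this kind (a container with product blocks and an ad block); the real test page may be laid out differently.

```php
<?php
// Hypothetical markup: the real test page may be structured differently.
$page = '<html><body>
<div id="case1">
  <div class="prod"><span>Item A</span><span>$10</span></div>
  <div class="ads"><span>Buy now!</span></div>
  <div class="prod"><span>Item B</span><span>$20</span></div>
</div>
</body></html>';

libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($DOM);
$query = 'div[not(@class="ads")]/span[1]';

$names = [];
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
if ($case1 !== null) {               // item(0) is null when nothing matched
    // Relative query evaluated with $case1 as the context node.
    foreach ($xpath->query($query, $case1) as $entry) {
        $names[] = $entry->textContent;
    }
}
echo implode(", ", $names), "\n";

// The same relative path without a context node does not start at #case1.
$withoutContext = $xpath->query($query)->length;
```

Checking item(0) for null before using it avoids an error when the page has no element with the expected id.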

How the libxml library reacts to malformed HTML

The libxml library gave no warning about malformed HTML that was unrelated to the part of the DOM structure being parsed. It did, however, issue an error for malformed HTML that was the direct subject of the parse:

No warning for this case: <p><p><p>
For a missing bracket, <div prod='name1' <div …>, and then for an extra opened tag, <div prod='name1' ><div>, the library issued an exception for the DOMXPath 'query' method.
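
If you want to check how your own libxml build reacts, you can dump the buffered errors after loading a fragment. The sketch below uses a made-up fragment with a missing bracket; which messages appear, if any, depends on the libxml version, so no output is shown here.

```php
<?php
// Hypothetical malformed fragment: the first <div> is missing its closing bracket.
$page = "<div prod='name1' <div><p>text</p></div>";

libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$loaded = $DOM->loadHTML($page);     // true on success, false on failure

foreach (libxml_get_errors() as $error) {
    // Each LibXMLError carries level, code, line, column and message.
    printf("level %d, line %d: %s\n", $error->level, $error->line, trim($error->message));
}
libxml_clear_errors();               // always empty the buffer afterwards
```

The level property can be compared against LIBXML_ERR_WARNING, LIBXML_ERR_ERROR and LIBXML_ERR_FATAL to separate harmless warnings from real failures.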

The whole scraper listing

<?php
$curl = curl_init("http://testing-ground.scraping.pro/blocks");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if (curl_errno($curl)) // check for execution errors
{
    echo "Scraper error: " . curl_error($curl);
    exit;
}
curl_close($curl);
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors.= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}
$xpath = new DOMXPath($DOM);
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = "div[not (@class='ads')]/span[1]";
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo $entry->firstChild->nodeValue;
}
