Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I published a simple PHP/cURL scraper that used regexes, some readers reasonably asked for a more efficient scrape with XPath. So, instead of parsing the content with regexes, I used the DOMXPath class methods.

I think parsing content with XPath takes more content preparation. On the other hand, for HTML/XML structures the XPath approach is far less time- and resource-consuming than regex parsing.

If you have a small set of HTML pages that you want to scrape data from and then to stuff into a database, Regexes might work fine… this works well for a limited, one-time job (from community Wiki).

If we are to apply XPath methods, then after we download the content we had better clean it up to prepare it for loading into DOMDocument and DOMXPath objects.

Here I’ve summed up the basic steps for using the DOMXPath class:
  1. Initialize a DOMDocument class instance from the page content (work with HTML as with XML).
  2. Initialize a DOMXPath class instance from the DOMDocument class instance.
  3. Query the document through the DOMXPath object.
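
The steps above can be sketched on a tiny inline page before going into the details. This is only a minimal sketch: the markup in $page and the "price" class are made up for illustration and stand in for a real downloaded page.

```php
<?php
// Hypothetical sample markup, standing in for a downloaded page.
$page = '<html><body><h1>Products</h1><p class="price">$10</p></body></html>';

// Step 1: build a DOMDocument from the HTML text.
libxml_use_internal_errors(true);   // collect parser messages instead of emitting warnings
$DOM = new DOMDocument;
$DOM->loadHTML($page);
libxml_clear_errors();              // empty the error buffer once we are done with it

// Step 2: wrap the document in a DOMXPath object.
$xpath = new DOMXPath($DOM);

// Step 3: run an XPath query; the result is a DOMNodeList we can iterate over.
$prices = [];
foreach ($xpath->query('//p[@class="price"]') as $node) {
    $prices[] = $node->nodeValue;
}
print_r($prices);
```
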
1. Initializing a DOMDocument class instance from page content
  • create a new DOMDocument class instance
$DOM = new DOMDocument;
libxml_use_internal_errors(true);

When using libxml_use_internal_errors(true), be sure to clear the internal error buffer with libxml_clear_errors(). If you don’t, and you use it in a long-running process, you may find that all your memory is used up. (This advice is quoted from an external source.) See the ‘enable user error handling’ bullet point.
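
To illustrate the advice about the error buffer, here is a rough sketch of a loop that parses several pages and clears the buffer after each one. The $pages array is a hypothetical stand-in for content that a real scraper would download, and the second entry is malformed on purpose.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical batch of pages; a real scraper would fetch these with cURL.
$pages = [
    '<div><p>first</p></div>',
    '<div><span>second</div>',   // the closing </span> is missing on purpose
];

$texts = [];
foreach ($pages as $html) {
    $DOM = new DOMDocument;
    $DOM->loadHTML($html);

    // Inspect or log the collected messages here if they matter to you...
    $messages = libxml_get_errors();

    // ...then clear the buffer so it does not keep growing from page to page.
    libxml_clear_errors();

    // loadHTML wraps the fragment in <html><body>, so read the text from <body>.
    $texts[] = trim($DOM->getElementsByTagName('body')->item(0)->textContent);
}

// After the last libxml_clear_errors() call nothing is left in the buffer.
$leftover = count(libxml_get_errors());
```
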

  • load the HTML text into the DOMDocument object
if (!$DOM->loadHTML($page)) { /* loading failed: report the libxml errors, as in the whole scraper listing */ }
How the libxml library reacts to malformed HTML

In my tests the libxml library gave no warning about malformed HTML in parts of the page unrelated to the structure being queried, yet it reported an error when the malformed HTML was itself the subject of the query:

  • No warning for this case: <p><p><p>
  • For a missed bracket: <div prod='name1' <div …> and then for the extra opened tag: <div prod='name1' ><div> the library raised an error in the DOMXPath ‘query’ method.
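
If you want to repeat this kind of check yourself, something like the following sketch lists the messages libxml collects for a given fragment. The helper name libxmlMessages and the sample fragments are my own assumptions for illustration, and the number and wording of the messages depend on the libxml2 version behind your PHP build, so no particular output is promised.

```php
<?php
/**
 * Hypothetical helper: load an HTML fragment and return the libxml
 * messages it produced, formatted as "line N: message".
 */
function libxmlMessages(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();                  // start from an empty buffer

    $DOM = new DOMDocument;
    $DOM->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                  // leave the buffer clean
    libxml_use_internal_errors($previous);  // restore the previous setting
    return $messages;
}

// Illustrative fragments modelled on the two cases above.
$samples = [
    'nested <p>'      => '<p><p><p>',
    'missing bracket' => "<div prod='name1' <div>text</div>",
];

foreach ($samples as $label => $html) {
    $messages = libxmlMessages($html);
    echo $label, ': ', count($messages), " message(s)\n";
    foreach ($messages as $message) {
        echo '  ', $message, "\n";
    }
}
```
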
The whole scraper listing
// Download the page with cURL.
$curl = curl_init("http://testing-ground.scraping.pro/blocks");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if (curl_errno($curl)) // check for execution errors
{
    echo "Scraper error: " . curl_error($curl);
    exit;
}
curl_close($curl);
// Build the DOM, collecting libxml errors instead of emitting PHP warnings.
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}
libxml_clear_errors(); // drop recoverable parser messages so the buffer does not grow
// Find the element with id="case1", then query relative to it.
$xpath = new DOMXPath($DOM);
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = "div[not(@class='ads')]/span[1]";
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo $entry->firstChild->nodeValue;
}
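
To try the relative query from the listing without hitting the network, here is a small offline sketch. The markup is invented for illustration (the real test page may be structured differently); it only assumes a container with id="case1" holding product blocks and one block with class "ads".

```php
<?php
// Hypothetical markup: two product blocks and one ad block inside #case1.
$page = <<<'HTML'
<div id="case1">
  <div class="prod"><span>Product A</span><span>first description</span></div>
  <div class="ads"><span>Advertisement</span></div>
  <div class="prod"><span>Product B</span><span>second description</span></div>
</div>
HTML;

libxml_use_internal_errors(true);
$DOM = new DOMDocument;
$DOM->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($DOM);

// Absolute query: locate the container anywhere in the document.
$case1 = $xpath->query('//*[@id="case1"]')->item(0);

$names = [];
if ($case1 !== null) {
    // Relative query: evaluated against $case1, the context node.
    // It takes the first <span> of every child <div> whose class is not "ads".
    foreach ($xpath->query("div[not(@class='ads')]/span[1]", $case1) as $entry) {
        $names[] = $entry->textContent;
    }
}
print_r($names);
```

Passing the context node as the second argument keeps the second expression short and limits it to the block of interest, which is the same design the listing uses.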
