The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I published a simple PHP/cURL scraper that used regular expressions, several readers reasonably asked for a more efficient scrape with XPath. So, instead of parsing the content with regex, I used the DOMXPath class methods.
Parsing content with XPath takes more preparation, I think. In return, for HTML/XML structures the XPath approach is generally less time- and resource-consuming than regex parsing.
To apply XPath methods, once the content has been downloaded we had better tidy it up and load it into DOMDocument and DOMXPath objects. The steps are:
- Initialize a DOMDocument class instance from page content (work with HTML as with XML)
- Initialize a DOMXPath class instance from DOMDocument class instance.
- Parse the DOMXPath object.
1. Initializing a DOMDocument class instance from page content
- create a new DOMDocument class instance
$DOM = new DOMDocument;
- enable user error handling
libxml_use_internal_errors(true);
When using this function, be sure to clear the internal error buffer with libxml_clear_errors(). If you don't, and you use it in a long-running process, you may find that all your memory is used up. (This advice is borrowed from another source.)
- load the HTML text into the DOMDocument object; on failure, collect the libxml errors, clear the buffer, and stop
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br />";
    }
    libxml_clear_errors();
    print "libxml errors:<br />$errors";
    return;
}
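The note about clearing the error buffer matters most in long-running scripts. Below is a minimal sketch of one way to wrap the loading step so that the buffer is always cleared and the previous libxml setting is restored. The function name loadHtmlDocument and the sample markup are hypothetical; they are not part of PHP or of the scraper above.

```php
<?php
// Hypothetical helper (an assumption for illustration): loads HTML and
// returns the document together with any messages libxml collected.
function loadHtmlDocument(string $html): array
{
    // libxml_use_internal_errors() returns the previous setting
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();                 // free the internal error buffer
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$loaded ? $dom : null, $messages];
}

[$dom, $messages] = loadHtmlDocument('<html><body><p>Hello</p></body></html>');
```

Since the HTML parser usually recovers from badly formed markup, the collected messages are often more informative than the return value of loadHTML() alone.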
Now the DOMDocument object (named ‘$DOM’) contains all the target text as an HTML DOM structure. It’s ready for different methods and properties to be applied.
2. Initializing a DOMXPath object from the DOMDocument object
- initialize a DOMXPath object for further parsing
$xpath = new DOMXPath($DOM);
Now XPath methods are applicable to the content.
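As a rough illustration of what those methods look like, here is a small sketch on an inline HTML string (the markup is made up for the example). It contrasts query(), which returns a node list, with evaluate(), which can also return scalar results such as a count or a string.

```php
<?php
libxml_use_internal_errors(true);

// Made-up sample markup, used only for this illustration
$html = '<html><head><title>Demo</title></head>'
      . '<body><span>a</span><span>b</span></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$spans = $xpath->query('//span');             // DOMNodeList of matching nodes
$count = $xpath->evaluate('count(//span)');   // numeric result
$title = $xpath->evaluate('string(//title)'); // string result

echo $spans->length, " spans, title: ", $title, "\n";
```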
3. Parsing the DOMXPath object
As a test page I took the Blocks Testing Ground page and wrote some code that uses XPath to retrieve data.
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not (@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry){
echo $entry->firstChild->nodeValue;
}
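To make the relative query easier to follow without fetching the page, here is a sketch that runs the same two expressions against a small inline fragment. The markup and product names are assumptions for illustration and may not match the real test page.

```php
<?php
libxml_use_internal_errors(true);

// Assumed markup, loosely imitating a block of products with an ad in between
$html = '<div id="case1">'
      . '<div class="prod"><span>Dell Latitude</span><span>$1144</span></div>'
      . '<div class="ads"><span>Advert</span></div>'
      . '<div class="prod"><span>Acer Aspire</span><span>$499</span></div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Absolute query: find the container anywhere in the document
$case1 = $xpath->query('//*[@id="case1"]')->item(0);

// Relative query: evaluated with $case1 as the context node, so only its
// child <div> elements without class "ads" are considered
$names = [];
foreach ($xpath->query('div[not(@class="ads")]/span[1]', $case1) as $entry) {
    $names[] = $entry->textContent;
}

print_r($names);
```

The sketch reads textContent rather than firstChild->nodeValue, since firstChild is null for an empty element.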
How the libxml library reacts to malformed HTML
The libxml library gave no warning about malformed HTML that lay outside the part of the DOM being parsed, yet it did report an error for malformed HTML in the part that was parsed directly:
- No warning for this case: <p><p><p>
- For a missing bracket, <div prod='name1' <div …>, and then for an extra opened tag, <div prod='name1' ><div>, the library raised an error in the DOMXPath ‘query’ method.
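One way to check this kind of behaviour yourself is to feed small malformed snippets to the parser and list whatever libxml records for each. The sketch below does that. The sample strings are simplified stand-ins chosen for the example, and the messages (if any) depend on the installed libxml2 version, so no particular output is assumed.

```php
<?php
libxml_use_internal_errors(true);

// Simplified, hypothetical samples of malformed markup
$samples = [
    'nested paragraphs' => '<p><p><p>',
    'missing bracket'   => "<div prod='name1' <div>text</div>",
];

$report = [];
foreach ($samples as $label => $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $report[$label] = [];
    foreach (libxml_get_errors() as $error) {
        // level is one of LIBXML_ERR_WARNING, LIBXML_ERR_ERROR, LIBXML_ERR_FATAL
        $report[$label][] = sprintf(
            'level %d, line %d: %s',
            $error->level,
            $error->line,
            trim($error->message)
        );
    }
    libxml_clear_errors(); // start each sample with an empty buffer
}

print_r($report);
```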
The whole scraper listing
<?php
// Fetch the test page with cURL
$curl = curl_init("http://testing-ground.scraping.pro/blocks");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if (curl_errno($curl)) { // check for execution errors
    echo "Scraper error: " . curl_error($curl);
    exit;
}
curl_close($curl);

// Load the page into a DOMDocument, collecting libxml errors internally
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page)) {
    $errors = "";
    foreach (libxml_get_errors() as $error) {
        $errors .= $error->message . "<br/>";
    }
    libxml_clear_errors();
    print "libxml errors:<br>$errors";
    return;
}

// Query the document with XPath
$xpath = new DOMXPath($DOM);
// Locate the container block, then run a query relative to it
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = "div[not (@class='ads')]/span[1]";
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo $entry->firstChild->nodeValue;
}