The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I wrote a simple PHP/cURL scraper using regex, some readers reasonably asked for a more efficient scraper based on XPath. So, instead of parsing the content with regex, I used the DOMXPath class methods.
Often, we need to extract HTML elements in sequential (document) order rather than in hierarchical order.
XPath is a formal language used to navigate through and query elements and attributes in XML documents. Although this notation is used in XSL and XQuery, it is also very useful for DOM data access and extraction. With XPath, both XML documents and HTML/XHTML documents can be parsed into a DOM and queried.
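To make the idea of querying elements and attributes concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample catalog, its element names, and its attribute values are invented for this illustration and do not come from any real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML = """
<catalog>
  <book id="b1" lang="en"><title>Learning XPath</title><price>29.90</price></book>
  <book id="b2" lang="de"><title>XML Grundlagen</title><price>19.50</price></book>
  <book id="b3" lang="en"><title>Scraping Basics</title><price>24.00</price></book>
</catalog>
"""

root = ET.fromstring(XML)

# Element query: every <title> anywhere below the root.
titles = [t.text for t in root.findall(".//title")]

# Attribute predicate: only <book> elements whose lang attribute is "en".
english_ids = [b.get("id") for b in root.findall(".//book[@lang='en']")]

# Positional predicate: the second <book> child of the root (XPath positions are 1-based).
second_title = root.find("./book[2]/title").text

print(titles)
print(english_ids)
print(second_title)
```

The same three expressions should carry over to a full XPath 1.0 engine such as PHP's DOMXPath, which additionally supports axes and functions that ElementTree leaves out.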
Often, for the purpose of scraping, one needs to find the XPath of certain elements on a webpage. How can one do that with the browser's web developer tools, aka the Web Inspector? A picture is worth a thousand words.
I recently came across this question in the Q&A section of a forum I belong to:
Sure, if all you want to do is something as lightweight as monitoring a set of target pages for changes, then a ready-made monitoring tool is probably far more than you need. You need to keep it simple. So, here’s a quick solution with a Google spreadsheet.
I always love having a good cheat sheet hanging on my corkboard when I’m working, and XPath is one of the fields where I often refer to one. If you’re looking for a good XPath cheat sheet, you will probably find something useful in this post.
Here we’ll show how XPath works. Let’s take the following XML as a lab rat.