Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I’ve done a simple PHP/cURL scraper using Regex some have reasonably mentioned a request for a more efficient scrape with XPath. So, instead of parsing the content with Regex, I used DOMXPath class methods.

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract some HTML elements ordered sequentially rather than in hierarhical order.

Scraper Google Chrome extension

Scraper is a Google Chrome extension. Scraper is a handy scraping tool, perfect for capturing data from web pages and putting it into Google spreadsheets. This tool stands in line with the other scraping software, services and plugins.

About XPath

XPath is a formal language that is used to navigate through and query elements and attributes in XML documents. While this notation is being used in XSL and XQuery, it is very useful for DOM data access and extraction. XML documents and also HTML/XHTML documents are objects of DOM parsing while using XPath.

Find XPath using web developer tools

Often for the purpose of scraping, one needs to find certain elements’ XPath on a webpage. How can one do that with browser Web developer tools, aka Web inspector? A picture is worth of thousand words.

find_xpath_web_dev_tools

 

Simple way HTML change monitoring

html_change_mnitoring_logo1I recently came across this question in the Q&A section of a forum I belong to:

“I want to run once a day a script that will check whether the specific part of code has been changed, and if it did, we would get some return message (ideally directly to my email). What would be the easiest, simplest way to do that? I’ve read about web crawlers, web scrappers, but they seem to be doing far more than we need.”

Sure, if all you want to do is something as lightweight as monitoring a set of target pages for changes, then using a ready monitoring tool is probably way more than you need. You need to keep it simple. So, here’s a quick solution with Google spreadsheet.

5 Best XPath Cheat Sheets and Quick References

XPath Cheat Sheets I always love a good cheat sheet hanging on my corkboard when I’m working, and XPath is one of the fields where I often refer to it. If you’re looking for a good XPath cheat sheet you will probably find something useful in this post.

XPath in Examples

Here we’ll show how XPath works. Let’s take the following XML as a lab rat.

TEST DRIVE: Invalid HTML

Now we will start a new Scraper Test Drive stage called ‘Invalid HTML‘. How do scrapers behave with a broken html code? Basically they did well, with almost common problem of not recognizing an unmatched quotes link.

How to leverage Web Scraping for SEO

Eppie Vojt at the SEOmoz Meetup on the scrape leverage for the site SEO. Techniques: XPath and Regex in Google Docs to fetch links and more.