Categories
Development

Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I built a simple PHP/cURL scraper using Regex, some readers reasonably asked for a more efficient scrape with XPath. So, instead of parsing the content with Regex, I used the DOMXPath class methods.

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements in sequential order rather than in hierarchical order.
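To make the idea of sequential extraction concrete, here is a minimal sketch in Python. It is an illustration only, not the method from the post: the sample markup and the `group_by_heading` helper are hypothetical, and the input is assumed to be well-formed. The standard library's ElementTree supports only a subset of XPath (no `following-sibling` axis), so the sketch walks the siblings in document order instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample: headings and paragraphs are flat siblings,
# so the hierarchy alone does not say which paragraph belongs to which heading.
HTML = """
<div>
  <h2>First</h2>
  <p>a</p>
  <p>b</p>
  <h2>Second</h2>
  <p>c</p>
</div>
"""


def group_by_heading(container, heading_tag="h2"):
    """Group sibling elements under the heading that precedes them."""
    groups = []
    for child in container:  # children are iterated in document order
        if child.tag == heading_tag:
            groups.append((child.text, []))
        elif groups:
            # Attach the element's text to the most recent heading.
            groups[-1][1].append(child.text)
    return groups


sections = group_by_heading(ET.fromstring(HTML))
print(sections)
```

With a full XPath engine, the same grouping would typically be expressed with the `following-sibling` axis.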

Categories
Review

Scraper Google Chrome extension

Scraper is a Google Chrome extension: a handy scraping tool, perfect for capturing data from web pages and putting it into Google spreadsheets. This tool stands in line with the other scraping software, services and plugins.

Categories
Uncategorized

About XPath

XPath is a formal language used to navigate through and query elements and attributes in XML documents. Besides its use in XSL and XQuery, it is very useful for DOM data access and extraction. Both XML and HTML/XHTML documents can be parsed into a DOM and queried with XPath.
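As a small illustration of querying elements and attributes, here is a sketch using Python's standard `xml.etree.ElementTree`, which implements a limited subset of XPath. The sample catalog document is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the post.
XML = """
<catalog>
  <book id="b1" lang="en"><title>XPath Basics</title><price>10</price></book>
  <book id="b2" lang="de"><title>XQuery Intro</title><price>25</price></book>
</catalog>
"""

root = ET.fromstring(XML)

# Select every <title> element anywhere below the root.
titles = [t.text for t in root.findall(".//title")]

# Select <book> elements by attribute value, then read another attribute.
english_ids = [b.get("id") for b in root.findall(".//book[@lang='en']")]

print(titles)
print(english_ids)
```

The first expression navigates by element name, the second filters on an attribute with a predicate; these are the two basic moves most scraping queries are built from.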

Categories
Development

Find XPath using web developer tools

Often, for the purpose of scraping, one needs to find the XPath of certain elements on a webpage. How can one do that with the browser's Web developer tools, aka Web inspector? A picture is worth a thousand words.

[Image: find_xpath_web_dev_tools]

Categories
Miscellaneous

Simple way HTML change monitoring

I recently came across this question in the Q&A section of a forum I belong to:

“I want to run once a day a script that will check whether the specific part of code has been changed, and if it did, we would get some return message (ideally directly to my email). What would be the easiest, simplest way to do that? I’ve read about web crawlers, web scrappers, but they seem to be doing far more than we need.”

Sure, if all you want to do is something as lightweight as monitoring a set of target pages for changes, then a ready-made monitoring tool is probably way more than you need. You need to keep it simple. So, here’s a quick solution with a Google spreadsheet.
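For readers who prefer a script, the core of change detection can be sketched in a few lines of Python. This is not the spreadsheet approach the post describes, just an alternative illustration: the `fingerprint` and `has_changed` helpers and the sample fragments are hypothetical, and fetching the page, storing the digest between runs, and sending the email are left out.

```python
import hashlib


def fingerprint(fragment: str) -> str:
    """Return a SHA-256 hex digest of the fragment with whitespace normalized."""
    normalized = " ".join(fragment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_digest, fragment: str) -> bool:
    """Report a change only when an earlier digest exists and differs."""
    return previous_digest is not None and previous_digest != fingerprint(fragment)


# Hypothetical fragments standing in for the monitored part of a page.
old = fingerprint("<div id='price'>  $10 </div>")
same = has_changed(old, "<div id='price'> $10 </div>")
diff = has_changed(old, "<div id='price'> $12 </div>")
print(same, diff)
```

Normalizing whitespace before hashing keeps purely cosmetic reformatting of the markup from triggering a notification.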

Categories
Development

5 Best XPath Cheat Sheets and Quick References

I always love a good cheat sheet hanging on my corkboard when I’m working, and XPath is one of the fields where I often refer to one. If you’re looking for a good XPath cheat sheet, you will probably find something useful in this post.

Categories
Development

XPath in Examples

Here we’ll show how XPath works. Let’s take the following XML as a lab rat.

Categories
Web Scraping Software

TEST DRIVE: Invalid HTML

Now we start a new Scraper Test Drive stage called ‘Invalid HTML’. How do scrapers behave with broken HTML code? Basically they did well, with one nearly universal problem: not recognizing a link with unmatched quotes.

Categories
SEO and Growth Hacking

How to leverage Web Scraping for SEO

Eppie Vojt speaks at the SEOmoz Meetup on leveraging scraping for site SEO. Techniques: XPath and Regex in Google Docs to fetch links and more.