Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements in sequential (document) order rather than in hierarchical order.

An Independent Review of RegViz (Regex Online Tester)

Recently I was asked to look at a brand-new online regex tester, regviz.org, developed as a collaboration between VISUS, University of Stuttgart, and the University of Trier. There are a lot of online regex testers on the market today, and many of them are quite good, so let’s look at what is special about regviz.org and what it lacks.

Using Regex Lookaround for HTML element extraction

Yes, I’m aware that using regex for HTML parsing is not the best idea. Still, when I need to quickly extract a small portion of a web page, I find myself applying regex more often than executing an XPath query, and its lookahead and lookbehind constructs can be quite helpful.
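
As a rough illustration of the lookaround idea, here is a minimal Python sketch. The HTML fragment and the class name "price" are made up for the example, and this is only one way to write such a pattern, not the pattern from the post itself. Note that Python's re module requires a lookbehind to have a fixed width, which the literal opening tag satisfies.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
html = '<div class="price">$19.99</div><div class="name">Widget</div>'

# (?<=...) is a lookbehind: the opening tag must precede the match
# but is not included in it.
# (?=...) is a lookahead: the closing tag must follow the match
# but is not included in it.
pattern = re.compile(r'(?<=<div class="price">)[^<]+(?=</div>)')

prices = pattern.findall(html)
print(prices)
```

Because the tags sit inside lookarounds, the match contains only the text between them, so no capture group or post-processing is needed.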

Debuggex Review: get your Regex to play visual

Debuggex is an online Regex testing tool that allows visualization of Regex match algorithms. The visualization feature is good both for the learners who do some Regex exploration and for the experienced users who might want to track the Regex match forward or back. It is also useful for an instant Regex pattern match by highlighting, thus eliminating the need for pressing any buttons to run Regex patterns. This tool is one of a dozen online Regex testers.

Email validation Regexes

Now we want to review some email validation Regexes. We’ve chosen Regexes based on readability, complexity and relevance to the RFC standards. For online Regex testing tools refer here.
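
To give a feel for the readability end of that trade-off, here is a small Python sketch of a deliberately simple email check. The pattern and the helper name looks_like_email are assumptions made for this example, not one of the reviewed Regexes, and the pattern is far from RFC-complete: it rejects some valid addresses (quoted local parts, for instance) and accepts some that a stricter check would reject.

```python
import re

# A simple, readable pattern (an assumption for this sketch, not RFC-complete):
#   local part:  letters, digits and a few common punctuation characters
#   domain:      one or more dot-separated labels
#   TLD:         at least two letters
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def looks_like_email(address: str) -> bool:
    """Return True if the whole string matches the simple pattern above."""
    return EMAIL_RE.fullmatch(address) is not None


for candidate in ["user@example.com", "user@@example.com", "no-at-sign"]:
    print(candidate, looks_like_email(candidate))
```

Using fullmatch rather than search means the entire string has to be an address, which is usually what validation needs.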

Scraping in PHP with cURL

In this post, I’ll explain how to do a simple web page extraction in PHP using cURL, the ‘Client URL library’.

PHP’s cURL support is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Fetching data from the web this way handles headers, cookies and errors more robustly than a simple file_get_contents() call. If the cURL extension is not installed, you can read here for Win or here for Linux.

How to leverage Web Scraping for SEO

Eppie Vojt speaks at the SEOmoz Meetup about leveraging scraping for site SEO. Techniques: XPath and Regex in Google Docs to fetch links and more. Includes a link to the sample Twitter Scraper developed by Eppie Vojt.

TEST DRIVE: Text list

We’d like to introduce the new SCRAPER TEST DRIVE stage, called ‘Text list’. This seemingly simple test case hides an unusual structure. This time the HTML DOM structure is so plain that it leaves you scratching your head, wondering how to approach it. Yet the off-the-shelf products have shown their best features, extracting even the smallest details from seemingly plain content.

Data Mining with Google Refine

Google Refine is a free data processing tool that stands alongside other free Google data analysis tools. Because of its close association with web scraping, we want to shed some light on it.