Categories
Development

How to extract emails, phones, links (urls) from text fragments?

Recently I noticed a question about extracting emails, phone numbers and links (URLs) from text fragments, and I immediately decided to write this short post.

Regex comes to the rescue

Emails, phone numbers and links each follow a recognizable text pattern. Such patterns are described with regular expressions (regexes, or regex patterns, for short). For example, most emails fit the following regex pattern:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
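The pattern above is anchored with ^ and $, so it checks a whole string rather than finding emails inside a longer text. As a minimal sketch in Python (the function name and sample text are made up for illustration), dropping the anchors lets the same pattern pull addresses out of a fragment:

```python
import re

# Same character classes as the pattern above, but without ^ and $,
# so the regex can match anywhere inside a longer string.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

def extract_emails(text):
    """Return all substrings of text that fit the email pattern."""
    return EMAIL_RE.findall(text)

# Hypothetical sample text, for illustration only.
sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = extract_emails(sample)
print(emails)
```

Note that the last character class allows dots, so an address placed right before a sentence-ending period would capture that period too; real-world text usually needs some cleanup after matching.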

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements that follow one another sequentially rather than ones nested hierarchically.
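To illustrate what "sequential" means here, consider a heading immediately followed by a paragraph at the same nesting level. The excerpt does not show the post's own method, so the following is only a rough regex-based sketch in Python; the markup and variable names are assumptions:

```python
import re

# Hypothetical flat markup: each <h3> label is followed by a sibling <p> value.
html = """
<h3>Name</h3><p>Alice</p>
<h3>City</h3><p>Paris</p>
"""

# Capture each heading together with the paragraph that directly follows it.
# \s* tolerates whitespace between the two sibling elements.
pair_re = re.compile(r"<h3>(.*?)</h3>\s*<p>(.*?)</p>", re.DOTALL)
pairs = pair_re.findall(html)
print(pairs)
```

With an XPath-capable parser, the same pairing is typically expressed through the following-sibling axis instead.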

Categories
Development

An Independent Review of RegViz (Regex Online Tester)

Recently I was asked to look at a brand-new online regex tester, regviz.org, developed as a collaboration of VISUS, University of Stuttgart and University of Trier. Though there are a lot of online regex testers on the market today, and many of them are quite good, let’s look at what is special about regviz.org and what it lacks.

Categories
Development

Using Regex Lookaround for HTML element extraction

Yes, I’m aware that using regex for HTML parsing is not the best idea. Still, when I need to quickly extract a small portion of a web page, I find myself applying regex more often than executing an XPath query, and its lookahead and lookbehind constructs can be quite helpful.
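As a small sketch of the idea in Python (the markup and names below are made up for illustration), a lookbehind and a lookahead can pin the match between two tags without including the tags in the result:

```python
import re

# Hypothetical page fragment, for illustration only.
html = '<div class="price">$19.99</div><div class="title">Blue Widget</div>'

# (?<=...) asserts that the opening tag comes right before the match;
# (?=...) asserts that the closing tag comes right after it.
# Neither assertion consumes characters, so only the inner text is returned.
price_re = re.compile(r'(?<=<div class="price">).*?(?=</div>)')
prices = price_re.findall(html)
print(prices)
```

Python's re module requires a fixed-width lookbehind, which is why the opening tag is spelled out literally here.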

Categories
Development Web Scraping Software

Debuggex Review: get your Regex to play visual

Debuggex is an online Regex testing tool that visualizes how a Regex is matched. The visualization is useful both for learners exploring Regex and for experienced users who want to step through a match forward or backward. It also highlights matches instantly, eliminating the need to press any button to run a pattern. This tool is one of a dozen online Regex testers.

Categories
Development

Email validation Regexes

Now we want to review some email validation Regexes. We’ve chosen Regexes based on readability, complexity and relevance to the RFC standards. For online Regex testing tools refer here.

Categories
Development

Scraping in PHP with cURL

In this post, I’ll explain how to do a simple web page extraction in PHP using cURL, the ‘Client URL library’.

PHP's cURL support is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Fetching data from the web this way handles headers, cookies and errors more reliably than a simple file_get_contents() call. If cURL is not installed, you can read here for Win or here for Linux.

Categories
SEO and Growth Hacking

How to leverage Web Scraping for SEO

Eppie Vojt speaks at the SEOmoz Meetup on leveraging scraping for site SEO. Techniques: XPath and Regex in Google Docs to fetch links and more. Includes a link to the sample Twitter Scraper developed by Eppie Vojt.

Categories
Web Scraping Software

TEST DRIVE: Text list

We’d like to introduce a new SCRAPER TEST DRIVE stage called ‘Text list’. This seemingly simple test case hides an unusual structure. This time the HTML DOM structure is so plain that it leaves you scratching your head, wondering how to approach it. Yet the off-the-shelf products showed their best features, extracting even the smallest items from seemingly plain content.