Categories
Development

Remove empty html tags recursively

Sometimes we have the code with html tags that contain nothing but whitespace characters. Often those tags are nested. See a code below:

<div> 
 <div> 
  <div></div> 
 </div> 
</div>

What regex might be used to find and remove those tags?

Obvious solution is <div>\s*?<\/div> .

\s stands for “whitespace character”. It includes [ \t\n\x0B\f\r]. That is: \s matches a space( ) or a tab (\t) or a line(\n) break or a vertical tab (\x0B) sometimes referred as (\v) or a form feed (\f) or a carriage return (\r) .

General case

In general case, we use the following regex:
<(?<tag>[a-z]+?)( [^>]+?|)>\s*?<\/(\k<tag>)>

where <tag> is a named match group: [a-z]+?

JAVA code

When applying it recursively we might use the following code, JAVA:

public static String removeEmptyTags(String html) {
        boolean compareFound = true;
        Pattern pattern = Pattern.compile("<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>", Pattern.MULTILINE | Pattern.DOTALL);
        while (compareFound) {
            compareFound = false;
            Matcher matcher = pattern.matcher(html);
            if(matcher.find()) {
                compareFound = true;
                html = matcher.replaceAll("");
            }
        }
        return html;
    }
Categories
Development

JavaScript, Regex match groups

Often we want only a certain info from the matched content. So, groups help with that.

The following example shows how to fetch the [duplicate] entry index from the error message. For that we take 1st group, index “1”:

const regex = /Duplicate entry\s'([^']+)+'/gm;
const str = `{ Error: (conn=42434, no: 1062, SQLState: 23000) Duplicate entry '135' for key 'PRIMARY'
sql: INSERT INTO \`test\` (id , string1) values (?,?) - parameters:[[135,'string 756']]
    at Object.module.exports.createError (C:\\Users\\User\\Documents\\RnD\\Node.js\\mercateo-`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Categories
Development

How to extract emails, phones, links (urls) from text fragments?

Recently I noticed the question about extracting emails, phones, links(urls) from text fragments and immediately I decided to write this short post.

Regex comes to rescue

Each of the following: email, phones, link, form a category that falls under/matches a certain text pattern. What are the text patterns ? These are regexes, aka regex patterns, short for regular expressions. Eg. most emails fit into the following regex pattern: 

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract some HTML elements ordered sequentially rather than in hierarhical order.

Categories
Development

An Independent Review of RegViz (Regex Online Tester)

regviz.org logoRecently I was asked to look at a brand- new online regex tester, regviz.org, developed as a collaboration of VISUS, University of Stuttgart and University of Trier. Though there are a lot of regex online testers on the market today, and many of them are quite good, let’s look at what is special about regviz.org and what it lacks.

Categories
Development

Using Regex Lookaround for HTML element extraction

Yes, I’m aware that using regex for HTML parsing is not the best idea. But still when I need to quickly extract some small portion of a web page I find myself applying regex more often than executing an XPath query, and its lookahead and lookbehind constructions may be quite helpful.

Categories
Development Web Scraping Software

Debuggex Review: get your Regex to play visual

Debuggex is an online Regex testing tool that allows visualization of Regex match algorithms. The visualization feature is good both for the learners who do some Regex exploration and for the experienced users who might want to track the Regex match forward or back. It is also useful for an instant Regex pattern match by highlighting, thus eliminating the need for pressing any buttons to run Regex patterns. This tool is one of a dozen online Regex testers.

Categories
Development

Email validation Regexes

Now we want to review some email validation Regexes. We’ve chosen Regexes based on readability, complexity and RFC standarts relevance. For online Regex testing tools refer here.

Categories
Development

Scraping in PHP with cURL

In this post, I’ll explain how to do a simple web page extraction in PHP using cURL, the ‘Client URL library’.

The curl  is a part of libcurl, a library that allows you to connect to servers with many different types of protocols. It supports the http, https and other protocols. This way of getting data from web is more stable with header/cookie/errors process rather than using simple file_get_contents(). If curl() is not installed, you can read here for Win or here for Linux.

Categories
SEO and Growth Hacking

How to leverage Web Scraping for SEO

Eppie Vojt at the SEOmoz Meetup on the scrape leverage for the site SEO. Techniques: XPath and Regex in Google Docs to fetch links and more.