Regular expressions (Regex)

Regular expressions provide a concise and flexible means to “match” (specify and capture) strings of text, such as particular characters, words, or patterns of characters. This cheat sheet collects the most commonly used regex constructs, with examples, for quick reference.

Introduction

Regular expressions (regexes) are a formal language for describing string patterns used in complex searches. Many text editors, utilities, and programming languages use them to search and manipulate text based on patterns.

There are many dialects of regular expressions, all slightly different. Therefore, when referring to a regex, name the specific programming language or tool (e.g., Perl, Ruby, Python, Java, JavaScript, vi, Emacs, sed, Lex, grep, etc.) you are using. Perl arguably has the most robust regex engine in common use, so the Perl flavor is the most widespread and the most searched for. Regex engines also provide extended features, for example backreferences, substitutions, POSIX character classes, and others. A comparison table of online regex testers for different engines is available as well.
Modern web scraping relies heavily on regexes, so for developers’ convenience we have put together this Regex Cheat Sheet with examples as a handy tool for daily use.

Daily Regexes

Character classes Metacharacters Anchors
Groups and ranges Substitutions Quantifiers
Quantifiers modifier Assertions POSIX
Pattern modifiers Conditions & Comments

Character classes
Character Description Example
\c Control character \cx matches ‘Ctrl+x’
\s White space new\sgame matches new game
\S Not white space \S+ matches cool in cool
\d Digit \d matches 1 3 5 in 1st 3d 5th
\D Not digit \D+ matches  Harley road in 2500 Harley road
\w Matches any word character (letter, digit or underscore) \w matches A 1 2 9 6 _ in A-12-96_
\W Matches any non-word character \W matches - - , in A-12-96,
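
The character classes above can be tried out directly. Here is a minimal sketch using Python's standard `re` module; the variable names are ours, and other engines may treat Unicode characters differently.

```python
import re

# \s: one whitespace character between the two words.
has_space = re.search(r"new\sgame", "new game") is not None

# \d: findall returns each digit separately.
digits = re.findall(r"\d", "1st 3d 5th")

# \D+: a run of non-digits, including the leading space.
non_digits = re.findall(r"\D+", "2500 Harley road")

# \w: letters, digits and the underscore; \W: everything else.
word_chars = re.findall(r"\w", "A-12-96_")
non_word_chars = re.findall(r"\W", "A-12-96,")

print(has_space)       # True
print(digits)          # ['1', '3', '5']
print(non_digits)      # [' Harley road']
print(word_chars)      # ['A', '1', '2', '9', '6', '_']
print(non_word_chars)  # ['-', '-', ',']
```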

Metacharacters and Escape sign
Character Description Example
.*+?^$()[]{}\|/ Characters that have a special meaning unless escaped with a backslash ‘\’ (the slash / is special only where it serves as the pattern delimiter) \^, \$, \\ match  ^, $, \

\ (backslash) Escape character. Gives certain letters a special meaning (see also ‘Character classes’) and escapes metacharacters to suppress their special meaning (see the row above). Examples:
\a matches Bell, hex 07;
\r matches Carriage Return, hex 0D;
\n matches New Line, hex 0A;
\t matches Tab, hex 09;
\v matches Vertical Tab, hex 0B;
\f matches Form Feed, hex 0C;
\ddd matches the character with ASCII octal code ddd;
\xhh matches the character with ASCII hex code hh;
\uxxxx matches the Unicode character given in hexadecimal notation with exactly four hexadecimal digits xxxx
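
To see escaping in action, here is a small sketch in Python's `re` module (an assumption on our side; the table itself is engine-neutral). The sample strings and variable names are made up for illustration. `re.escape` is the standard-library helper that backslashes metacharacters for you.

```python
import re

sample = "name\tvalue\nA^b$c\\d"

tabs = re.findall(r"\t", sample)                  # the tab character
hex_a = re.search(r"\x41", sample).group()        # hex code 41 is 'A'
unicode_a = re.search(r"\u0041", sample).group()  # four hex digits, also 'A'

# Escaped metacharacters match themselves literally.
literals = re.findall(r"\^|\$|\\", sample)

# Unescaped, "1+1=2?" means "one or more 1, then 1, then =, then optional 2".
user_input = "1+1=2?"
raw_match = re.search(user_input, user_input)
safe_match = re.fullmatch(re.escape(user_input), user_input)

print(tabs)                    # ['\t']
print(hex_a, unicode_a)        # A A
print(literals)                # ['^', '$', '\\']
print(raw_match is None)       # True
print(safe_match is not None)  # True
```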

Anchors (match position)
Character Description Example
^ Start of string (or of each line in multiline mode) ^a matches only the first a in abcadaef
$ End of string (or of each line in multiline mode) .$ matches f in abcdef
\A Start of string. Never matches after line breaks. \A. matches only a in abc dfg hij
\Z \z End of string. \Z also matches before a newline at the very end of the string; \z does not. .\Z matches f in abcdef
\b Word boundary .\b matches c in abc; \b. matches a in abc
\B Not a word boundary \B.*\B matches bcd defg hi in abcd defg hij
\G The match must start at the point where the previous match ended  \G\(\d\)  matches (1) and (3) in (1)(3)[4](5)
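
The anchors can be checked with a short Python sketch. Note the dialect differences, which are properties of Python's `re` module rather than of the table: there is no `\G`, and `\Z` matches only at the very end of the string (it behaves like `\z` above). Variable names are ours.

```python
import re

# ^ matches only at the start, so only the first "a" is found.
a_positions = [m.start() for m in re.finditer(r"^a", "abcadaef")]

# $ matches at the end (and before a trailing newline).
last_char = re.search(r".$", "abcdef\n").group()

# Python's \Z matches only at the absolute end, so nothing is found here.
strict_end = re.search(r".\Z", "abcdef\n")

# \A ignores multiline mode, while ^ then matches after every line break.
lines = "abc\ndfg\nhij"
only_first = re.findall(r"\A.", lines, re.MULTILINE)
each_line = re.findall(r"^.", lines, re.MULTILINE)

# \b is a word boundary, \B is any position that is not one.
before_boundary = re.search(r".\b", "abc").group()
after_boundary = re.search(r"\b.", "abc").group()
inner = re.search(r"\B.*\B", "abcd defg hij").group()

print(a_positions)                      # [0]
print(last_char, strict_end)            # f None
print(only_first, each_line)            # ['a'] ['a', 'd', 'h']
print(before_boundary, after_boundary)  # c a
print(inner)                            # bcd defg hi
```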

Groups and Ranges
Character Description Example
. (dot) Any character except new line (\n) a.c matches abc, asc or adc
| (pipe) Logical ‘OR’. abc(def|xyz) matches abcdef or abcxyz
( ) Capturing (active) group; groups are numbered in order of their opening parenthesis (abc) matches and captures abc in 34abcdef
(?: ) Passive group (non-capturing) \d(?:abc) matches 4abc in 34abcdef; abc is grouped but not captured or numbered
[  ] Character set or range  [a-zA-Z0-9] matches any ASCII letter or digit
[^  ] Negated (exclusive) set or range  [^a-d4-6] matches any character except a, b, c, d, 4, 5 or 6
(  )\n Backreference to a numbered group: groups are numbered starting with 1, and \n (n is a number) matches the same text as the nth group  (abc)\1\s(ghi)\1\2 matches abcabc ghiabcghi, where abc and ghi are the 1st and 2nd capture groups respectively.
(?<name>  ) Named group, which can be used as a backreference (?<Day>\d{1,2}) matches 12 in on June 12 and stores the match in the capture group named Day.
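
Here is how the grouping constructs look in practice, as a minimal Python sketch. One dialect difference to keep in mind: Python's `re` module writes named groups as `(?P<name>...)` rather than `(?<name>...)`. The variable names are illustrative only.

```python
import re

# Capturing group: the text is available as group 1.
captured = re.search(r"(abc)", "34abcdef").group(1)

# Non-capturing group: abc is part of the match but is not numbered.
m = re.search(r"(\d)(?:abc)", "34abcdef")
whole_match, numbered_groups = m.group(0), m.groups()

# Backreferences \1 and \2 repeat the text of groups 1 and 2.
backref = re.fullmatch(r"(abc)\1\s(ghi)\1\2", "abcabc ghiabcghi")

# Named group, Python spelling.
day = re.search(r"(?P<Day>\d{1,2})", "on June 12").group("Day")

print(captured)          # abc
print(whole_match)       # 4abc
print(numbered_groups)   # ('4',)
print(backref.groups())  # ('abc', 'ghi')
print(day)               # 12
```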

Substitutions (Backreferences)
Character Description Example
$n Substitutes the substring matched by group number n Applying  ^(\w+)\s(\w+)$ to Mary Seay with $2, $1 as a replace pattern results in Seay, Mary
${name} Substitutes the substring matched by the named group name.** Applying (?<word1>\w+)\s(?<word2>\w+) to ABC DEF with ${word2} ${word1} as a replace pattern results in DEF ABC
$` Substitutes all the text of the input string before the matched string Applying  While to Kris While with $` as a replace pattern results in Kris Kris
$' Substitutes all the text of the input string after the matched string Applying  3+ to 1122334455 with $' as a replace pattern results in 112244554455
$+ Substitutes the last group that was captured Applying  ^(\w+)\s(\w+)$ to Mary Seay with $+ as a replace pattern results in Seay
$_ Substitutes the entire input string Applying  regex to New regex Code with $_ as a replace pattern results in New New regex Code Code
$& Substitutes a copy of the whole match Applying  (\$*(\d*(\.+\d+)?){1}) to $1.4590 with ***$& as a replace pattern results in ***$1.4590*** (the trailing *** comes from an empty match at the end of the string)
$$ Substitutes a $ character Applying  (\d+)\s*(USD)? to 200 USD with $$$1 as a replace pattern results in $200
**Not all Regex libraries support the named group substitutions
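
The `$`-style substitutions above follow the .NET/Perl convention. As a rough sketch of the same ideas in Python, whose `re.sub` uses a different replacement syntax (`\1`, `\g<name>`, `\g<0>`), the snippet below reproduces several rows of the table. Python has no direct counterpart of $` and $', so those are emulated with a replacement function; the helper names are ours.

```python
import re

# $2, $1  ->  \2, \1
swapped = re.sub(r"^(\w+)\s(\w+)$", r"\2, \1", "Mary Seay")

# ${name}  ->  \g<name>  (named groups are written (?P<name>...) in Python)
renamed = re.sub(r"(?P<word1>\w+)\s(?P<word2>\w+)",
                 r"\g<word2> \g<word1>", "ABC DEF")

# $&  ->  \g<0>  (the whole match)
wrapped = re.sub(r"\d+", r"[\g<0>]", "a 12 b 345")

# $$  ->  a plain $ (it is not special in Python replacement strings)
dollars = re.sub(r"(\d+)\s*(USD)?", r"$\1", "200 USD")

def text_before(match):
    """Emulates $`: everything in the input before the match."""
    return match.string[:match.start()]

def text_after(match):
    """Emulates $': everything in the input after the match."""
    return match.string[match.end():]

before = re.sub(r"While", text_before, "Kris While")
after = re.sub(r"3+", text_after, "1122334455")

print(swapped)       # Seay, Mary
print(renamed)       # DEF ABC
print(wrapped)       # a [12] b [345]
print(dollars)       # $200
print(repr(before))  # 'Kris Kris '
print(after)         # 112244554455
```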

Quantifiers
Character Description Example
? 0 or 1 times ab(c?) matches ab or abc
* 0 or more times ".*?" matches "def", "g" and "" in abc"def" "g" "" jkl (here the asterisk is not greedy because of the ? following it)
+ 1 or more times B+ matches BB and B in aaaBBccccBddd
{n} Exactly n times a{3} matches aaa
{n,m} Between n and m times. Greedy if not restricted. a{2,4} matches only aaaa in aaaaa
{n,} At least n times. Greedy. a{2,} matches all of aaaaa in aaaaa

Quantifier Modifier
Character Description Example
? (if after a quantifier) Makes the quantifier lazy instead of greedy be+? matches be in been (not bee)
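
The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal Python sketch, with variable names of our choosing:

```python
import re

text = 'abc"def" "g" "" jkl'

# Lazy: each match stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)

# Greedy: one match runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)

runs_of_b = re.findall(r"B+", "aaaBBccccBddd")

# {2,4} takes as many as allowed; the leftover single "a" cannot match.
bounded = re.findall(r"a{2,4}", "aaaaa")

lazy_e = re.search(r"be+?", "been").group()
greedy_e = re.search(r"be+", "been").group()

print(lazy)              # ['"def"', '"g"', '""']
print(greedy)            # ['"def" "g" ""']
print(runs_of_b)         # ['BB', 'B']
print(bounded)           # ['aaaa']
print(lazy_e, greedy_e)  # be bee
```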

Assertions
Character Description Example
?= Zero-width lookahead assertion \w+(?=\.) matches sleep, ill and Oops in We sleep. The man is ill. Oops.!
?! Zero-width negative lookahead assertion \b\w+(?!\.)\b matches We, The, man and is in We sleep. The man is ill. Oops.!
?<= Zero-width lookbehind assertion (?<=DDR2)\s+\w+ matches 2MB (together with the preceding space) in DDR2 2MB
?<! Zero-width negative lookbehind assertion (?<!20)\d{2}\b matches 45, 76 in 1945 2012 1876 2002
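
The four assertions can be verified with the table's own examples. This is a sketch in Python's `re` module, which requires lookbehind patterns to have a fixed width (true for both examples here); the variable names are ours.

```python
import re

text = "We sleep. The man is ill. Oops.!"

# Lookahead: words that are immediately followed by a period.
before_period = re.findall(r"\w+(?=\.)", text)

# Negative lookahead: whole words that are not followed by a period.
not_before_period = re.findall(r"\b\w+(?!\.)\b", text)

# Lookbehind: DDR2 must precede the match but is not part of it.
after_ddr2 = re.findall(r"(?<=DDR2)\s+\w+", "DDR2 2MB")

# Negative lookbehind: two final digits that are not preceded by 20.
not_after_20 = re.findall(r"(?<!20)\d{2}\b", "1945 2012 1876 2002")

print(before_period)      # ['sleep', 'ill', 'Oops']
print(not_before_period)  # ['We', 'The', 'man', 'is']
print(after_ddr2)         # [' 2MB']
print(not_after_20)       # ['45', '76']
```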

Conditions and comment
Character Description Example
(?(expression) yes | no ) Matches yes if expression matches at this point; otherwise, matches the optional no part. (?(A)A\d{3}\b|\b\d{3}\b) matches A380, 747, 400 in A380 C103 747-400
(?(name) yes | no ) Matches yes if the named group name has captured a match; otherwise, matches the optional no part. (?<quot>")?(?(quot).+?"|\S+\s) matches rock.jpg and "new song.jpg" in rock.jpg "new song.jpg"
(?# ) Comment; the text inside the parentheses is ignored (?# a new regex)
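
Support for conditions varies a lot between engines. As a hedged illustration, Python's `re` module supports only the group-based form `(?(name)yes|no)`, not the expression form, and spells the named group `(?P<quot>...)`. The sketch below runs the quoted-file-name example and an inline comment; the variable names are ours.

```python
import re

# If the optional opening quote was captured, read up to the closing quote;
# otherwise read a run of non-space characters followed by one whitespace.
quoted_or_bare = re.compile(r'(?P<quot>")?(?(quot).+?"|\S+\s)')

names = [m.group() for m in quoted_or_bare.finditer('rock.jpg "new song.jpg"')]

# (?# ...) is ignored by the engine.
number = re.search(r"\d+(?# one or more digits)", "abc 42").group()

print(names)   # ['rock.jpg ', '"new song.jpg"']
print(number)  # 42
```

Note that the first match includes the trailing space, because the no branch ends with \s.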

POSIX
Common range Description Notes
[:upper:] Upper case letters  [A-Z]
[:lower:] Lower case letters  [a-z]
[:alpha:] All letters  [A-Za-z]
[:alnum:] Digits and letters  [0-9A-Za-z]
[:xdigit:] Hexadecimal digits  [0-9A-Fa-f]
[:digit:] Digits  [0-9]
[:punct:] Punctuation  ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:blank:] Space and tab characters only  [ \t]
[:space:] Whitespace characters  [ \t\v\r\f\n]
[:graph:] Anything excluding space, tab, control characters etc.; (printed characters)  [\x21-\x7E]
[:print:] Any printable character and space  [\x20-\x7E]
[:word:] Digits, letters and underscore (a non-standard extension)  [0-9A-Za-z_]
[:cntrl:] Control characters  [\x00-\x1F\x7F]
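
Not every engine understands POSIX classes; Python's `re` module, for one, does not. The sketch below is one possible workaround, not a standard API: a hypothetical helper `expand_posix` that rewrites `[:name:]` inside a bracket expression into the ASCII ranges from the Notes column. The `POSIX_CLASSES` table is our own; `[:punct:]` is left out for brevity, and unknown names raise a `KeyError`.

```python
import re

# ASCII equivalents taken from the table above (punct omitted for brevity).
POSIX_CLASSES = {
    "upper": "A-Z",
    "lower": "a-z",
    "alpha": "A-Za-z",
    "alnum": "0-9A-Za-z",
    "xdigit": "0-9A-Fa-f",
    "digit": "0-9",
    "blank": r" \t",
    "space": r" \t\v\r\f\n",
    "graph": r"\x21-\x7E",
    "print": r"\x20-\x7E",
    "word": "0-9A-Za-z_",
    "cntrl": r"\x00-\x1F\x7F",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range (hypothetical helper)."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

expanded = expand_posix(r"[[:xdigit:]]+")
hex_runs = re.findall(expanded, "ff 1A zz")
identifier = expand_posix(r"[[:alpha:]_][[:word:]]*")

print(expanded)    # [0-9A-Fa-f]+
print(hex_runs)    # ['ff', '1A']
print(identifier)  # [A-Za-z_][0-9A-Za-z_]*
```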

Pattern modifiers
Character Description
g Global match: finds all occurrences of the pattern in the input string
m Multiline mode: ^ and $ match at the start and end of each line
s Single-line mode: the dot in a pattern matches all characters, including newlines
x Extended mode: whitespace in the pattern is ignored
U Ungreedy mode: quantifiers are lazy by default
i Case-insensitive matching
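
How modifiers are written depends on the engine. In Python's `re` module, used here only as an example, they are passed as flags: `re.IGNORECASE` (i), `re.MULTILINE` (m), `re.DOTALL` (s) and `re.VERBOSE` (x). There is no g flag, because `findall`, `finditer` and `sub` already process all matches, and there is no counterpart of U. A minimal sketch with made-up sample strings:

```python
import re

# i: case-insensitive matching.
any_case = re.findall(r"regex", "Regex and REGEX", re.IGNORECASE)

# m: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)

# s: the dot also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# x: whitespace and # comments inside the pattern are ignored.
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
parts = year_month.search("released on 2012-06").groups()

print(any_case)              # ['Regex', 'REGEX']
print(line_starts)           # ['one', 'two', 'three']
print(dot_default is None)   # True
print(dot_all is not None)   # True
print(parts)                 # ('2012', '06')
```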

Regex Pattern modifiers examples are here.
All regex examples were tested with either the Expresso program or the MyRegexTester.com online tester. We hope you benefit from this reference list; your comments and suggestions are welcome.