Regular expressions (Regex) – webscraping.pro

Regular expressions provide a concise and flexible way to “match” (specify and capture) strings of text, such as particular characters, words, or patterns of characters. This page collects the most commonly used regex constructs, with examples, as a quick reference.

Introduction

Regular expressions (regexes) are a formal language for describing string patterns in complex searches. Many text editors, utilities, and programming languages use them to search and manipulate text based on patterns.

There are many dialects of regular expressions, all slightly different. Therefore, when referring to a regex, name the specific programming language or tool (e.g., Perl, Ruby, Python, Java, JavaScript, vi, Emacs, sed, Lex, grep, etc.) you are using. It’s probably fair to say that Perl has the most robust regex engine in common use, so Perl-style regex is the most widespread and the most searched for. Regex engines also provide extended features such as backreferences, substitutions, and POSIX character classes. A comparison table of online regex testers for different engines is available as well.
Modern web scraping relies heavily on regexes, so for developers’ sake we put together this Regex Cheat Sheet with examples as a handy tool for daily use.

Daily Regexes

Character classes	Metacharacters	Anchors
Groups and ranges	Substitutions	Quantifiers
Quantifiers modifier	Assertions	POSIX
Pattern modifiers	Conditions & Comments

Character classes
Character	Description	Example
`\c`	Control character	`\cx` matches ‘Ctrl+x’
`\s`	White space	`new\sgame` matches `new game`
`\S`	Not white space	`\S+` matches `cool` in `cool`
`\d`	Digit	`\d` matches `1` `3` `5` in `1st 3d 5th`
`\D`	Not digit	`\D+` matches `Harley road` in `2500 Harley road`
`\w`	Word character (letter, digit, or underscore)	`\w` matches `A` `1` `2` `9` `6` `_` in `A-12-96_`
`\W`	Matches any non-word character	`\W` matches `-` `-` `,` in `A-12-96,`
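
The rows above can be checked with a short script. This is a minimal sketch using Python's built-in `re` module (chosen here only for illustration; other engines may treat Unicode letters and digits differently), and the variable names are hypothetical.

```python
import re

# re.findall returns every non-overlapping match, left to right.
digits = re.findall(r"\d", "1st 3d 5th")
non_digits = re.findall(r"\D+", "2500 Harley road")
word_chars = re.findall(r"\w", "A-12-96_")
non_word_chars = re.findall(r"\W", "A-12-96,")

# \s matches the single space between the two words.
spaced = re.search(r"new\sgame", "a new game starts").group()

print(digits)          # ['1', '3', '5']
print(non_digits)      # [' Harley road']
print(word_chars)      # ['A', '1', '2', '9', '6', '_']
print(non_word_chars)  # ['-', '-', ',']
print(spaced)          # new game
```

Note that `\D+` also picks up the space before `Harley`, since a space is not a digit.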

Metacharacters and Escape sign
Character	Description	Example
`.*+?^$()[]{}|\/`	Characters with a special meaning unless escaped with a backslash ‘\’ (`/` is special only where it serves as the pattern delimiter)	`\^`, `\$`, `\\` match `^`, `$`, `\`
`\`	Escape character (backslash). Before a letter, it gives the letter a special meaning (see also ‘Character classes’). Before a metacharacter, it suppresses the special meaning (see the row above).	`\a` matches Bell, hex07; `\r` matches Carriage Return, hex0D; `\n` matches New Line, hex0A; `\t` matches Tab, hex09; `\v` matches Vertical Tab, hex0B; `\f` matches Form Feed, hex0C; `\ddd` matches ASCII octal code ddd; `\xhh` matches ASCII hex code hh; `\uxxxx` matches the Unicode character given by exactly four hexadecimal digits xxxx
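
To illustrate escaping, here is a small sketch in Python (an assumption of this example, not something the table prescribes). It uses `re.escape` from the standard library to backslash every metacharacter in a literal string, and compares an escaped dot with an unescaped one. The sample strings are made up.

```python
import re

# A literal string full of metacharacters: $ . ( )
literal = "price: $5.00 (approx.)"
escaped = re.escape(literal)

# The escaped pattern matches the literal text exactly.
matches_itself = re.fullmatch(escaped, literal) is not None

# An unescaped dot matches any character; an escaped dot matches only ".".
unescaped_dot = re.fullmatch(r"5.00", "5x00") is not None
escaped_dot = re.fullmatch(r"5\.00", "5x00") is not None

# Escaped metacharacters inside a character class match themselves.
special = re.findall(r"[\^\$\\]", r"a^b$c\d")

print(matches_itself)  # True
print(unescaped_dot)   # True
print(escaped_dot)     # False
print(special)         # ['^', '$', '\\']
```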

Anchors (match position)
Character	Description	Example
`^`	Start of string	`^a` matches only the first `a` in `abcadaef`
`$`	End of string	`.$` matches `f` in `abcdef`
`\A`	Start of string. Never matches after line breaks.	`\A.` matches only `a` in `abc dfg hij`
`\Z` `\z`	End of string. `\Z` also matches before a newline at the very end of the string.	`.\Z` matches `f` in `abcdef`
`\b`	Word boundary	`.\b` matches `c` in `abc`; `\b.` matches `a` in `abc`
`\B`	Not a word boundary	`\B.*\B` matches `bcd defg hi` in `abcd defg hij`
`\G`	The match must start at the point where the previous match ended	`\G\(\d\)` matches `(1)` and `(3)` in `(1)(3)[4](5)`
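
The anchors can be tried out as follows. This is a sketch in Python's built-in `re` module, which is an assumption of the example. That module has no `\G`, and its `\Z` matches only at the very end of the string, so those two rows are left out. The sample strings are hypothetical.

```python
import re

# ^ anchors at the start of the string, so only the first "a" is found.
start_positions = [m.start() for m in re.finditer(r"^a", "abcadaef")]

# $ anchors at the end of the string.
last_char = re.search(r".$", "abcdef").group()

# \b requires a word boundary at that position; \B requires that there is none.
sample = "cat concat cats cat"
whole_words = re.findall(r"\bcat\b", sample)
inside_word = re.findall(r"\Bcat", sample)

# In multiline mode ^ matches after every newline, while \A still matches once.
lines = "abc\ndfg\nhij"
line_starts = re.findall(r"^\w", lines, re.MULTILINE)
string_start = re.findall(r"\A\w", lines, re.MULTILINE)

print(start_positions)  # [0]
print(last_char)        # f
print(whole_words)      # ['cat', 'cat']
print(inside_word)      # ['cat']
print(line_starts)      # ['a', 'd', 'h']
print(string_start)     # ['a']
```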

Groups and Ranges
Character	Description	Example
`.` (dot)	Any character except new line (\n)	`a.c` matches `abc`, `asc` or `adc`
`|` (pipe)	Logical ‘OR’ (alternation)	`abc(def|xyz)` matches `abcdef` or `abcxyz`
`( )`	Capturing (active) group	`(abc)` matches and captures `abc` in `34abcdef`
`(?: )`	Non-capturing (passive) group	`\d(?:abc)` matches `4abc` in `34abcdef`; `abc` is grouped but not stored as a capture
`[ ]`	Character class (range)	`[a-zA-Z0-9]` matches any ASCII letter or digit
`[^ ]`	Negated (exclusive) character class	`[^a-d4-6]` matches any character except a, b, c, d, 4, 5 or 6
`( )\n`	Backreference to a numbered group. Groups are numbered from 1, and `\n` matches the same text that the nth group captured.	`(abc)\1\s(ghi)\1\2` matches `abcabc ghiabcghi`, where `abc` and `ghi` are the 1st and 2nd capture groups respectively
`(?<name> )`	Named group, which can be used as a backreference	`(?<Day>\d{1,2})` matches `12` in `on June 12` and stores it in the capture group named `Day`
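
A short sketch of groups and backreferences, written for Python's built-in `re` module as an assumption of this example. Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, so the named-group pattern is adapted accordingly. The capturing `(\d)` in the third example is added for illustration.

```python
import re

# Numbered groups and backreferences: \1 repeats group 1, \2 repeats group 2.
m = re.fullmatch(r"(abc)\1\s(ghi)\1\2", "abcabc ghiabcghi")
numbered = m.groups()

# Alternation inside a non-capturing group: findall returns whole matches.
alternatives = re.findall(r"abc(?:def|xyz)", "abcdef abcxyz abcqrs")

# A capturing group next to a non-capturing one: only the digit is stored.
mixed = re.search(r"(\d)(?:abc)", "34abcdef")
whole_match = mixed.group(0)
captures = mixed.groups()

# Named group, Python syntax.
day = re.search(r"(?P<Day>\d{1,2})", "on June 12").group("Day")

print(numbered)      # ('abc', 'ghi')
print(alternatives)  # ['abcdef', 'abcxyz']
print(whole_match)   # 4abc
print(captures)      # ('4',)
print(day)           # 12
```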

Substitutions (Backreferences)
Character	Description	Example
`$n`	Substitutes the substring matched by group number n	Applying `^(\w+)\s(\w+)$` to `Mary Seay` with `$2, $1` as a replace pattern results in `Seay, Mary`
`${name}`	Substitutes the substring matched by the named group name.**	Applying `(?<word1>\w+)\s(?<word2>\w+)` to `ABC DEF` with `${word2} ${word1}` as a replace pattern results in `DEF ABC`
$`	Substitutes all the text of the input string before the matched string	Applying `While` to `Kris While` with $` as a replace pattern results in `Kris Kris`
`$'`	Substitutes all the text of the input string after the matched string	Applying `3+` to `1122334455` with `$'` as a replace pattern results in `112244554455`
`$+`	Substitutes the last group that was captured	Applying `\b(\w+)\s\1\b` to `the the dog` with `$+` as a replace pattern results in `the dog`
`$_`	Substitutes the entire input string	Applying `regex` to `New regex Code` with `$_` as a replace pattern results in `New New regex Code Code`
`$&`	Substitutes a copy of the whole match	Applying `(\$(\d(\.+\d+)?){1})` to `$1.4590` with `**$&**` as a replace pattern results in `**$1.4590**`
`$$`	Substitutes a $ character	Applying `(\d+)\s*(USD)?` to `200 USD` with `$$$1` as a replace pattern results in `$200`

**Not all Regex libraries support the named group substitutions
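
Substitution syntax varies by engine. As a sketch, the examples below use Python's built-in `re.sub`, where numbered groups are written `\1`, named groups `\g<name>`, and the whole match `\g<0>`. As far as I know, the `$`-style forms in the table are not recognized in Python replacement strings, so a `$` there is just a literal character. The bracket example is hypothetical.

```python
import re

# Numbered groups: swap first and last name.
swapped = re.sub(r"^(\w+)\s(\w+)$", r"\2, \1", "Mary Seay")

# Named groups: Python uses (?P<name>...) in the pattern and \g<name> in the replacement.
reordered = re.sub(r"(?P<word1>\w+)\s(?P<word2>\w+)", r"\g<word2> \g<word1>", "ABC DEF")

# Whole match: \g<0> plays the role of $& in the table.
bracketed = re.sub(r"\d+", r"[\g<0>]", "a 12 b 345")

# A dollar sign needs no escaping in a Python replacement string.
with_dollar = re.sub(r"(\d+)\s*(USD)?", r"$\1", "200 USD")

print(swapped)      # Seay, Mary
print(reordered)    # DEF ABC
print(bracketed)    # a [12] b [345]
print(with_dollar)  # $200
```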

Quantifiers
Character	Description	Example
`?`	0 or 1 times	`ab(c?)` matches `ab` or `abc`
`*`	0 or more times	`".*?"` matches `"def"`, `"g"` and `""` in `abc"def" "g" "" jkl` (here the asterisk is not greedy because of the `?` following it)
`+`	1 or more times	`B+` matches `BB` and `B` in `aaaBBccccBddd`
`{n}`	Exactly n times	`a{3}` matches `aaa`
`{n,m}`	Between n and m times. Greedy if not restricted.	`a{2,4}` matches only `aaaa` in `aaaaa`
`{n,}`	At least n times. Greedy.	`a{2,}` matches `aaaaa` in `aaaaa` … Greedy!

Quantifier modifier
Character	Description	Example
`?` (if after a quantifier)	Makes the quantifier lazy (non-greedy)	`be+?` matches `be` in `been` (not `bee`)
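
The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal sketch in Python's `re` module (an assumption of this example), reusing the strings from the table:

```python
import re

quoted = 'abc"def" "g" "" jkl'

# Lazy: each match stops at the nearest closing quote.
lazy = re.findall(r'".*?"', quoted)

# Greedy: one match runs from the first quote to the last.
greedy = re.findall(r'".*"', quoted)

# +? takes as few characters as possible, + as many as possible.
lazy_plus = re.search(r"be+?", "been").group()
greedy_plus = re.search(r"be+", "been").group()

# {2,4} is greedy: it takes four of the five a's, and the leftover a cannot match.
bounded = re.findall(r"a{2,4}", "aaaaa")
runs = re.findall(r"B+", "aaaBBccccBddd")

print(lazy)         # ['"def"', '"g"', '""']
print(greedy)       # ['"def" "g" ""']
print(lazy_plus)    # be
print(greedy_plus)  # bee
print(bounded)      # ['aaaa']
print(runs)         # ['BB', 'B']
```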

Assertions
Character	Description	Example
`?=`	Zero-width lookahead assertion	`\w+(?=\.)` matches `sleep`, `ill` and `Oops` in `We sleep. The man is ill. Oops.!`
`?!`	Zero-width negative lookahead assertion	`\b\w+(?!\.)\b` matches `We`, `The`, `man` and `is` in `We sleep. The man is ill. Oops.!`
`?<=`	Zero-width positive lookbehind assertion	`(?<=DDR2)\s+\w+` matches ` 2MB` (including the leading space) in `DDR2 2MB`
`?<!`	Zero-width negative lookbehind assertion	`(?<!20)\d{2}\b` matches `45`, `76` in `1945 2012 1876 2002`
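
The four assertions can be verified with the sketch below, written for Python's `re` module (an assumption of this example). Python requires lookbehind patterns to have a fixed width, which holds for the patterns used here.

```python
import re

text = "We sleep. The man is ill. Oops.!"

# Lookahead: words that are immediately followed by a period.
before_period = re.findall(r"\w+(?=\.)", text)

# Negative lookahead: whole words that are not followed by a period.
not_before_period = re.findall(r"\b\w+(?!\.)\b", text)

# Lookbehind: whitespace and a word that come right after "DDR2".
after_ddr2 = re.findall(r"(?<=DDR2)\s+\w+", "DDR2 2MB")

# Negative lookbehind: final two digits not preceded by "20".
not_after_20 = re.findall(r"(?<!20)\d{2}\b", "1945 2012 1876 2002")

print(before_period)      # ['sleep', 'ill', 'Oops']
print(not_before_period)  # ['We', 'The', 'man', 'is']
print(after_ddr2)         # [' 2MB']
print(not_after_20)       # ['45', '76']
```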

Conditions and comments
Character	Description	Example
`(?(expression) yes | no )`	Matches yes if expression matches at the current position; otherwise, matches the optional no part.	`(?(A)A\d{3}\b|\b\d{3}\b)` matches `A380`, `747`, `400` in `A380 C103 747-400`
`(?( name ) yes | no )`	Matches yes if the name capture has a match; otherwise, matches the optional no part.	`(?<quot>")?(?(quot).+?"|\S+\s)` matches `rock.mp3 ` (including the trailing space) and `"new song.mp3"` in `rock.mp3 "new song.mp3"`
`(?# )`	Comment text in brackets	`(?# a new regex :-))`
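
A sketch of the name-based conditional and the inline comment, assuming Python's `re` module. Python supports conditionals only on a group name or number, not on an arbitrary expression, and it writes the named group as `(?P<quot>...)`.

```python
import re

# If the optional quote group matched, read up to the closing quote;
# otherwise read a run of non-space characters followed by one whitespace.
pattern = re.compile(r'(?P<quot>")?(?(quot).+?"|\S+\s)')
text = 'rock.mp3 "new song.mp3"'

# finditer is used because findall would return the group, not the whole match.
matches = [m.group(0) for m in pattern.finditer(text)]

# (?# ...) is ignored by the engine.
commented = re.search(r"\d+(?# one or more digits)", "ab12cd").group()

print(matches)    # ['rock.mp3 ', '"new song.mp3"']
print(commented)  # 12
```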

POSIX
Common range	Description	Notes
`[:upper:]`	Upper case letters	[A-Z]
`[:lower:]`	Lower case letters	[a-z]
`[:alpha:]`	All letters	[A-Za-z]
`[:alnum:]`	Digits and letters	[0-9A-Za-z]
`[:xdigit:]`	Hexadecimal digits	[0-9A-Fa-f]
`[:digit:]`	Digits	[0-9]
`[:punct:]`	Punctuation	! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
`[:blank:]`	Space and tab characters only	[ \t]
`[:space:]`	Whitespace characters	[ \t\n\v\f\r]
`[:graph:]`	Anything excluding space, tab, control characters etc.; (printed characters)	[\x21-\x7E]
`[:print:]`	Any printable character and space	[\x20-\x7E]
`[:word:]`	Digits, letters and underscore	[0-9A-Za-z_]
`[:cntrl:]`	Control characters	[\x00-\x1F\x7F]
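
Not every engine implements POSIX classes. Python's built-in `re` module does not, for example. One workaround is to substitute the ASCII equivalents from the Notes column. The sketch below does that with a hypothetical helper `posix_class` and a hypothetical lookup table `POSIX_ASCII`; neither is part of any library.

```python
import re

# ASCII equivalents taken from the table above (hypothetical lookup table).
POSIX_ASCII = {
    "upper": "A-Z",
    "lower": "a-z",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "0-9A-Za-z",
    "xdigit": "0-9A-Fa-f",
    "blank": r" \t",
    "space": r" \t\n\v\f\r",
    "word": "0-9A-Za-z_",
}

def posix_class(name: str, negate: bool = False) -> str:
    """Build a bracket expression equivalent to [[:name:]] for ASCII input."""
    prefix = "^" if negate else ""
    return f"[{prefix}{POSIX_ASCII[name]}]"

upper_runs = re.findall(posix_class("upper") + "+", "Hello WORLD")
is_hex = re.fullmatch(posix_class("xdigit") + "+", "1A2b3C") is not None
not_hex = re.fullmatch(posix_class("xdigit") + "+", "1G") is not None
fields = re.split(posix_class("blank") + "+", "a \t b")
non_digit_runs = re.findall(posix_class("digit", negate=True) + "+", "ab12cd")

print(upper_runs)      # ['H', 'WORLD']
print(is_hex)          # True
print(not_hex)         # False
print(fields)          # ['a', 'b']
print(non_digit_runs)  # ['ab', 'cd']
```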

Pattern modifiers
Character	Description
`g`	Global match: finds all occurrences of the pattern in the input string
`m`	Multiline mode: `^` and `$` also match at the start and end of each line
`s`	Single-line mode: the dot matches all characters, including newlines
`x`	Extended mode: whitespace in the pattern is ignored
`U`	Ungreedy pattern
`i`	Case-insensitive matching
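
How modifiers are applied depends on the engine. As a sketch, Python's `re` module passes them as flags: `re.IGNORECASE` for `i`, `re.MULTILINE` for `m`, `re.DOTALL` for `s`, `re.VERBOSE` for `x`. It has no `g` flag (`re.findall` and `re.finditer` are already global) and no `U` flag. The sample text and the `phone` pattern are hypothetical.

```python
import re

text = "Regex one\nregex two\nREGEX three"

# i: case-insensitive; findall already returns all occurrences.
all_hits = re.findall(r"regex", text, re.IGNORECASE)

# m: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# s: without DOTALL the dot stops at a newline; with it, the dot crosses lines.
dot_default = re.search(r"one.regex", text)
dot_all = re.search(r"one.regex", text, re.DOTALL)

# x: whitespace and comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # three-digit prefix
    -         # literal hyphen
    (\d{4})   # four-digit number
""", re.VERBOSE)
parts = phone.search("call 555-1234 now").groups()

print(all_hits)             # ['Regex', 'regex', 'REGEX']
print(line_starts)          # ['Regex', 'regex', 'REGEX']
print(dot_default is None)  # True
print(dot_all is not None)  # True
print(parts)                # ('555', '1234')
```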

Regex Pattern modifiers examples are here.
All regex examples were tested with either the Expresso program or the MyRegexTester.com online tester. We hope you benefit from this reference list; comments and suggestions are welcome.