Yes, I’m aware that using regex for HTML parsing is not the best idea. Still, when I need to quickly extract a small portion of a web page, I find myself reaching for regex more often than for an XPath query, and its lookahead and lookbehind constructions can be quite helpful.
Regex Lookaround Syntax
There are four regex syntax constructions for lookaround:
positive lookahead
match(?=expr)
the “expr” has to be found after the matching value
negative lookahead
match(?!expr)
the “expr” must not be found after the matching value
positive lookbehind
(?<=expr)match
the “expr” has to be found before the matching value
negative lookbehind
(?<!expr)match
the “expr” must not be found before the matching value
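As a quick illustration of the four constructions, here is a minimal Python sketch. The sample string and the variable names are made up for this example, and the `\b` word boundaries are an addition that stops the negative forms from matching only part of a number.

```python
import re

# Hypothetical sample text, chosen only to exercise all four constructions.
text = "100 USD, 200 EUR, $300"

# Positive lookahead: digits that are followed by " USD".
ahead_pos = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers that are NOT followed by " USD".
ahead_neg = re.findall(r"\b\d+\b(?! USD)", text)

# Positive lookbehind: digits that are preceded by "$".
behind_pos = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
behind_neg = re.findall(r"(?<!\$)\b\d+", text)

print(ahead_pos)   # ['100']
print(ahead_neg)   # ['200', '300']
print(behind_pos)  # ['300']
print(behind_neg)  # ['100', '200']
```

In every case the lookaround text itself is left out of the match, so only the digits are returned.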
Usage Example
Suppose you need to extract the amount value from the following piece of source HTML:
<span>AWS Service Charges: $69</span>
You can do it using regex groups:
(\d+)\s*</
or using regex lookaround:
(?<=AWS Service Charges:\s*\$)\d+(?=\s*<)
Though the second expression looks more complicated, the two approaches differ in what they actually match:
- with groups, the whole text block is matched and you need to extract the group separately
- with lookaround, only the needed value is matched and it is ready for use
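The difference can be seen in a short Python sketch. One assumption to note: Python's `re` module rejects a variable-length lookbehind, so the `\s*` inside the lookbehind is replaced here with a single literal space; the variable names are illustrative only.

```python
import re

html = "<span>AWS Service Charges: $69</span>"

# Groups: the overall match also contains "</", so the value
# has to be pulled out of group 1.
group_match = re.search(r"(\d+)\s*</", html)
whole_match = group_match.group(0)   # "69</"
group_value = group_match.group(1)   # "69"

# Lookaround: the overall match is already the value itself.
# A literal space stands in for \s* because Python's re needs
# a fixed-width lookbehind (an adaptation for this sketch).
look_match = re.search(r"(?<=AWS Service Charges: \$)\d+(?=\s*<)", html)
look_value = look_match.group(0)     # "69"

print(whole_match, group_value, look_value)
```

This matters most with calls such as `re.findall` or `re.sub`, which operate on the whole match unless groups are handled explicitly.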
Limitations
Regex lookaround has some limitations, though:
- JavaScript supports lookahead, but lookbehind was only added in ECMAScript 2018, so older engines do not support it
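- In Python’s re module and in PCRE, lookbehinds must have a fixed length. This means that you can’t use variable-length quantifiers such as *, + or ? within a lookbehind, and alternation is restricted: Python requires all alternatives to have the same length, while PCRE allows alternatives of different fixed lengths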
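The fixed-length restriction can be checked with a small sketch using Python's built-in `re` module. The `compiles` helper and the sample strings are hypothetical, introduced only for this illustration; the fallback to a capturing group is one possible workaround, not the only one.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A variable-length quantifier inside a lookbehind is rejected.
variable_ok = compiles(r"(?<=Charges:\s*\$)\d+")

# Alternatives of equal length are accepted, unequal ones are not.
same_len_ok = compiles(r"(?<=USD|EUR)\d+")
diff_len_ok = compiles(r"(?<=\$|EUR)\d+")

# Workaround when the prefix has variable length: use a capturing group.
html = "<span>AWS Service Charges:   $69</span>"
amount = re.search(r"Charges:\s*\$(\d+)", html).group(1)

print(variable_ok, same_len_ok, diff_len_ok, amount)
# False True False 69
```

When the prefix really is variable in length, falling back to a group trades the "match is the value" convenience for portability across engines.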