Using Regex Lookaround for HTML element extraction

Yes, I’m aware that using regex for HTML parsing is not the best idea. Still, when I need to quickly extract a small portion of a web page, I find myself reaching for regex more often than for an XPath query, and its lookahead and lookbehind constructs can be quite helpful.

Regex Lookaround Syntax

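There are four regex syntax constructs for lookaround. Each one is a zero-width assertion: it checks the text next to the match without becoming part of the match.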

positive lookahead

match(?=expr)
the “expr” has to match immediately after the matched value

negative lookahead

match(?!expr)
the “expr” must not match immediately after the matched value

positive lookbehind

(?<=expr)match
the “expr” has to match immediately before the matched value

negative lookbehind

(?<!expr)match
the “expr” must not match immediately before the matched value
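As a minimal sketch of the four constructs, here they are in Python's re module. The sample string and variable names are made up for illustration. Note the \b in the negative patterns: without it the engine could backtrack to a shorter run of digits and still report a match.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "price: 100 USD, 200 EUR, cost: 300 USD"

# Positive lookahead: numbers immediately followed by " USD"
usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers NOT followed by " USD"
not_usd = re.findall(r"\d+\b(?! USD)", text)

# Positive lookbehind: numbers immediately preceded by "cost: "
cost = re.findall(r"(?<=cost: )\d+", text)

# Negative lookbehind: whole numbers NOT preceded by "cost: "
not_cost = re.findall(r"(?<!cost: )\b\d+", text)

print(usd, not_usd, cost, not_cost)
```

In every case the result contains only the digits, because the lookaround text is checked but never consumed.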

Usage Example

Suppose you need to extract the amount value from the following piece of source HTML:

<span>AWS Service Charges: $69</span>
Note that this is awkward with pure XPath 1.0, which has no regex support: you would have to fall back on string functions such as substring-after()

You can do it using regex groups:
(\d+)\s*</
or using regex lookaround:

(?<=AWS Service Charges:\s*\$)\d+(?=\s*<)
Though the second expression looks more complicated, the difference between these approaches is clear:

  • with groups, the overall match also includes the surrounding text (here the trailing </), so you need to extract the group separately
  • with lookaround, only the needed value is matched and it is ready for use
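The difference can be sketched in Python as follows. This is an illustration under one assumption: because Python's re module only accepts fixed-width lookbehinds, the \s* inside the lookbehind is replaced here with a single literal space. The variable names are hypothetical.

```python
import re

html = "<span>AWS Service Charges: $69</span>"

# Capturing group: the overall match includes the trailing "</",
# so the amount has to be pulled out of group 1
m_group = re.search(r"(\d+)\s*</", html)
whole_match = m_group.group(0)
amount_from_group = m_group.group(1)

# Lookaround: the overall match is the amount itself.
# The lookbehind uses a single space instead of \s* (fixed width).
m_look = re.search(r"(?<=AWS Service Charges: \$)\d+(?=\s*<)", html)
amount_from_lookaround = m_look.group(0)

print(repr(whole_match), amount_from_group, amount_from_lookaround)
```

Both approaches yield the same amount; the lookaround version simply makes it the whole match, which is convenient in tools that only give you the full match.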

Limitations

There are some limitations with using regex lookaround though:

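  • JavaScript has always supported lookahead, but lookbehind was only added in ECMAScript 2018, so older engines reject it
  • In Python’s re module and in PCRE, a lookbehind must match text of a fixed length. This means that you can’t use unbounded quantifiers such as * or + within a lookbehind. Python also requires all alternatives to have the same length, while PCRE lets each top-level alternative have its own fixed length. The \s* inside the lookbehind in the example above therefore needs an engine with variable-length lookbehind, such as .NET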
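To make the fixed-width restriction concrete, here is a small sketch in Python. It tries to compile the variable-width lookbehind and then falls back to a capturing group, which is one common workaround rather than the only one. The sample HTML with extra spaces and the variable names are assumptions for illustration.

```python
import re

# Hypothetical input with a variable amount of whitespace before the "$"
html = "<span>AWS Service Charges:   $69</span>"

# Python's re rejects a lookbehind whose width is not fixed
try:
    re.compile(r"(?<=AWS Service Charges:\s*\$)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: match the prefix normally and capture only the amount
m = re.search(r"AWS Service Charges:\s*\$(\d+)", html)
amount = m.group(1) if m else None

print(variable_width_ok, amount)
```

The capturing group handles any amount of whitespace, at the cost of having to read group 1 instead of the whole match.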
