
Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements that follow one another sequentially rather than elements that are nested hierarchically.

Suppose we need to extract all the items under the “List:” header:

XPath

To extract item1 through item4, let’s set a starting bookmark:

<h1>List:</h1>

At the bottom, let’s set a closing bookmark:

<h2>The End</h2>

XPath’s text() node test selects text nodes. Apply it to the snippet and express the two bookmarks as conditions inside a predicate (the square brackets):

//text()[preceding-sibling::h1[1] = 'List:' and following-sibling::h2 = 'The End']

This is the result, expressed in separate text nodes:
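To make the predicate's logic concrete, here is a minimal Python sketch. Python's standard xml.etree.ElementTree does not support the text() node test or the sibling axes, so the sketch does not run the XPath itself. Instead it walks the siblings by hand and applies the same two conditions. The sample markup, the wrapper <div>, and the function name text_between are assumptions made for illustration. Unlike XPath, it compares only each header's direct text rather than its full string value.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the wrapper <div> and the item names are assumptions.
SNIPPET = (
    "<div><h1>List:</h1>"
    "item1<br/>item2<br/>item3<br/>item4"
    "<h2>The End</h2></div>"
)


def text_between(parent, start="List:", end="The End"):
    """Collect text nodes whose nearest preceding <h1> sibling reads `start`
    and that have some following <h2> sibling reading `end`."""
    children = list(parent)
    nearest_h1 = None
    found = []
    for i, child in enumerate(children):
        if child.tag == "h1":
            nearest_h1 = child.text  # plays the role of preceding-sibling::h1[1]
        # ElementTree stores the text that follows an element in its .tail
        if not child.tail:
            continue
        # plays the role of following-sibling::h2 = 'The End'
        has_end = any(s.tag == "h2" and s.text == end for s in children[i + 1:])
        if nearest_h1 == start and has_end:
            found.append(child.tail)
    return found


print(text_between(ET.fromstring(SNIPPET)))
```

With a library that implements full XPath 1.0, the expression above can be evaluated directly and this manual walk is unnecessary.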

Regex

This is how to extract the same items with a regular expression:

(?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>))

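First, the named group <item> lazily captures one or more characters other than ‘<’ and ‘>’. Next, \s* consumes any trailing whitespace so that it stays out of the capture. The item must then be followed by either <br/> or <h2>, wrapped in a non-capturing group (?:…); the inner parentheses around each alternative are not needed for the extraction.

The regex above is not tied to the headers (<h1>List:</h1>, <h2>The End</h2>), so it matches any text followed by <br/> or <h2>, wherever it appears. To anchor the extraction to the headers, use zero-width lookbehind and lookahead assertions (the first and last groups in the following regex); they check the surrounding text without consuming or capturing it.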
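As a quick check, the pattern can be run over a sample string. This is a sketch under assumptions: the sample markup is made up for illustration, and Python's re module spells named groups (?P<item>…) instead of (?<item>…), so the pattern is adapted accordingly.

```python
import re

# Hypothetical sample markup, assumed for illustration.
SAMPLE = "<h1>List:</h1>item1<br/>item2<br/>item3<br/>item4<h2>The End</h2>"

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
ITEM_RE = re.compile(r"(?P<item>[^<>]+?)\s*(?:(<br/>)|(<h2>))")

# Each match yields one item; finditer walks the string left to right.
items = [m.group("item") for m in ITEM_RE.finditer(SAMPLE)]
print(items)
```

If the real markup has line breaks between the tags, the captured items may carry leading whitespace, so a strip() on each item would be a reasonable addition.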

(?<=<h1>List:</h1>)\s*((?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>)))+\s*(?=The\sEnd</h2>)

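NOTE: This regex collects the items in the <item> capture group. Because the group sits inside a quantified group (+), a single match holds several captures, so after matching we need to iterate through all the stored captures of the group: <item>[i]. This relies on an engine that keeps every capture of a repeated group, as .NET does; many other engines keep only the last one.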
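In engines that keep only the last capture of a repeated group, a two-step approach is one possible workaround: match the anchored block first, then pull the items out of the matched text. The sketch below uses Python's re module, which behaves this way. The sample markup and the function name extract_items are assumptions, and the named group uses Python's (?P<item>…) spelling.

```python
import re

# Hypothetical sample markup, assumed for illustration.
SAMPLE = "<h1>List:</h1>item1<br/>item2<br/>item3<br/>item4<h2>The End</h2>"

# The anchored pattern from above, adapted to Python's named-group syntax.
ANCHORED_RE = re.compile(
    r"(?<=<h1>List:</h1>)\s*"
    r"((?P<item>[^<>]+?)\s*(?:(<br/>)|(<h2>)))+"
    r"\s*(?=The\sEnd</h2>)"
)
# Per-item pattern, applied only inside the anchored block.
ITEM_RE = re.compile(r"(?P<item>[^<>]+?)\s*(?:<br/>|<h2>)")


def extract_items(html):
    """Return every item between the two headers, or [] if they are absent."""
    block = ANCHORED_RE.search(html)
    if block is None:
        return []
    # block.group("item") would hold only the last item here, so rescan
    # the matched text to recover each one.
    return [m.group("item") for m in ITEM_RE.finditer(block.group(0))]


print(extract_items(SAMPLE))
```

The lookbehind works in Python because `<h1>List:</h1>` has a fixed length; a variable-length lookbehind would need a different engine or a restructured pattern.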
