Extracting sequential HTML elements with XPath and Regex

Often we need to extract HTML elements that follow one another sequentially rather than elements nested in a hierarchy.

Suppose we need to extract all the items under the “List:” header:

<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2> some other text... </html>

XPath

To extract item1 through item4, let’s set a bookmark at the top:

<h1>List:</h1>

At the bottom, let’s set another bookmark:

<h2>The End</h2>

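XPath selects text nodes with the text() node test. Apply it to the snippet and put the two bookmarks as conditions inside the predicate (the square brackets): the nearest preceding h1 sibling must be ‘List:’, and a following h2 sibling must be ‘The End’.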

//text()[preceding-sibling::h1[1] = 'List:' and following-sibling::h2 = 'The End']

This is the result, expressed in separate text nodes:

item1
-----------------------
item2
-----------------------
item3
-----------------------
item4
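
To make the bookmark logic concrete, here is a rough analogue in plain Python. It is not an XPath engine: it uses the standard-library html.parser module to walk the page in document order and keeps the text nodes that appear after an <h1> reading “List:”. The class name BookmarkedText and its fields are made up for this illustration, and as a simplification it stops at the first <h2> without checking that it reads “The End”.

```python
from html.parser import HTMLParser

HTML = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2> some other text... </html>"""


class BookmarkedText(HTMLParser):
    """Hypothetical helper: collect text nodes between two header bookmarks."""

    def __init__(self, start_text, end_tag):
        super().__init__()
        self.start_text = start_text  # text of the opening <h1> bookmark
        self.end_tag = end_tag        # tag that closes the region (simplified)
        self.in_h1 = False            # True while inside an <h1> element
        self.collecting = False       # True between the two bookmarks
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.collecting = False   # a new <h1> resets the bookmark
        elif tag == self.end_tag:
            self.collecting = False   # reached the closing bookmark

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_h1:
            # Start collecting only if this header is the one we want.
            self.collecting = (text == self.start_text)
        elif self.collecting and text:
            self.items.append(text)


parser = BookmarkedText(start_text="List:", end_tag="h2")
parser.feed(HTML)
parser.close()
items = parser.items
print(items)  # ['item1', 'item2', 'item3', 'item4']
```

Unlike the XPath expression, this sketch trims surrounding whitespace and drops whitespace-only text nodes.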

Regex

This is how to parse the items with a regular expression and get the same results as above:

(?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>))

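First, capture the target into the named group <item>: one or more characters other than ‘<’ or ‘>’, matched lazily. Then skip any trailing whitespace with \s*. What follows must be either <br/> or <h2>; the alternation is wrapped in a non-capturing group (?:...) because it only serves as a delimiter (the inner parentheses around each alternative are optional).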
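
As a minimal sketch, the same pattern can be tried with Python’s standard re module. Two things here are adaptations rather than part of the original expression: Python writes named groups as (?P<item>...) instead of (?<item>...), and the captured text is stripped because [^<>] also matches the newline in front of each item.

```python
import re

HTML = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2> some other text... </html>"""

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
ITEM_RE = re.compile(r"(?P<item>[^<>]+?)\s*(?:(<br/>)|(<h2>))")

# Each match is one item followed by its <br/> or <h2> delimiter.
items = [m.group("item").strip() for m in ITEM_RE.finditer(HTML)]
print(items)  # ['item1', 'item2', 'item3', 'item4']
```

“some text” and “List:” are skipped because they are followed by <h1> and </h1>, not by one of the two delimiters.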

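The Regex above is not tied to the headers (<h1>List:</h1>, <h2>The End</h2>), so it would match any run of text followed by <br/> or <h2> anywhere in the page. To anchor the extraction to the headers, use zero-width lookbehind and lookahead assertions (the first and last groups in the following Regex). Since the last item’s delimiter already consumes <h2>, the lookahead only needs to check for The End</h2>.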

(?<=<h1>List:</h1>)\s*((?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>)))+\s*(?=The\sEnd</h2>)

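NOTE: With this Regex, the items land in the <item> capture group, but because the enclosing group is repeated with a quantifier (+), a single match holds several captures for that group. This relies on an engine that keeps every capture of a repeated group (for example, .NET exposes them through Group.Captures). After matching, we need to iterate through all the stored captures of the group: <item>[i].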
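
Python’s standard re module keeps only the last capture of a repeated group, so the anchored Regex cannot be used there as is. One possible workaround, sketched below under that assumption, is to keep the lookbehind and lookahead anchors, capture the whole block between the headers, and then split it on <br/>. The group name block and the split step are not part of the original expression.

```python
import re

HTML = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2> some other text... </html>"""

# Step 1: isolate everything between the two header bookmarks.
# The lookbehind is fixed-width, which Python's re requires.
BLOCK_RE = re.compile(
    r"(?<=<h1>List:</h1>)(?P<block>.*?)(?=<h2>The\sEnd</h2>)",
    re.DOTALL,  # let '.' cross line breaks
)

match = BLOCK_RE.search(HTML)

# Step 2: split the block on <br/>, swallowing surrounding whitespace.
items = re.split(r"\s*<br/>\s*", match.group("block").strip()) if match else []
print(items)  # ['item1', 'item2', 'item3', 'item4']
```

If the headers are missing, search returns None and the list stays empty instead of raising an error.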
