Often, we need to extract HTML elements that follow one another sequentially rather than being nested hierarchically.
Suppose we need to extract all the items under the “List:” header:
<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2>
some other text...
</html>
XPath
To extract item1 through item4, let’s set a starting bookmark at:
<h1>List:</h1>
At the bottom, let’s set an ending bookmark at:
<h2>The End</h2>
The useful built-in tool for text processing in XPath is text(), which selects text nodes. So, just apply it to the snippet and place the two bookmarks as conditions inside square brackets.
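As a minimal sketch of what such an expression could look like, here is one way to run it from Python. It assumes the third-party lxml library is installed and that the snippet is well-formed enough to parse as XML. The exact expression is an assumption for illustration, not necessarily the one originally used: it treats the two bookmarks as preceding and following siblings of the text nodes.

```python
from lxml import etree  # third-party library: pip install lxml

snippet = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2>
some other text...
</html>"""

root = etree.fromstring(snippet)

# text() selects text nodes. The predicate in square brackets keeps only
# those that have the "List:" <h1> somewhere before them and the
# "The End" <h2> somewhere after them among their siblings.
# (Assumed expression, written for this example.)
expr = (
    "//text()[preceding-sibling::h1[.='List:'] "
    "and following-sibling::h2[.='The End']]"
)

# Each result is a text node that still carries its surrounding newlines,
# so we strip the whitespace.
items = [node.strip() for node in root.xpath(expr)]
print(items)
```

The text before the first header and after the last one fails one of the two sibling conditions, so it is left out.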
This is the result, expressed in separate text nodes:
item1
-----------------------
item2
-----------------------
item3
-----------------------
item4
Regex
The same items can be parsed with a regular expression, giving the results shown above.
First, capture into the target <item> group any run of characters other than ‘<’ or ‘>’. Then, skip trailing blank spaces if present. What follows must be either <br/> or <h2>, matched as a non-capturing group.
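The three steps above can be sketched with Python's built-in re module. This is an illustrative pattern under stated assumptions, not necessarily the original one: the group is written in Python's (?P<item>...) syntax, and the first captured character is required to be non-whitespace so that leading newlines are not captured.

```python
import re

snippet = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2>
some other text...
</html>"""

pattern = re.compile(
    r"(?P<item>[^<>\s][^<>]*?)"  # the item: no '<' or '>', starts with a non-space
    r"\s*"                        # skip blank space after the item, if present
    r"(?:<br/>|<h2>)"             # must be followed by <br/> or <h2> (non-capturing)
)

items = [m.group("item") for m in pattern.finditer(snippet)]
print(items)
```

Text such as "some text" or "List:" is skipped because it is followed by <h1> or a closing tag rather than by <br/> or <h2>.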
The above Regex is not tied to the headers (<h1>List:</h1>, <h2>The End</h2>), so it would also match similar items elsewhere on the page. If we want to anchor the extraction to the headers, we can use zero-width assertions: a lookbehind for the opening header at the start of the pattern and a lookahead for the closing header at the end.
NOTE: With this Regex, we still get the items in the <item> capture group, but because the group is repeated with a quantifier (+), a single match holds several captures. After Regex processing we need to iterate through all the captures stored for the group: <item>[i].
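Not every Regex engine stores all the captures of a repeated group. Python's built-in re module, for one, keeps only the last capture. So here is a hedged two-pass sketch of the anchored approach in Python: the lookbehind and lookahead isolate the block between the two headers, and a second step splits that block into items. The variable names are made up for this example.

```python
import re

snippet = """<html>
some text
<h1>List:</h1>
item1
<br/>
item2
<br/>
item3
<br/>
item4
<h2>The End</h2>
some other text...
</html>"""

# Pass 1: the zero-width assertions anchor the match to the headers
# without making the headers part of the match itself.
block = re.search(
    r"(?<=<h1>List:</h1>).*?(?=<h2>The End</h2>)",
    snippet,
    re.DOTALL,  # let '.' match newlines too
)

# Pass 2: split the isolated block on <br/> and drop surrounding whitespace.
items = []
if block:
    for part in re.split(r"<br\s*/>", block.group(0)):
        if part.strip():
            items.append(part.strip())

print(items)
```

The lookbehind here has a fixed width, which Python's re module requires. An engine that allows variable-width lookbehind and keeps every capture could do the same in a single pattern.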