Sometimes HTML code contains tags that hold nothing but whitespace characters. Often those tags are nested, as in the code below:
<div>
<div>
<div></div>
</div>
</div>
What regex might be used to find and remove those tags?
The obvious solution is <div>\s*?<\/div>. Here \s stands for "whitespace character" and is equivalent to [ \t\n\x0B\f\r]. That is, \s matches a space ( ), a tab (\t), a line break (\n), a vertical tab (\x0B, sometimes written as \v), a form feed (\f), or a carriage return (\r).
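The behaviour of this simple pattern on the nested example can be sketched in Java as follows. The class name EmptyDivDemo and the helper removeOnce are hypothetical names chosen for illustration; the pattern itself is the one given above. The sketch applies the pattern one pass at a time, to show that a single pass only removes the innermost empty pair.

```java
import java.util.regex.Pattern;

public class EmptyDivDemo {
    // The simple pattern from the text: <div>, optional whitespace, </div>.
    static final Pattern EMPTY_DIV = Pattern.compile("<div>\\s*?</div>");

    // Hypothetical helper: removes every currently empty <div> pair in one pass.
    static String removeOnce(String html) {
        return EMPTY_DIV.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The nested example from the text, with line breaks between the tags.
        String html = "<div>\n<div>\n<div></div>\n</div>\n</div>";

        String pass1 = removeOnce(html);   // only the innermost pair is gone
        String pass2 = removeOnce(pass1);  // the middle pair is now empty and goes too
        String pass3 = removeOnce(pass2);  // finally the outer pair

        // Print each stage with line breaks made visible.
        for (String stage : new String[] {html, pass1, pass2, pass3}) {
            System.out.println("[" + stage.replace("\n", "\\n") + "]");
        }
    }
}
```

For this input, three passes are needed before the string becomes empty, which is why the pattern has to be applied repeatedly when tags are nested.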
General case
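In the general case, we use the following regex: <(?<tag>[a-z]+?)( [^>]+?|)>\s*?<\/(\k<tag>)> where tag is a named capturing group that matches [a-z]+?, and \k<tag> is a backreference to it, so the closing tag must have the same name as the opening tag. The optional group ( [^>]+?|) allows the opening tag to carry attributes.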
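To see what the general pattern accepts and rejects, here is a minimal Java sketch. The class name GeneralPatternDemo and the helper emptyTagName are hypothetical; only the regex comes from the text. Note one consequence of [a-z]: tag names containing digits or uppercase letters, such as h1, are not matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GeneralPatternDemo {
    // The general pattern from the text, with backslashes escaped for a Java string.
    static final Pattern EMPTY_TAG =
            Pattern.compile("<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>");

    // Hypothetical helper: returns the tag name if the whole input is a single
    // empty element, or null if the pattern does not match the input.
    static String emptyTagName(String html) {
        Matcher m = EMPTY_TAG.matcher(html);
        return m.matches() ? m.group("tag") : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<span class=\"x\"> </span>", // attributes and whitespace: matches
            "<div></span>",               // mismatched closing tag: no match
            "<p>text</p>",                // has real content: no match
            "<h1></h1>"                   // digit in the tag name: no match with [a-z]
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + emptyTagName(sample));
        }
    }
}
```

If tags such as h1 should also be covered, the character class would need to be widened, for example to [a-z][a-z0-9]*; that is a possible adjustment rather than part of the original pattern.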
Java code
Because removing an empty tag can leave its parent empty, the regex has to be applied repeatedly until nothing matches. In Java we might use the following code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeEmptyTags(String html) {
    boolean compareFound = true;
    Pattern pattern = Pattern.compile(
            "<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>",
            Pattern.MULTILINE | Pattern.DOTALL);
    // Keep going while the previous pass still found an empty tag.
    while (compareFound) {
        compareFound = false;
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            compareFound = true;
            // Remove every empty tag found in this pass.
            html = matcher.replaceAll("");
        }
    }
    return html;
}
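A short usage sketch may help here. The class name EmptyTagsUsage and the sample markup are made up for illustration, and the method is restated in a compact do-while form (an equivalent variant, not the original code) so that the snippet compiles on its own.

```java
import java.util.regex.Pattern;

public class EmptyTagsUsage {
    // Same regex as in the text; it uses no ^, $ or '.', so no flags are passed here.
    static final Pattern EMPTY_TAG =
            Pattern.compile("<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>");

    // Compact variant of removeEmptyTags: repeat until a pass changes nothing.
    static String removeEmptyTags(String html) {
        String previous;
        do {
            previous = html;
            html = EMPTY_TAG.matcher(html).replaceAll("");
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        // Hypothetical input: one paragraph with content, one span that is
        // empty apart from whitespace and a nested empty <b>.
        String html = "<div><p>Hello</p><span class=\"x\"> <b></b> </span></div>";
        System.out.println(removeEmptyTags(html));
    }
}
```

In this sample, the first pass removes the empty b element, the second removes the span that became empty, and the div survives because it still contains the paragraph.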