How to parse messy encoded HTML

Let’s suppose you want to extract a price with a currency sign from a web page (e.g. £220.00), but the HTML source looks like this:

<div>cost: &#163;220.00</div>

Here the pound sign is HTML-encoded as the numeric character reference &#163;.

That makes it frustrating to pull out the bare price with a regex. How can you separate the price value from the currency symbol?

Well, I’ll show you.

The pound sign, like other special characters, is written here as an HTML entity, and we simply need to decode it. So, how do you decode HTML entities? In PHP you can use the html_entity_decode() function; in Python, use html.unescape() from the standard library’s html module.

PHP

$str = 'cost: &#163;220.00';
// Decode HTML entities back into plain characters
echo html_entity_decode($str);

Python 3.4+

import html
# Use a name other than "str" so the built-in type is not shadowed
text = html.unescape('cost: &#163;220.00')
print(text)
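
Once the entity is decoded, splitting the currency symbol from the amount is a small regex job. Here is a minimal sketch of that step in Python. The extract_price helper, the PRICE_RE pattern and the short list of currency symbols are my own assumptions for illustration, not part of any library, so adjust them to the pages you actually scrape.

```python
import html
import re
from decimal import Decimal

# Hypothetical pattern: a currency symbol followed by an amount.
# Only three symbols are listed here; extend the class as needed.
PRICE_RE = re.compile(r"(?P<currency>[£$€])\s*(?P<amount>\d+(?:\.\d{1,2})?)")


def extract_price(fragment):
    """Return (currency, amount) from an HTML fragment, or None if no price is found."""
    # Decode entities first so "&#163;" becomes "£" before matching
    text = html.unescape(fragment)
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Decimal keeps the amount exact, which matters for money
    return match.group("currency"), Decimal(match.group("amount"))


print(extract_price("<div>cost: &#163;220.00</div>"))
# ('£', Decimal('220.00'))
```

html.unescape() also handles the named and hexadecimal forms (&pound; and &#xA3;), so the same helper should work whichever way the page encodes the sign.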

One reply on “How to parse messy encoded HTML”

Hello,
and thanks for this awesome website about scraping.
You explained this for PHP and Python, but I am facing the same problem of messy HTML when using Java and the Jsoup library to parse some data. What do you suggest to solve that? Thanks in advance.
