How to parse messy encoded HTML

Post author: admin
Post date: November 11, 2015

Let’s suppose you want to extract a price with a currency sign from a web page (e.g. £220.00), but its HTML code is this:

&#163;220.00

which is obviously encoded HTML.

Parsing the bare price out of this with a regex can be pretty frustrating. How can you separate the price value from the currency notation?

Well, I’ll show you.

The pound sign (like other special characters) is written here as an HTML entity, and we simply need to decode it. So, how do you decode HTML entities? In PHP you can use the html_entity_decode() function, while in Python you can use the unescape() function from the standard html module (html.unescape).

PHP

// The numeric entity &#163; stands for the pound sign
$str = 'cost: &#163;220.00';
echo html_entity_decode($str); // prints: cost: £220.00

Python 3.4+

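import html

# Avoid naming the variable "str", which would shadow the built-in type
text = html.unescape('cost: &#163;220.00')
print(text)  # prints: cost: £220.00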
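Once the entity is decoded, separating the currency sign from the amount becomes a plain regex job. Below is a minimal sketch, assuming the price uses one of a few common symbols (£, $, €) and an optional thousands comma. The names parse_price and PRICE_RE are illustrative only and not part of any library.

```python
import html
import re
from decimal import Decimal

# Hypothetical pattern: a currency symbol, optional whitespace, then a
# number with optional thousands commas and an optional decimal part.
PRICE_RE = re.compile(r'([£$€])\s*(\d+(?:,\d{3})*(?:\.\d+)?)')

def parse_price(raw):
    """Decode HTML entities, then split the symbol from the amount."""
    text = html.unescape(raw)          # '&#163;220.00' becomes '£220.00'
    match = PRICE_RE.search(text)
    if match is None:
        return None                    # no recognizable price in the text
    symbol, amount = match.groups()
    # Drop thousands separators before converting to an exact decimal
    return symbol, Decimal(amount.replace(',', ''))

print(parse_price('cost: &#163;220.00'))  # ('£', Decimal('220.00'))
```

Decimal is used instead of float so that the amount keeps its exact value. Extend the symbol set in PRICE_RE if your pages use other currencies.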

Tags PHP, Python

One reply on “How to parse messy encoded HTML”

Hello, and thanks for this awesome website about scraping.
You explained this in PHP and Python, but I am facing the same problem of messy HTML when using Java and the Jsoup library to parse some data. What do you suggest to solve that? Thanks in advance.
