Let’s suppose you want to extract a price with a currency sign from a web page (e.g. £220.00), but in the page’s HTML source the currency sign is written as an encoded entity, for example &pound;220.00, rather than as the literal character.
This makes it pretty frustrating to pull out the bare price with a regex. How can you separate the price value from the currency notation?
Well, I’ll show you.
The pound sign (like other special characters) is stored as an HTML-encoded entity, and we simply need to decode it. So, how do you decode HTML-encoded entities? In PHP you can use the html_entity_decode() function, and in Python the html.unescape() function from the standard html module.
PHP
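$str = 'cost: &pound;220.00';
// Decode named and numeric entities; the result is a UTF-8 string
echo html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // cost: £220.00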
Python 3.4+
import html
# Avoid naming the variable str, which would shadow the built-in type
text = html.unescape('cost: &pound;220.00')
print(text)  # cost: £220.00
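Once the entities are decoded, separating the amount from the currency sign is a plain regex job. Here is a minimal sketch in Python, assuming the price is a single currency symbol followed by a number with optional thousands separators. The helper name extract_price and the pattern are illustrative only and would need adjusting for other currencies or number formats.

```python
import html
import re
from decimal import Decimal

# Illustrative pattern (an assumption, not a general price parser):
# one currency symbol, optional whitespace, then digits with optional
# comma-separated thousands and an optional decimal part.
PRICE_RE = re.compile(r"([£$€])\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def extract_price(raw):
    """Decode HTML entities, then return (symbol, amount) or None."""
    text = html.unescape(raw)          # "&pound;" and "&#163;" both become "£"
    match = PRICE_RE.search(text)
    if match is None:
        return None
    symbol, amount = match.groups()
    # Drop thousands separators before converting to an exact decimal
    return symbol, Decimal(amount.replace(",", ""))


print(extract_price("cost: &pound;220.00"))  # ('£', Decimal('220.00'))
```

Decoding first keeps the regex simple: it only has to know about the literal symbols, not about every way an entity can be spelled in the source.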
One reply on “How to parse messy encoded HTML”
Hello, and thanks for this awesome website about scraping.
You explained this in PHP and Python, but I am facing the same messy HTML problem when using Java and the Jsoup library to parse some data. What do you suggest to solve that? Thanks in advance.