Let’s suppose you want to extract a price with a currency sign from a web page (e.g. £220.00), but in the page’s HTML source the currency sign appears as an encoded entity, for example &pound;220.00.
This makes it pretty frustrating to parse the plain price with a regex. How can you separate the price value from the currency notation?
Well, I’ll show you.
The pound sign, like other special characters, is written in the source as an HTML entity, and we simply need to decode it. So, how do you decode HTML entities? In PHP you can use the html_entity_decode() function, while in Python you can use html.unescape() from the standard html module:
$str = 'cost: &pound;220.00';
echo html_entity_decode($str); // prints "cost: £220.00"
import html
text = html.unescape('cost: &pound;220.00')
print(text)  # prints "cost: £220.00"
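
Once the entity is decoded, separating the currency sign from the value is a small regex job. Here is a minimal sketch of that last step in Python. The helper name split_price, the PRICE_RE pattern and the set of currency symbols it accepts are my own assumptions for illustration, so adjust them to the pages you actually scrape.

```python
import html
import re
from decimal import Decimal

# Assumed pattern (not from any library): one currency symbol, optional
# whitespace, then digits with optional thousands separators and decimals.
PRICE_RE = re.compile(r'(?P<currency>[£$€])\s*(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)')


def split_price(raw):
    """Decode HTML entities in raw, then return (currency, amount) or None."""
    text = html.unescape(raw)  # '&pound;220.00' becomes '£220.00'
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to an exact decimal value.
    amount = Decimal(match.group('amount').replace(',', ''))
    return match.group('currency'), amount


if __name__ == '__main__':
    print(split_price('cost: &pound;220.00'))
```

Decimal is used instead of float so the price keeps its exact value. The function returns None when no price is found, so the caller can decide how to handle pages without one.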