Cheerio scraper escapes special symbols with html entities when performing .html()

When scraping data off the web, we use Node.js along with the handy Cheerio scraper. When fetching .html(), the Cheerio parser returns special symbols as HTML-encoded entities, e.g.:
ä as &#xE4;
ß as &#xDF;
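
The hex digits in such an entity are simply the character's Unicode code point. A minimal sketch of that mapping follows; the helper name toHexEntity is hypothetical and is not part of Cheerio:

```javascript
// Hypothetical helper: build the hex numeric entity for the first character of a string.
function toHexEntity(ch) {
  const codePoint = ch.codePointAt(0);
  return `&#x${codePoint.toString(16).toUpperCase()};`;
}

console.log(toHexEntity('ä')); // &#xE4;
console.log(toHexEntity('ß')); // &#xDF;
```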

The Cheerio developers' justification of the parser's behavior

(1) It’s not the job of a parser to preserve the original document. 
(2) .html() returns an HTML representation of the parsed document, which doesn’t have to be equal to the original document.
source.

URLs are already encoded (as seen in the Chrome inspector)

Yet some URLs are already encoded with HTML entities,
e.g. the image URL of a product: https://assets.einhell.com/im/imf/y400/900_412992/einhell-classic-inverter-schweissgerät-tc-iw-170-detailbild-ohne-untertitel-8.jpg
In the Chrome code inspector, the ä in this URL appears as an HTML entity.

Solution

We need to decode the HTML entities back to their character representation.

  1. HTML entities & hex code table
  2. Regex to find HTML entities, both named and hex: &(?:#x[0-9A-Fa-f]+|\w+);
  3. The replacement object for HTML entities and hex:
replacer = {'&#xF6;': 'ö', '&ouml;': 'ö',
     '&#xE4;': 'ä', '&auml;': 'ä',
     '&#xFC;': 'ü', '&uuml;': 'ü',
     '&#xEB;': 'ë', '&euml;': 'ë',
     '&#xDF;': 'ß', '&szlig;': 'ß'
};
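
The regex and the replacement table can also be combined into a single pass over the string. This is only a sketch: the names ENTITY_TABLE, ENTITY_RE and decodeKnownEntities are made up for illustration, the table covers just the German characters listed here, and the lookup is case-sensitive (so `&#xe4;` in lowercase would be left as-is).

```javascript
// Illustrative entity-to-character table (German umlauts, ë and ß only).
const ENTITY_TABLE = {
  '&#xF6;': 'ö', '&ouml;': 'ö',
  '&#xE4;': 'ä', '&auml;': 'ä',
  '&#xFC;': 'ü', '&uuml;': 'ü',
  '&#xEB;': 'ë', '&euml;': 'ë',
  '&#xDF;': 'ß', '&szlig;': 'ß',
};

// Matches hex (&#xE4;), decimal (&#228;) and named (&auml;) entities.
const ENTITY_RE = /&(?:#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi;

// Replace each entity found in the table; leave unknown entities untouched.
function decodeKnownEntities(str) {
  return str.replace(ENTITY_RE, (match) => ENTITY_TABLE[match] ?? match);
}

console.log(decodeKnownEntities('L&#xE4;nge &amp; Gr&ouml;&szlig;e'));
// Länge &amp; Größe
```

Entities that are not in the table, such as `&amp;` above, pass through unchanged, so markup-significant characters stay escaped.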

JavaScript code to decode the HTML entities back to characters

const str = '<strong>L&#xE4;nge</strong><strong>H&#xF6;he</strong>... L&#xE4;nge...';

// Replace every known HTML entity (hex or named) with its character.
function replace_hex_all(str) {
	// Declared with const so the table does not leak into the global scope.
	const replacer = {
		'&#xF6;': 'ö', '&ouml;': 'ö',
		'&#xE4;': 'ä', '&auml;': 'ä',
		'&#xFC;': 'ü', '&uuml;': 'ü',
		'&#xEB;': 'ë', '&euml;': 'ë',
		'&#xDF;': 'ß', '&szlig;': 'ß'
	};
	for (const [key, value] of Object.entries(replacer)) {
		// split/join replaces all occurrences of the key in one step.
		str = str.split(key).join(value);
	}
	return str;
}
console.log('new str: ', replace_hex_all(str));

Code on GitHub: https://github.com/igorsavinkin/einhell-scraper/blob/main/hex_val_to_unicode.js
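
A fixed table only covers the characters listed in it. For other scripts (Cyrillic, for example) a more general approach is to decode any numeric entity straight from its code point. The following is a minimal sketch under that assumption; the function name decodeNumericEntities is hypothetical, and named entities such as `&auml;` are deliberately not handled here.

```javascript
// Decode every numeric entity (hex like &#x431; or decimal like &#1073;) in one pass.
function decodeNumericEntities(str) {
  return str.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the entity untouched if it is beyond the Unicode range.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('L&#xE4;nge, &#x431;&#x44B;&#x43B;, Stra&#223;e'));
// Länge, был, Straße
```

Doing this in a single replace call avoids accidentally decoding the same text twice.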

One reply on “Cheerio scraper escapes special symbols with html entities when performing .html()”

This does not help me. I have normal HTML with UTF-8 encoding, but after cheerio.load(html) only the Cyrillic symbols look like ‘&#x431’. And I do not need it to decode; the HTML entities are already encoded in my HTML! This is exactly what it does if you disable the option { decodeEntities: false }.
