Now we start a new Scraper Test Drive stage called ‘Invalid HTML’. How do scrapers behave with broken HTML code? Overall they did well, the most common problem being a failure to recognize the unmatched-quotes link.
You can also see the test’s tasks in the following picture:
The overall results are summed up in the following table:
[table “old_19” not found /]
Dexi.io
The result of this scraper is as follows:
2>1 & 1<2 < div>
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
Dexi.io correctly replaced the signs <, > and & in the HTML code. The nonHTML tag was properly closed. The invalid ‘<span<span>’ tag was accepted as a custom tag and was closed. Inside the ‘a’ tag with the address ‘millepah.com’, a space was inserted between the ‘id’ and ‘href’ attributes to make it work. The incorrectly placed ‘<div>’ and ‘<span>’ tags around the line ‘bad nesting’ were put into the correct sequence. Dexi.io used the encoding from the HTTP header. Using the “Search and grab” action you can even get the link with mismatched quotes (href="http://scrapetools.com'). For this, the following regular expression was used: href=("|')([^'"]+?)("|')
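The pattern above can be tried outside Dexi.io as well. Here is a minimal Python sketch; the sample tag is an assumption reconstructed from the test-page description, not the test page itself:

```python
import re

# The quote-tolerant pattern from the article: either quote character may
# open or close the attribute, so a mismatched " ... ' pair still matches.
HREF_RE = re.compile(r"""href=("|')([^'"]+?)("|')""")

# A sample tag with unmatched quotes (assumed): opens with " but closes with '
html = """<a id=lnk href="http://scrapetools.com'>scrape tools</a>"""

m = HREF_RE.search(html)
print(m.group(2))  # http://scrapetools.com
```

The URL lands in the 2nd capture group because groups 1 and 3 only consume the (possibly mismatched) quote characters.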
Web Content Extractor
The result for this scraper is as follows:
2>1 & 1<2 nonHTML unclosed scrapetools.com” millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
The scraper successfully completed almost all the tasks. The result shows that it can scrape the unmatched-quotes link and that it pays more attention to the HTTP header than to the meta tag.
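To illustrate what “header-attentive” versus “meta-attentive” means, here is a minimal Python sketch (not the code of any scraper under test) in which the two charset sources deliberately disagree:

```python
import re

# The two charset sources a scraper must choose between: the HTTP
# Content-Type header says utf-8, the <meta> tag says windows-1251.
http_header = "Content-Type: text/html; charset=utf-8"
body_bytes = "<meta charset=windows-1251>проверка".encode("utf-8")

def charset_from_header(header: str) -> str:
    m = re.search(r"charset=([\w-]+)", header)
    return m.group(1) if m else "utf-8"

def charset_from_meta(raw: bytes) -> str:
    m = re.search(rb"<meta[^>]*charset=([\w-]+)", raw)
    return m.group(1).decode("ascii") if m else "utf-8"

# A header-attentive scraper (most of those tested) decodes correctly,
# so the Cyrillic word 'проверка' survives:
print(body_bytes.decode(charset_from_header(http_header)))
# A meta-attentive scraper would decode the utf-8 bytes as windows-1251,
# turning the Cyrillic text into mojibake:
print(body_bytes.decode(charset_from_meta(body_bytes)))
```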
Screen Scraper
This scraper extracts by regex, so one needs to set an extraction pattern with regular expressions or something similar. That might suit invalid-HTML scraping, but one cannot predict in advance what mistakes the target page will contain, so I did the scrape with one general pattern. The result:
Visual Web Ripper
The scrape result is as follows:
2>1 & 1<2 nonHTML unclosed millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
Again the scraper did not find the unmatched-quotes link, and it paid more attention to the HTTP header than to the meta tag.
Update
After some consultation and checking with the Sequentum tech team, I realized that VWR is able to extract the unmatched-quotes link using the following regex:
taking the 2nd capture group as the result.
Content Grabber
We ran the test on Content Grabber, and it produced almost the same result as its predecessor, VWR:
2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
The scraper paid more attention to the HTTP header than to the meta tag.
As for extracting the unmatched-quotes link, Content Grabber can be programmed to do it. Just grab the whole area, choose Inner HTML, and use the following regex in the transformation script to refine the link:
taking the 2nd capture group as the result: return $2
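Since the regex screenshot is not reproduced here, the sketch below assumes the same quote-tolerant pattern shown in the Dexi.io section; a small Python stand-in for the transformation step might look like this:

```python
import re

# Hypothetical stand-in for the Content Grabber transformation step: the
# actual regex is not shown in the text, so this assumes the quote-tolerant
# pattern from the Dexi.io section.
PATTERN = r"""href=("|')([^'"]+?)("|')"""

def transform(inner_html: str) -> str:
    """Refine the grabbed Inner HTML down to the link, like `return $2`."""
    m = re.search(PATTERN, inner_html)
    return m.group(2) if m else ""

# Sample grabbed area (assumed) with an unmatched-quotes href:
inner_html = """<a id=lnk href="http://scrapetools.com'>scrape tools</a>"""
print(transform(inner_html))  # http://scrapetools.com
```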
OutWit Hub
The OutWit Hub result is as follows:
This scraper fails to scrape the unmatched quotes, and again it pays more attention to the HTTP header than to the meta tag.
WebSundew Scraper
Result is as follows:
(saved with utf-8 encoding) 2>1 & 1<2 nonHTML unclosed "" millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
Again we see the unmatched-quotes failure, and the scraper favors the HTTP header over the meta tag.
Helium Scraper
The result for Helium Scraper is this:
2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
Again the unmatched-quotes problem, and the scraper is HTTP-header attentive rather than meta-tag attentive.
Mozenda
Result is here:
Again the scraper missed the unmatched quotes element, and it paid more attention to the header than to the meta tag.
Easy Web Extract
The result is as follows:
The scraper did not recognize the unmatched-quotes link and paid more attention to the header than to the meta tag.
Summary
The scrapers generally did satisfactorily, passing 5 out of 7 tasks. Web Content Extractor, Visual Web Ripper and Content Grabber did best (6 out of 7 tasks): WCE could scrape the unmatched-quotes link, while VWR and CG are good at applying a regex to a chosen page area (text transformation). The rest failed to recognize the unmatched quotes (a single quote ' instead of a closing double quote "). Whether a scraper pays attention to the meta tag or to the HTTP header also differentiated them. See the table above.