TEST DRIVE: Invalid HTML

This post opens a new stage of the Scraper Test Drive called ‘Invalid HTML’. How do scrapers behave when the HTML code is broken? On the whole they did well; the one problem that nearly all of them shared was failing to recognize a link with unmatched quotes.

The test tasks are also shown in the following picture:

The overall results are summed up in the following table:

[Table: overall results per scraper; not available]

Dexi.io

The scraper's result is this:

2>1 & 1<2 < div>
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header

Dexi.io correctly replaced the <, > and & signs in the HTML code. The nonHTML tag was properly closed. The invalid ‘<span<span>’ tag was accepted as a custom tag and closed. Inside the ‘a’ tag with the address ‘millepah.com’, a space was inserted between the ‘id’ and ‘href’ attributes to make it work. The incorrectly nested ‘<div>’ and ‘<span>’ tags around the ‘bad nesting’ line were put in the correct order. Dexi.io used the encoding from the HTTP header. With the “Search and grab” action you can also get the link with unmatched quotes (href="http://scrapetools.com'). The following regular expression was used for this: href=("|')([^'"]+?)("|')
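To see why that regular expression catches the broken link, here is a minimal Python sketch. It is not how Dexi.io works internally; the `extract_hrefs` helper and the `sample` snippet are made up for illustration and only imitate the kind of markup on the test page.

```python
import re

# The pattern from the text, with plain straight quotes.
# Group 1 = opening quote, group 2 = link value, group 3 = closing quote.
HREF_RE = re.compile(r"""href=("|')([^'"]+?)("|')""")


def extract_hrefs(html):
    """Return every href value, even if opening and closing quotes differ."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


# Hypothetical snippet: the first link opens with " but closes with '.
sample = """<a href="http://scrapetools.com'>scrapetools.com</a> <a id="x" href="http://millepah.com">millepah.com</a>"""

links = extract_hrefs(sample)
print(links)
```

The pattern accepts either quote character on each side independently, so a value opened with a double quote and closed with a single quote still matches. The price is that it would also accept other mismatched pairs, which is usually fine for a scrape of invalid HTML.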

Web Content Extractor

The result for this scraper is as follows:

2>1 & 1<2 nonHTML unclosed scrapetools.com” millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header

The scraper completed almost all the tasks successfully. The result shows that it can scrape the link with unmatched quotes and that it pays more attention to the HTTP header than to the meta tag.

Screen Scraper

This scraper extracts data by regex, so you need to define the extraction pattern with regular expressions or something similar. That may suit a scrape of invalid HTML, but you cannot predict what kind of mistake the target page will contain. So I ran the scrape with one general pattern. The result:

2>1 & 1<2
nonHTML
unclosed
 
millepah.com
bad nesting
(windows-1251) wrong meta
проверка (utf-8) wrong header

The scraper did not pick up the unmatched quotes link (<a href=scrapetools.com>). Since the Cyrillic word “проверка” (Russian for “check”) survives in “проверка (utf-8) wrong header” but is lost in “(windows-1251) wrong meta”, this scraper pays more attention to the HTTP header than to the meta tag.
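The header-versus-meta question can be illustrated with a small Python sketch. It makes several assumptions: the helper names are hypothetical, the sample page is invented (UTF-8 bytes, a correct header and a wrong meta tag), and real scrapers and browsers use more elaborate detection than this.

```python
import re


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r"charset=([\w-]+)", content_type or "", re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(raw):
    """Return the charset named in a <meta> tag of the raw bytes, or None."""
    match = re.search(rb"<meta[^>]*charset=[\"']?([\w-]+)", raw, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(raw, content_type, prefer_header=True):
    """Decode raw bytes, trusting either the HTTP header or the meta tag first."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(raw)
    order = [header, meta] if prefer_header else [meta, header]
    charset = next((name for name in order if name), "utf-8")
    return raw.decode(charset, errors="replace")


# Hypothetical page: the bytes are UTF-8, the header agrees, the meta tag lies.
raw = '<meta charset="windows-1251"><p>проверка</p>'.encode("utf-8")
by_header = decode_page(raw, "text/html; charset=utf-8")
by_meta = decode_page(raw, "text/html; charset=utf-8", prefer_header=False)

print("проверка" in by_header)  # word survives when the header is trusted
print("проверка" in by_meta)    # word is garbled when the wrong meta tag wins
```

A scraper that shows the Cyrillic word intact only where the header is right is therefore following the header, which is the reasoning used throughout this test.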

Visual Web Ripper

The scrape result is as follows:

2>1 & 1<2 nonHTML unclosedmilleph.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header

Again the scraper has not found the unmatched quotes link, and it paid more attention to the http-header, rather than to the meta tag.

Update

After consulting and checking with the Sequentum tech team, I realized that VWR is able to extract the unmatched quotes link using the following regex:

href=("|')([^'"]+?)("|')

with the second capture group as the result.


Content Grabber

We also ran the test on Content Grabber, and it produced almost the same result as its predecessor, VWR.

2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header

The scraper paid more attention to the http-header, rather than to the meta tag.

As for extracting the unmatched quotes link, Content Grabber can be programmed to do it. Just grab the whole area, choose Inner HTML and use the following regex in the transformation script to refine the link:

href=("|')([^'"]+?)("|')

with the second capture group as the result: return $2

OutWit Hub

The OutWit Hub result is as follows:

2>1 & 1<2
nonHTML
unclosed
 
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header

This scraper fails to scrape the unmatched quotes link, and again it pays more attention to the HTTP header than to the meta tag.

WebSundew Scraper

The result is as follows:

(for utf-8 encoding saved) 2>1 & 1<2 nonHTML unclosed “” millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header

Again, the scraper fails on the unmatched quotes link and does not pay attention to the meta tag.

Helium Scraper

The result for Helium Scraper is this:

2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header

Again the unmatched quotes problem, and the scraper pays attention to the HTTP header rather than to the meta tag.

Mozenda

Result is here:

Again the scraper missed the unmatched quotes element, and it paid more attention to the header than to the meta tag.

Easy Web Extract

The result is as follows:

2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header

The scraper did not recognize the unmatched quotes link and paid more attention to the header rather than to the meta tag.

Summary

On the whole the scrapers performed satisfactorily, passing 5 out of 7 tasks. Web Content Extractor, Visual Web Ripper and Content Grabber did best, each passing 6 out of 7. WCE scraped the unmatched quotes link directly, while VWR and CG can get it by applying a regex to a chosen page area (text transformation). The rest failed to recognize the link with unmatched quotes (a single quote ' in place of the closing double quote "). The scrapers also differed in whether they paid attention to the meta tag or to the HTTP header. See the table above.
