Now we start a new Scraper Test Drive stage called ‘Invalid HTML’. How do scrapers behave with broken HTML code? Overall they did well, the most common problem being a failure to recognize the unmatched-quotes link.
You can also see the test’s tasks in the following picture:
The overall results are summed up in the following table:
[table “old_19” not found /]
Dexi.io
The result of this scraper is as follows:
2>1 & 1<2 < div>
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
Dexi.io correctly replaced the signs <, > and & in the HTML code. The nonHTML tag was properly closed. The invalid ‘<span<span>’ tag was accepted as a custom tag and was closed. Inside the ‘a’ tag with the address ‘millepah.com’, a space was inserted between the ‘id’ and ‘href’ attributes to make it work. The incorrectly placed ‘<div>’ and ‘<span>’ tags around the line ‘bad nesting’ were put into the correct sequence. Dexi.io used the encoding from the HTTP header. Using the “Search and grab” action you can even get the link with mismatched quotes (href="http://scrapetools.com'). For this, the following regular expression was used: href=("|')([^'"]+?)("|')
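The pattern above can be tried outside Dexi.io as well. Here is a minimal Python sketch; the sample tag is an assumption reconstructed from the test-page description, not the test page itself:

```python
import re

# The quote-tolerant pattern from the article: either quote character may
# open or close the attribute, so a mismatched " ... ' pair still matches.
HREF_RE = re.compile(r"""href=("|')([^'"]+?)("|')""")

# A sample tag with unmatched quotes (assumed): opens with " but closes with '
html = """<a id=lnk href="http://scrapetools.com'>scrape tools</a>"""

m = HREF_RE.search(html)
print(m.group(2))  # http://scrapetools.com
```

The URL lands in the 2nd capture group because groups 1 and 3 only consume the (possibly mismatched) quote characters.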
Web Content Extractor
The result for this scraper is as follows:
2>1 & 1<2 nonHTML unclosed scrapetools.com” millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
The scraper successfully completed almost all the tasks. The result shows that it can scrape the unmatched-quotes link and that it pays more attention to the HTTP header than to the meta tag.
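To illustrate what “header-attentive” versus “meta-attentive” means, here is a minimal Python sketch (not the code of any scraper under test) in which the two charset sources deliberately disagree:

```python
import re

# The two charset sources a scraper must choose between: the HTTP
# Content-Type header says utf-8, the <meta> tag says windows-1251.
http_header = "Content-Type: text/html; charset=utf-8"
body_bytes = "<meta charset=windows-1251>проверка".encode("utf-8")

def charset_from_header(header: str) -> str:
    m = re.search(r"charset=([\w-]+)", header)
    return m.group(1) if m else "utf-8"

def charset_from_meta(raw: bytes) -> str:
    m = re.search(rb"<meta[^>]*charset=([\w-]+)", raw)
    return m.group(1).decode("ascii") if m else "utf-8"

# A header-attentive scraper (most of those tested) decodes correctly,
# so the Cyrillic word 'проверка' survives:
print(body_bytes.decode(charset_from_header(http_header)))
# A meta-attentive scraper would decode the utf-8 bytes as windows-1251,
# turning the Cyrillic text into mojibake:
print(body_bytes.decode(charset_from_meta(body_bytes)))
```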
Screen Scraper
This scraper extracts by regex, so one needs to set an extraction pattern with regular expressions or something similar. That might suit invalid-HTML scraping, but one cannot predict in advance what mistakes the target page will contain, so I did the scrape with one general pattern. The result:
Visual Web Ripper
The scrape result is as follows:
2>1 & 1<2 nonHTML unclosed millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
Again the scraper did not find the unmatched-quotes link, and it paid more attention to the HTTP header than to the meta tag.
Update
After some consultation and checking with the Sequentum tech team, I realized that VWR is able to extract the unmatched-quotes link using the following regex:
taking the 2nd capture group as the result.
Content Grabber
We ran the test on Content Grabber, and it produced almost the same result as its predecessor, VWR:
2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
The scraper paid more attention to the HTTP header than to the meta tag.
As for extracting the unmatched-quotes link, Content Grabber can be programmed to do it. Just grab the whole area, choose Inner HTML, and use the following regex in the transformation script to refine the link:
taking the 2nd capture group as the result: return $2
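Since the regex screenshot is not reproduced here, the sketch below assumes the same quote-tolerant pattern shown in the Dexi.io section; a small Python stand-in for the transformation step might look like this:

```python
import re

# Hypothetical stand-in for the Content Grabber transformation step: the
# actual regex is not shown in the text, so this assumes the quote-tolerant
# pattern from the Dexi.io section.
PATTERN = r"""href=("|')([^'"]+?)("|')"""

def transform(inner_html: str) -> str:
    """Refine the grabbed Inner HTML down to the link, like `return $2`."""
    m = re.search(PATTERN, inner_html)
    return m.group(2) if m else ""

# Sample grabbed area (assumed) with an unmatched-quotes href:
inner_html = """<a id=lnk href="http://scrapetools.com'>scrape tools</a>"""
print(transform(inner_html))  # http://scrapetools.com
```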
OutWit Hub
The OutWit Hub result is as follows:
This scraper fails to scrape the unmatched quotes, and again it pays more attention to the HTTP header than to the meta tag.
WebSundew Scraper
Result is as follows:
(saved with utf-8 encoding) 2>1 & 1<2 nonHTML unclosed "" millepah.com bad nesting проверка (windows-1251) wrong meta проверка (utf-8) wrong header
Again we see the unmatched-quotes failure, and the scraper favors the HTTP header over the meta tag.
Helium Scraper
The result for Helium Scraper is this:
2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
проверка (windows-1251) wrong meta
проверка (utf-8) wrong header
Again the unmatched-quotes problem, and the scraper is HTTP-header attentive rather than meta-tag attentive.
Mozenda
Result is here:
Again the scraper missed the unmatched quotes element, and it paid more attention to the header than to the meta tag.
Easy Web Extract
The result is as follows:
The scraper did not recognize the unmatched-quotes link and paid more attention to the header than to the meta tag.
Summary
The scrapers generally did satisfactorily, passing 5 out of 7 tasks. Web Content Extractor, Visual Web Ripper and Content Grabber did best (6 out of 7 tasks): WCE could scrape the unmatched-quotes link, while VWR and CG are good at applying a regex to a chosen page area (text transformation). The rest failed to recognize the unmatched quotes (a single quote ' instead of a closing double quote "). Whether a scraper pays attention to the meta tag or to the HTTP header also differentiated them. See the table above.