
TEST DRIVE: Text list


We’d like to introduce the new SCRAPER TEST DRIVE stage, called ‘Text list‘. This seemingly simple test case hides a non-ordinary structure: the HTML DOM is so plain that it makes you scratch your head, wondering how to approach it. Yet the off-the-shelf products have shown their best features, extracting even the smallest things from this seemingly plain content.

Here is the text list content:

The source code is from pages on which publishers don’t take the trouble to format their data using HTML elements:

 
CITY          POPULATION
------------------------
New York      8,244,910
(City of New York)
Los Angeles   3,819,702
Chicago       2,707,120
change: +0.43%
Houston       2,145,146
Philadelphia  1,536,471
(City of Philadelphia)
Phoenix       1,469,471

Overview

It seems that no XPath can be applied to this HTML markup. I failed to compose projects on the fly in some scrapers, and you may experience the same. Along the way, I realized these lists are Regex-only cases, so I developed a Regex expression to fetch city and population values while bypassing everything else.

\b(?<City>[A-Z]{1}(\w+\s)+)\s*(?<Population>\d[\d,]+)

For Regex basics, see the Regex expressions post. I also referred to the Extracting sequential elements post, which was helpful for this case.
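To double-check the expression, here is a minimal sketch in plain JavaScript (not any scraper's built-in engine); sampleText is simply a shortened copy of the list above:

// A minimal sketch: run the City/Population regex over the raw text list.
const sampleText = [
  'New York      8,244,910',
  '(City of New York)',
  'Los Angeles   3,819,702',
  'change: +0.43%',
].join('\n');

const re = /\b(?<City>[A-Z]{1}(\w+\s)+)\s*(?<Population>\d[\d,]+)/g;
for (const m of sampleText.matchAll(re)) {
  console.log(m.groups.City.trim(), m.groups.Population);
}
// Prints "New York 8,244,910" and "Los Angeles 3,819,702"; the note and change lines are skipped.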

The scrapers have shown their might on this tiny task. Admittedly, this task, especially the notes related to the previous record line, is not for a typical user to set up, but rather for an advanced one. In almost all the scrapers, I had to turn to the scraper’s support for help to compose a project.

As I perform the testing, I’ll be adding the results to this post, so don’t be concerned if your favorite scraper is not yet in the results list. At this point, OutWit Hub, Visual Web Ripper, Mozenda and Web Content Extractor have been tested, showing good results. Every scraper project/agent works effectively on all the text list versions (1-5), and each one has done this in its own way, using both Regex and XPath.

Overall results


Dexi.io

Extracting data from a list in Dexi.io did not go very smoothly. Regular expressions are used to separate the main lines (with the name of the city and the population) from the note lines: the first letter of the string is checked, and if the line begins with a capital letter it is a main line; if not, it is a note. In the Dexi.io extractor, the ‘Search and grab’ and ‘HTML contains’ steps automatically add the ‘i’ flag to the regular expression, so the regex search ignores case and lines that begin with a lowercase letter are also treated as main lines.
Dexi.io support said that it is impossible to remove the ‘i’ flag, or to add other flags. Therefore, to extract these data we used JavaScript.
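A two-line illustration of the problem, in plain JavaScript outside of Dexi.io:

// With the forced 'i' flag, the "starts with a capital letter" test also accepts note lines:
/^[A-Z]/i.test('change: +0.43%');  // true  - the note line is wrongly treated as a main line
/^[A-Z]/.test('change: +0.43%');   // false - a case-sensitive check keeps main lines and notes apart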
First, create output fields. In my example: city, population and notes.
Second, after the ‘Go to URL’ step, add three ‘Remove element’ steps (for point 3, only the first one is needed):
a) In the first, write the CSS selector ‘div#case_textlist > b:nth-child(1)’ in the ‘Element paths’ field.
b) In the second, write ‘br’ and tick the ‘Remove all’ checkbox (only for points 1 and 2).
c) In the third, write ‘#case_textlist > span:nth-child(1)’ (only for points 1 and 2).
d) After the first ‘Remove element’ step, add an ‘HTMLify text’ step and write ‘#case_textlist’ there (only for points 1 and 2).
Third, create a ‘Loop through elements’ loop before the ‘Save current output’ step. In the ‘Element paths’ field write ‘#case_textlist *’ (for points 1 and 2) or ‘#case_textlist b’ (for point 3). Inside the loop, add an ‘Execute Javascript’ step (for point 2, add a second one):
a) In the ‘Element paths’ field write ‘:self’. In the ‘Javascript’ field, write the following:

var str = $(this).text();
// A main line starts with a capital letter.
var result = str.match(/^[A-Z]/);

if(result)
{
    // Split the line into the city name and the population figure.
    var city = str.match(/[A-Za-z\s]+/);
    var population = str.match(/[0-9,]+/);
    $output.city = city[0];
    if(population)
        $output.population = population[0];
}

b) The second ‘Execute Javascript’ step is there to scrape the cities’ notes (if any). In its ‘Element paths’ field write ‘:self+span’. In the main settings, set the ‘Error handling’ field to ‘Ignore and continue’, and add the following Javascript:

var str = $(this).text();
// A note line is any line that does not start with a capital letter.
var result = str.match(/^[^A-Z]/);

if(result)
    $output.notes = str;

 

Web Content Extractor

Initially, I failed to compose a project with this scraper, but the support team soon provided three projects covering this case’s three requirements. Composing a project turned out to be fairly simple, with the city and population fields being split by a pre-defined Visual Basic script.
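The Visual Basic script itself is the scraper’s own pre-defined one, so I won’t reproduce it here; the splitting idea, sketched in JavaScript, is simply:

// Split one record line into city and population on the run of spaces between them.
const line = 'Los Angeles   3,819,702';
const parts = line.match(/^(.+?)\s{2,}([\d,]+)$/);  // parts[1] = 'Los Angeles', parts[2] = '3,819,702'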

The bold-records scrape project was also an easy task to compose, but for getting the records including notes, the scraper allowed a note to be placed only in the ‘City/Notes‘ field. This might not match the output scrape requirements. In the image, I’ve circled some notes in red; obviously, this kind of output may give a wrong impression of the records’ values and number.

I’m basically satisfied with this scraper as far as the ‘bare’ text list scrape goes, since fetching records including notes is a tricky task. More Regex functionality for further refining could be added to the scraper.

Visual Web Ripper

Transformation Makes Everything Scrape-able

Visual Web Ripper has content transformation features that seem to be unique among scrapers. The table denormalization in the Table Report test used the same capability.

The custom script transformation turns the plain text list into a table, ready for the scrape:

  1. Choose the content in the browser area
  2. Create a new content element (press New)
  3. Choose ‘Transformation‘ in the drop-down list
  4. In the Options area (left), choose ‘Custom‘ in the drop-down list and press ‘Transformation Script‘
  5. Compose or insert any pre-made script (see the script below; test it on the spot by clicking ‘Transform‘) and click ‘Save‘
  6. Then press the ‘Transform Page Now’ button, which will immediately transform the content for the next step

With help from support, I quickly realized that I needed to use Page Transformation to transform the original text content and then extract the data as I normally would. My own poor transformation was of no avail; its result was hardly further extractable, since there were still no HTML anchors for the scraper to hook onto. Perhaps I needed to insert more HTML markup at this stage to hook onto? Exactly; that’s what support did for me. I believe they are much more familiar with the scraper’s built-in functionality than I am.
The Regex-queries script transforms the plain text (point 5 in the above list) into a table. Don’t forget to tick ‘Use HTML as regex input‘ if you want to see changes. The script listing:

<b>
replace
</b>
replace
(.*)<br>
replace <table><tr><td>$1</td></tr></table>
<br>
replace </td></tr><tr><td>
(&nbsp;)+
replace </td><td>

I can easily understand these transformation lines, as I am familiar with Regex. Are they for you? If I need to keep the bold records marked, I just omit the first four lines:

<b>
replace
</b>
replace

This is the result, a plain HTML table:

<TABLE>
<TBODY>
<TR>
<TD>CITY</TD>
<TD>POPULATION</TD></TR>
<TR>
<TD>------------------------</TD></TR>
<TR>
<TD>New York</TD>
<TD>8,244,910</TD></TR>
<TR>
<TD>(City of New York)</TD></TR>
<TR>
<TD>Los Angeles</TD>
<TD>3,819,702</TD></TR>
<TR>
<TD>Chicago</TD>
<TD>2,707,120</TD></TR>
<TR>
<TD>change: +0.43%</TD></TR>
<TR>
<TD>Houston</TD>
<TD>2,145,146</TD></TR>
<TR>
<TD>Philadelphia</TD>
<TD>1,536,471</TD></TR>
<TR>
<TD>(City of Philadelphia)</TD></TR>
<TR>
<TD>Phoenix</TD>
<TD>1,469,471</TD></TR></TBODY></TABLE>
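For those who find the listing terse, here is the same replacement chain written as ordinary JavaScript string operations; this is only an illustration (not Visual Web Ripper’s engine), and rawListHtml stands in for a shortened copy of the original list’s inner HTML:

// Re-create the transformation script as a chain of regex replacements.
const rawListHtml =
  'CITY&nbsp;&nbsp;POPULATION<br>' +
  '<b>New York&nbsp;&nbsp;8,244,910</b><br>' +
  '(City of New York)<br>' +
  'Los Angeles&nbsp;&nbsp;3,819,702<br>';

const table = rawListHtml
  .replace(/<b>/gi, '')                                                    // drop opening bold markers
  .replace(/<\/b>/gi, '')                                                  // drop closing bold markers
  .replace(/([\s\S]*)<br\s*\/?>/i, '<table><tr><td>$1</td></tr></table>')  // wrap everything up to the last <br> in one table
  .replace(/<br\s*\/?>/gi, '</td></tr><tr><td>')                           // each remaining <br> starts a new row
  .replace(/(&nbsp;)+/g, '</td><td>');                                     // runs of &nbsp; split city from population
// 'table' now holds a small HTML table like the one shown above.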

Some More Special Features from the Scraper

The result table can then be parsed with the PageArea template and content elements. However, for the inclusive scrape that gets the records including notes, the developer proposed: “You can use the Visual Web Ripper XPath method SPAN in the Page Area template to capture the extra info, and you then have many options to format the output.” The link to that feature description is here. I couldn’t master this feature myself; this video will guide you through the steps.

Visual Web Ripper has again shown outstanding features that work well for unusually tricky content.

Content Grabber

First, you need to apply a transformation to the whole target item with id “case_textlist” (if prompted, choose the Transform page command). The transformation script is the same find-and-replace idea as in the Visual Web Ripper section above: the <b> and </b> tags are each replaced with an empty string, each text line matched by (.*) is wrapped into table markup around $1, and runs of whitespace matched by the regex

(\s){2,}

are replaced with a cell separator.

The original list then becomes a nice HTML table, with a CITY column holding the city names and note lines and a POPULATION column holding the figures.

After that, we proceed to extract items as usual with the click-and-select of Content Grabber’s visual scheme.

If we want to skip the notes, we need to edit the “City list” command (the one right after the transformation) and click Anchor to define some anchors to be extracted.

Only bold rows

If we want to extract only the cities in bold, we should not remove their markers at the transformation stage. Instead, the script adds some markup to the figures, so the transformation code starts with this pair:

([\d,]+)<\/b>
replace with <b>$1</b>

The remaining replacements are the same as in the first transformation: each line matched by (.*) is wrapped into table markup around $1, and runs of whitespace matched by (\s){2,} become cell separators. The first pair wraps the bold population figures in their own bold markup (replace with <b>$1</b>) so that the scraper can later recognize them as distinct fields, as opposed to the non-bold ones.
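A one-line illustration of that first pair, using a sample bold record in plain JavaScript:

// Re-wrap the population figure in its own <b> so it survives as a separately recognizable field.
'<b>New York&nbsp;&nbsp;8,244,910</b>'.replace(/([\d,]+)<\/b>/, '<b>$1</b>');
// -> '<b>New York&nbsp;&nbsp;<b>8,244,910</b>'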

Basically, Content Grabber has done well at extracting this poorly marked-up text list by transforming it into an HTML table.

Mozenda

The Mozenda extractor sees the text list as one whole element, so I grabbed it and then parsed it further through ‘Refine Captured text‘. With this Refine toolbar, I did the extraction easily:

  • Choose ‘Capture definition’
  • Define the Regex expression (don’t forget to tick ‘Regular Expression’ box). The Regex to capture content into named groups is:
    \b(?<City>([A-Z]{1}(\w+\s)+)+)\s*(?<Population>\d[\d\,]+)
  • Click on ‘Capture List’ to select multiple matches of the Regex (this step was directed by support).

For regular expression usage in Mozenda, read here. I got the bold records easily by adding <B> and </B> (case sensitive) at both ends of the Regex, which now looks like this:

<B>(?<City>([A-Z]{1}(\w+\s)+))\s*(?<Population>\d[\d\,]+)</B>

Cities Including Notes

Capturing the cities including notes took some Regex trial and error. For Regex extraction based on HTML (HTML mode instead of plain text mode; otherwise Mozenda strips the HTML markup prior to the custom refine), follow the above-mentioned steps but switch to the HTML view.

The final Regex is:

\b(<B>)?(?<City>([A-Z]{1}(\w+\s)+))\s*(?<Population>[\d\,]+)(</B>)?(<BR>(?!((<B>)?[A-Z]{1}(\w+\s)+))(?<Note>.*?)<BR>)?

The secret of this Regex is the part (?!((<B>)?[A-Z]{1}(\w+\s)+)), which imitates Regex inversion (“not city-like text”) through a negative lookahead.
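A simplified version of that lookahead, tried in plain JavaScript:

// The negative lookahead lets a line through only when it does NOT start like a city record.
const notCityLike = /^(?!(<B>)?[A-Z]\w+\s)/;
notCityLike.test('(City of New York)');       // true  - a note line, keep it
notCityLike.test('Philadelphia  1,536,471');  // false - a city line, reject it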

Basically, Mozenda provides the functionality needed for tough cases, if you are skilled in Regex and XPath.

OutWit Hub

OutWit Hub’s initial page load found nothing on the target page, so I immediately composed a scraper with Regex to capture the content. A stumbling block is that the scraper’s Regex processing (in the ‘Format‘ field) is not case sensitive. After my complaint about the bug, support explained the cause to me and inserted a directive (#caseSensitive#) that made the scraper extract properly. Directives (Pro version only) alter the normal behavior of the scraper; they can be located anywhere in the scraper, are identified by # characters, and are interpreted by the program before any other line.

To scrape only bold-font records, I added <b> as the Marker Before to the City scraper field and </b> as the Marker After to the Population scraper field.

Confused about getting the records including comments, I turned to support, and they helped by approaching the project another way. The \x26 stands for ‘&’ in the new 3.0.1.21 release that is in beta testing. In that project there is no use of the #caseSensitive# directive, so <b> equals <B>. To avoid scraping the header ‘CITY          POPULATION‘, the patterns start to be executed after the #start# directive, which equals ‘—————‘.

To fetch the bold records, the scraper uses the last, 5th line. This bold-records extraction takes everything in bold and then splits it on the separator (&nbsp;)+ into the two columns, City and Population. This split functionality is a Pro-version-only feature.
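The same split idea takes one line of plain JavaScript:

// Break a bold record into City / Population on the run of &nbsp; entities between them.
'New York&nbsp;&nbsp;&nbsp;8,244,910'.split(/(?:&nbsp;)+/);  // ['New York', '8,244,910']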

My impression is that OutWit has a very good Regex interface (supporting split and repeat functionality) that allows it to scrape very difficult cases without support help.

Screen Scraper

I composed a project with this extraction pattern:

>~@City@~~@Space@~~@Population@~~@Space2@~~@Comment@~<b

It commands the scraper to extract data into the City, Population and Comment extraction tokens, and the data are then saved in session variables (the Space and Space2 tokens are ignored, but they work as delimiters).

For the City token, I defined the \b([A-Z]{1}([a-z]\s?)+) regex pattern (similar to what I composed initially). See the image at right.

For the Population token, I defined \d[\d,]+ regex pattern.

For the Comment token, I defined ([^&]+?)? regex pattern.

For the Space token, I defined [&nbsp;]+ regex pattern.

For the Space2 token, I defined (</b>)?<br /> regex pattern.

A result for a single record emerged as:

City=San Jose
Space=&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;
Population=967,487
Space2=<br />
Comment=945,942 in 2010

To get rid of the leading </b><br /> before the Comment token, I added the Space2 token, which captures and excludes those HTML tags. The Space and Space2 tokens are needed since the scraper does not support non-capturing passive groups (?:); they are not stored in session variables.
Once mastered, Screen Scraper is a powerful tool where Regex application is concerned.

Helium Scraper

I failed to make any Kinds myself, but support built the project using JavaScript and JS gatherers.
The whole process of composing the project was recorded in two videos, video 1 and video 2.
To highlight the basic steps needed to extract data in this kind of tricky test:

  1. The online pre-made “Force elements into same row” was imported (at File -> Online Premades) in order to accomplish this. Please see the project’s description on the right side of the Online Premades dialog.
  2. The Wrap action forces data into single rows. We include it in the Do Wrap action tree (Execute Action Tree -> Wrap).
  3. Only after running the Do Wrap action tree do we define the Kinds: Line, Bold line and Notes.
  4. Add the Do Wrap action tree to every action tree, before the actual extraction-into-tables process.
  5. The JavaScript gatherers JS_IsNote and JS_IsData were written manually. You can find them at Project -> JavaScript Gatherers. (video 1)
  6. Two text gatherers were created (with the tool at Project -> Text Gatherers) to separate the city (JS_City) and population (JS_Population). (video 2)

JS gatherers are an awkward fit for a text list, where Regex should play the major role, yet even here Helium Scraper has shown its might. The scraper developers issued a new beta in which this project was made and run.

In my opinion, one should be “almost a developer” to master and apply these techniques. We hope this review, with its videos, will help you get into Helium Scraper’s advanced techniques.

WebSundew Extractor

I failed to do any extraction with this scraper, so its support composed the projects for this case. Those projects work, but I couldn’t reproduce them (compose them by myself).

I expect this visual scraper to be further improved, so that users can do tricky tasks like this one.

Easy Web Extractor

With this scraper I failed to do much, since its selection wizard is XPath-oriented (with some extra scripting functionality) and differentiates records in <b></b> tags only. Even those I failed to extract, and support did not reply to my help request.

Conclusion

This Test Drive stage has exposed the scrapers that are able to handle almost any extraction task. Those scrapers include both XPath and Regex functionality, allowing use by someone who is not a programmer (the exception is Helium Scraper, which works through JS gatherers). Of course, if a scraper can run custom scripts (Visual Web Ripper, Screen Scraper, Helium Scraper and others), it can scrape nearly everything, but then the user needs programming skills. In such cases the scraper’s rating is lower than that of the scrapers that do the same without invoking custom scripts.
