
TEST DRIVE: Table Report


In this post, I'll start to share our experiences with different web scrapers on the Testing Ground project. The first test, which I thought would be the simplest one, proved to be irksome and discouraging. With a struggle, I completed the test drive on Table Report. This test evaluates how well different scrapers cope with difficult tables: merged cells, missing values and so on. See the test table layout:

Overview

Basically, the scrapers handle a simple HTML table layout well, but a sophisticated table layout is much more difficult for them.

Common problems

  1. Unmerging merged cells
  2. Tables with an indefinite number of columns to scrape
  3. Extracting table headers
  4. Extracting only a certain year's info from a table


Dexi.io

The Dexi extractor can complete task points 1, 3 and 4 without JavaScript code. The second point of the task (tolerate merged columns and rows) is handled with JavaScript code (see the 'Execute JavaScript' steps in the figures).

How the extractor deals with “indefinite number of rows and columns”

To store an indefinite amount of data from tables with an undefined number of columns, one-dimensional or multi-dimensional arrays can be used, as in the sketch below.
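
As a plain-JavaScript illustration (not Dexi-specific; the 'table tbody tr' selector is illustrative), a table with an unknown column count can be read row by row into an array of arrays:

// Read a table with an unknown number of columns into an array of
// arrays; each row keeps however many cells it actually has.
var rows = document.querySelectorAll('table tbody tr');
var data = [];

for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].querySelectorAll('td, th');
    var row = [];
    for (var j = 0; j < cells.length; j++) {
        row.push(cells[j].textContent.trim());
    }
    data.push(row);
}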

How the extractor unmerges columns and rows

Using the first JavaScript code (in the first figure), the 'colspan' and 'rowspan' attributes are removed from cells holding an item count or amount of goods, and placeholder cells containing '-' are created in the freed positions.

var cols = document.querySelectorAll("table tbody [colspan='2']");
var tableTr = document.querySelectorAll("table tbody tr");

// unmerge cells with the 'colspan' attribute
for(var i = 0; i < cols.length; i++)
{
    cols[i].removeAttribute('colspan');
    cols[i].insertAdjacentHTML('afterend', '<td>-</td>');
}

// unmerge cells with the 'rowspan' attribute
for(var i = 0; i < tableTr.length; i++)
{
    var td = tableTr[i].querySelectorAll("td");
    
    // extra rows are skipped
    if(td.length <= 2)
        continue;
        
    for(var j = 0; j < td.length; j++)
    {
        if(td[j].hasAttribute('rowspan'))
        {
            var span = +td[j].getAttribute('rowspan');
            td[j].removeAttribute('rowspan');
            
            // insert the missing cells into the following rows
            for(var k = 1; k < span; k++)
            {
                var aTd = tableTr[i+k].querySelectorAll("td");
                if(j >= aTd.length)
                    aTd[aTd.length - 1].insertAdjacentHTML('afterend', '<td>-</td>');
                else
                    aTd[j].insertAdjacentHTML('beforebegin', '<td>-</td>');
            }
        }
    }
}

How to extract the total amount for the year

The second JavaScript code (in the first figure) adds the class 'total' to the rows holding the total amount for each year.

var tableTr = document.querySelectorAll('table tbody tr');

// tag every 'Total for <year>' row with the class 'total'
for(var i = 0; i < tableTr.length; i++)
{
    var m = tableTr[i].innerHTML.match(/<b>Total for [0-9]{4}<\/b>/);
    
    if(m)
    {
        tableTr[i].setAttribute('class', 'total');
    }
}

This way, the third JavaScript code (in the third figure) can find the total amount for a given year without going through the whole table.

let regex = new RegExp("Total for " + $output.Year);
let regex2 = new RegExp("<b>[^a-zA-Z0-9_][0-9.,]+</b>"); // e.g. <b>$238,481.10</b>
var total = document.querySelectorAll('.total');

for(var i = 0; i < total.length; i++)
{
    // is this the 'Total for <year>' row we are after?
    var m = regex.test(total[i].innerHTML);
    
    if(m)
    {
        var res = total[i].innerHTML.match(regex2);
        $output.Total_amount.total_for_year = res[0]; // the matched amount
    }
}

How to make the robot extract only a certain year

To obtain data for a certain year, the action 'Assert output equals' (in the second figure) is used. The specified year is compared with the one recorded from the table on the second branch; in essence it is the equality check sketched below.
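
A trivial plain-JavaScript sketch of that check, with illustrative names (the actual assertion mechanics are handled by the Dexi step itself):

// Compare the requested year with the year scraped from the table;
// on a mismatch the branch is abandoned (signalled here by an exception).
function assertYearEquals(requestedYear, scrapedYear) {
    if (requestedYear !== scrapedYear) {
        throw new Error('Year mismatch: got ' + scrapedYear +
                        ', expected ' + requestedYear);
    }
}

assertYearEquals('2000', '2000'); // passes; '1999' would throw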

I have split the whole extractor scheme into separate figures for the sake of readability:

Figure 1: Table1

Figure 2: Table2

Figure 3: Table3

Figure 4: Table4

Figure 5: Table5

OutWit Hub

OutWit Hub did the best at table scraping, far superior to its competitors except Visual Web Ripper. It grabs tables or lists as they are and immediately allows them to be exported, so zero time spent on project creation is one of its basic merits. OutWit Hub does handle merged cells, but it failed to keep the table header column names (Product, Items, Amount, etc.) in separate columns. See the complex header:

Quarter | Product 1      | Total Amount
        | Items | Amount |

The program could extract only a whole table, not certain rows from it. The upcoming release, Version 3.1, will allow multiple selection criteria in the bottom panel on all pages of the software, and then it will be possible to tell it to choose rows after "x" and before "y". OutWit Hub also does not create merged cells (like the one for the 'Total for 2000' value) in the final table; it grabs the values from merged cells and duplicates them into separate cells. Yet, unlike most other scrapers, it doesn't shift values when unmerging cells. Instead, it generates the needed cells and auto-fills them. It also parses vertically aligned merged cells excellently. Most scrapers treat a merged cell as a single cell of one row, so a data shift is unavoidable:

There is no data shift with OutWit:

Visual Web Ripper

Visual Web Ripper performed better than OutWit, though with much time wasted. The only respect in which Visual Web Ripper beats OutWit is that it can be set to extract only a certain year's data from a table. Initially, I lost much time trying to go across columns in PageArea patterns while creating a project. Visual Web Ripper support worked on it, providing results I never imagined would be possible. The support team suggested that "users would simply use a direct copy/paste into Excel which would yield a perfect result." However, the software doesn't merely grab HTML data and save it into Excel; data may be saved to a structured DB or another format as well.

Support team: It’s not impossible to use the software to copy the HTML table into Excel, but it’ll not yield a perfect result like a direct copy/paste into Excel.

You need to follow these steps to extract a simple copy of the HTML table:

  1. Use a Transformation element to denormalize the table.
  2. Use two PageArea templates, one for each direction in the table (rows and columns).
  3. Configure the second PageArea template to extract data into the parent table instead of a separate table.

The “Save data method” is on the Advanced options tab. Visit here if you want to know how to configure it:

The VWR can be programmed to get part of a table from the whole one. For this, you need to use an Input Parameter for the year, so you can specify the year you are interested in before running the project. Define a yearArea template with input parameters right after the table denormalization template:

The result is awesome:

Quarter        | Product 1   | Product 1   | Product 2   | Product 2
Quarter        | Items       | Amount      | Items       | Amount
2000           | 2000        | 2000        | 2000        | 2000
Q1             | 34          | $6,762.60   |             |
Q2             | 44          | $8,751.60   | 50          | $9,945.00
Q3             | 59          | $11,735.10  | 63          | $12,530.70
Q4             | 70          | $13,923.00  | 73          | $14,519.70
Total for 2000 | $238,481.10 | $238,481.10 | $238,481.10 | $238,481.10

Content Grabber

Scraping the table, especially extracting certain years, took me more effort.

Note: To set up all the following commands, you need to switch the Content Grabber layout to Expert Layout in the Application Settings ribbon menu. (Otherwise, all the advanced features are hidden so as not to confuse novice users.)

Choose a case (link)

First we create a Navigate Link command that will click on a Case link (e.g. 1 product, 1 quarter). We can see its XPath in the command config, on the Properties tab:

Content Grabber XPath properties

How do we set its XPath to choose a particular case?

The particular case can be chosen through the global variable case inserted (see get-global-data('case')) inside the XPath expression:

//div[@id='caseinfo']/ul[2]/li[position()=get-global-data('case')]/a

You might also have noticed the get-global-data() function in it, one of the custom Content Grabber XPath functions. The custom Content Grabber XPath functions might be found here.
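
For comparison, here is the same pick-the-Nth-case selection in plain JavaScript, without the custom XPath function (caseNo stands in for the case input parameter; the selector mirrors the XPath above and assumes the two ul lists are siblings inside #caseinfo):

var caseNo = 2; // illustrative value of the 'case' input parameter
var link = document.querySelector(
    '#caseinfo ul:nth-of-type(2) li:nth-child(' + caseNo + ') a');
if (link) {
    link.click(); // navigate to the chosen case
}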

We can change the command's XPath expression on the Selection tab of the upper ribbon menu (there you can also add a second or third XPath for the command):

Content Grabber Selection tab of the ribbon menu for changing the XPath

Now we set an input parameter case on the Agent Settings tab -> Input Parameters. This case number will later be fetched when the Case command is executed:

Content Grabber input parameters

Denormalize table

After you've defined the Case command, you definitely need a table transformation. As you select the table, you apply a transformation command, which should be Denormalize HTML table. This removes all the joined columns/rows and makes the table easily scrapable!

Content Grabber denormalize HTML table

To extract all the table info we create two commands: Rows and Columns.

How to choose the rows of a particular year?

Let's reconfigure the Rows command by changing its XPath expression (on the Selection tab) to the following.

The first XPath, for selecting a certain year (when the year input parameter is set), looks like this:

//div[@id='case_table']/table/tbody/tr[not(@class)][td[1] != '' and td[1]=get-global-data('year')]/following-sibling::tr[position() < node-position(../tr[td='']) - 1]

The XPath function get-global-data('year') fetches the year input parameter, which is then used to select all the rows between that year's header row and the next empty row.

Then add the second XPath, for selecting all the rows (when the year input parameter is not set):

//div[@id='case_table']/table/tbody/tr[not(@class)][td[1] != '' and get-global-data('year') = '']

This XPath simply selects all the rows when the year input parameter is empty. For readers who find these XPaths hard to follow, the same row-selection logic is sketched in plain JavaScript below.
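
A plain-JavaScript sketch of that logic (selectors are illustrative; Content Grabber itself evaluates the XPaths above, not this code): starting from the row whose first cell equals the requested year, take every following row until the next empty separator row.

function rowsForYear(year) {
    var rows = document.querySelectorAll('#case_table table tbody tr');
    var result = [];
    var collecting = false;

    for (var i = 0; i < rows.length; i++) {
        var first = rows[i].querySelector('td');
        var text = first ? first.textContent.trim() : '';

        if (!collecting) {
            if (text === year) collecting = true; // found the year's header row
        } else {
            if (text === '') break; // an empty row ends the year's block
            result.push(rows[i]);
        }
    }
    return result; // with no year given, simply take every row instead
}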

The above-mentioned XPath expressions were provided by Content Grabber support. Kind of complicated, sure, but these XPaths rely on the Content Grabber extended functionality, e.g. node-position().

One can see a command's XPath expressions under the Properties tab -> Web Selection -> Selection -> Paths -> Path.

Content Grabber rows xpaths

All done. Wonderful! Content Grabber has demonstrated advanced functionality while still being visual scraping software.

Mozenda

Mozenda isn't convenient for scraping complex tables. It could not avoid the data shift caused by merged cells, though it did process indefinite table columns, and you can tune it to get certain rows from the table.

Mozenda support explained that unstructured data (like merged/unmerged cells) doesn't fit a pattern, but "it could capture data indefinitely if the data followed the same pattern… Mozenda can select all the items in an infinite grid if the pattern holds true…. but I don't think that we will be able to gather the data in these tables without some distortion or having a finite amount of columns or rows".

Mozenda proved able to scrape data from tables that may change dimensions; a support engineer sent us an Agent that does it. Yet the output is a list of items rather than a table.

If you want only a certain year's data scraped in a project, then as you build an Agent, choose 'Begin item list – <custom name>' in the Actions area (bottom left corner); then, in the Tools area (upper left corner), in the 'Selected action' sub-area, click 'Refine list' and tick 'Only include items at the following indexes' (there you define the rows pertaining to a certain year). See the image below:

WebSundew Extractor

With this scraper, I failed to make a project that could extract anything, and soon I gave up and asked for help. The support replied: "The scraper is for fixed dimensions data only". They composed a project for the extraction, but it wasn't up to standard.

  • They created 3 patterns for extracting Year, Quarter and the Items & Amount, respectively.
  • A joined pattern was then made.
  • Afterward, ProductN was added to it.

The output resulted in a mere listing; no ‘Totals’ were scraped:

Year Quarter Product Items Amount
2000 Q1 Product 1 16 $3,182.40
2000 Q1 Product 2 21 $4,176.90
2000 Q1 Product 3 25 $4,972.50
2000 Q1 Product 4 30 $5,967.00
2000 Q1 Product 5 34 $6,762.60
2000 Q2 Product 1 27 $5,370.30
2000 Q2 Product 2 32 $6,364.80
2000 Q2 Product 3 36 $7,160.40

The developers promised to improve the product in this area.

Helium Scraper

Helium Scraper did well, though only after I asked support for help, which was kindly given. Helium Scraper handled merged cells. For this, a special premade JS gatherer needed to be loaded first (ask support); then check Project -> Options and select Property Gatherers under the kind-defining tab.

After some communication with the support team, they sent me the project file that completes the extraction.

The Helium Scraper team: "The database-way to store this kind of information is to use multiple related tables." Two tables are related by year: the Details table (Items & Amount data) and the Year table (totals for each year); see the image below. This information can then be exported as XML, HTML or any other format by using the Custom Export feature.

Multi-join is easy. After extracting the data into tables, go to the 'Queries' tab in the database panel, write a query and run it. The following image shows how to run an SQL query on the two extracted tables to create a joined table; the join idea is also sketched in code below:
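
The relational idea in miniature, as a plain-JavaScript sketch rather than Helium Scraper's actual SQL (the record shapes and field names are illustrative; the figures come from the test table above):

// two record sets related by the shared year column
var details = [
    { year: 2000, quarter: 'Q1', items: 34, amount: '$6,762.60' },
    { year: 2000, quarter: 'Q2', items: 44, amount: '$8,751.60' }
];
var totals = [
    { year: 2000, total: '$238,481.10' }
];

// join them on year, like the SQL query in the image
var joined = details.map(function (d) {
    var t = totals.filter(function (x) { return x.year === d.year; })[0];
    return {
        quarter: d.quarter,
        items: d.items,
        amount: d.amount,
        yearTotal: t ? t.total : null
    };
});

console.log(joined);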

Basically, you'd need to create kinds for every product. As for setting up indefinite table columns for scraping in a project, you need to create additional JS gatherers, which does require some advanced skills.

When I asked for year-specific information to be extracted from a table, support proposed creating a new kind, a kinds intersection (with the Create Set kind button in the kinds editor). This kind selects only cells under the year that matches the YearFilter kind, which has a single property, JS_Year, equal to the year of interest. After that, another actions tree extracts only one year, because it uses the filtered version of the kind:

My impression is that for a difficult scrape with Helium Scraper, you’ll need to ask the support center.

Screen Scraper

A Screen Scraper project is a challenge to compose. I requested help from support, and soon Jason (a support member) sent me a project with patterns for the extraction.

To handle the indefinite column number, the support team composed 2 patterns to be applied sequentially, plus a script to save the data to a log file (a hypothetical sketch of such a script follows). This regex-like approach with unfolding patterns improved my opinion of Screen Scraper. See the patterns:
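
Since the actual support script lived in a figure, here is only a hypothetical sketch of a logging script of this kind, assuming Screen Scraper's standard scripting objects (session, dataRecord) and invented extractor-token names:

// Hypothetical sketch only: YEAR, QUARTER and AMOUNT are invented
// extractor-token names; session.log() and dataRecord are standard
// Screen Scraper scripting objects.
var year = dataRecord.get("YEAR");
var quarter = dataRecord.get("QUARTER");
var amount = dataRecord.get("AMOUNT");

session.log(year + " | " + quarter + " | " + amount);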

Screen Scraper does extract the data, but it also has the merged cells issue: totals for each year, and the year designations themselves, were not captured. Truly, this task is not for a casual user. The result is in the session log (see the red boxes):

Screen Scraper support promised “to map each row to account for rowspans (colspans are easy)”, but hasn’t replied yet.

Web Content Extractor

Web Content Extractor is good for scraping fixed-width tables, but it lacks some expected features (merged-cell handling, etc.). The support team admitted: "Sorry, our software is not able to extract data from this type of table." See the obvious bugs after scraping the sophisticated table layout:

WebHarvy

After some struggle to harvest a table, I emailed SysNucleus and the developers replied, "…Support for scraping data from HTML tables is low in WebHarvy. The tool has been designed with more focus on ability to scrape data from normal listings in webpages (search results), again focusing on ease-of-use." It works better to "extract data from a site like yellow-pages where data is displayed in a well formatted manner". See the poor result:

Web Data Extractor

Web Data Extractor is focused on collecting lists of emails, URLs and the like, and its output was always a list of items. I asked the developers about tabular output, and the Spadix software specialist admitted, "Information in any case will be given as list of items. Current version doesn't have this ability." See the result at right.

Easy Web Extractor

Easy Web Extractor did poorly as well. The data shift is obvious, and it also failed to get the info for totals. In some tables (too large and too small), the data were not found at all when the project was applied to several table links. The developers' team said: "Easy Web Extract is not designed for excel-like table extracting. That means EWE doesn't try to detect and convert any number from the table." In spite of that, I think you can use this tool for a simple table layout extraction. See the image below:

Conclusion

After the first attempts in the Scraper Test Drive, we might nominate OutWit Hub and Visual Web Ripper for consistent HTML table data scraping, but Content Grabber has done the best of all. Content Grabber proved to be visual yet highly developer-adjustable scraping software. It supported tables with an indefinite number of rows and columns, tolerated merged columns and rows, parsed currency formats, and was able to extract only a specified year from a given table.

Most of the scrapers require composing a new project for each table with a different column count, which diminishes a project's value. Since most scrapers are designed to extract data from commercial web catalogs (e.g., yellow pages or eBay), their scope of work doesn't include tabular data. Three web scrapers openly admitted not being able to extract such tabular info. OutWit Hub stands out for its capability to scrape both tabular and listed data. Content Grabber and Visual Web Ripper proved to be monster scrapers in comparison with the others.
