The next stage in the Scraper Test Drive tests the scrapers' ability to parse a blocks layout. This test evaluates how well different scrapers cope with difficult blocks layouts, especially those in which there is no direct HTML association among the data presented on screen.
Here are the test blocks layouts:
For this test, we’ve created 2 cases. The first is a simple blocks (<DIV>) layout. The second is harder because there’s no direct DOM association within a single record: the name and description sit in one container and the price in another. See the code snippet here:
<div class="left">
  <div class='prod2'>
    <div class='name'>Dell Latitude D610-1.73 Laptop Wireless Computer </div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB
  </div>
  <div class='prod1'>
    <div class='name'>Samsung Chromebook (Wi-Fi, 11.6-Inch) </div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB
  </div>
  <div class='ads'>ADVERTISEMENT</div>
</div>
<div class="right">
  <div class='price2'>$239.95</div>
  <div class='price1 best'>$249.00</div>
  <div class='ads'></div>
</div>
Not a real case?
In the process of testing, one support team complained that Case 2 is not a real case. Yes, such a layout is rare in the real world, but we have faced similar cases (and we assume that someone may want to develop a scrape-proof site).
Moreover, we want to find scrapers that are specialized for hard cases like this one. Notably, Visual Web Ripper’s non-standard XPath functionality overcame it easily (see the details here).
Note that building a single project/agent that would effectively scrape both cases turned out to be impossible for all the off-the-shelf scrapers.
The overall results
The work on the ‘Price List’ task is divided into 3 parts.
- The first part works with ‘case 1’.
- The second part performs the first point of the task with ‘case 2’.
- The third part performs points 2 and 3 of the task with ‘case 2’.
For any task, we create output fields where the data will be written. Here we create 3 fields: ‘name’, ‘description’ and ‘price’. Set the type ‘text’ for each field and mark each one as a ‘list’ (an array in Dexi).
The red text next to each extractor step shows the original name (type) of that step.
Each branch in the image corresponds to one of the 3 subtasks. Each branch uses a different CSS selector.
Any scraper in OutWit that is applied to captured HTML text works by regex. The HTML blocks layouts of the two test cases must be approached in completely different ways. The only way to extract the data properly is to create several OutWit scrapers (possible only in the Pro version) that extract the corresponding data separately. I created different scrapers for different cases.
The scraper I created for case 1 block layout consistently extracts all the items.
Also, I defined several macros based on the same scraper. These macros apply the scraper and sort the scraping results by a certain rule to extract only the needed types of data (e.g., best price, discount); see the image below. Each macro works for only one type of sorting.
Also, in a macro, I’ve set the Dig option to browse through all the page links and work through all 4 price lists given, so I get all discount price items in the same target list:
You may see here how to handle a scraper with a macro.
The extraction of the Case 2 blocks layout failed; even applying several scrapers didn’t succeed.
Overall, this scraper handles simple extraction but is not up to sophisticated cases.
WebSundew performed excellently, completing each task with a separate agent. Earlier, the support team had asked us to provide new challenge pages for testing, so this time we sent them this blocks layout. Soon they sent back a finished project with 6 agents, one for each case and task. The agents worked smoothly for all the different price lists. See the image below: the agents and data patterns are marked in red.
To capture data in an agent, I click ‘Capture’ on the front tab and, in the popup window, choose ‘Data iterator pattern’ (shortcut: in the main menu, Patterns -> New Iterator Pattern, see image at right). Then I simply point in the browser to the first relevant field, name it and click ‘Find’ to find a matching pattern. I choose among the patterns the engine found by clicking on the result rows. After that, I click ‘Next’ to complete the pattern building.
For the second case, I got help from support to define the advanced attributes for extraction. Here is a link to the video on how to perform it. The scraper user surely needs to understand the HTML and DOM-tree concept to be able to compose a project for such a tough case:
Basically, one needs to be familiar with HTML and XPath/XNode to be able to master this scraper for tough cases. This scraper succeeds in applying advanced pattern queries.
The approach to scraping the blocks layout is very similar to the one for the table layout. I created kinds for all fields. In the Actions, I also extracted special fields separately, into other result tables (for example, ‘discount’, see in red in the image below). The way to extract only records with a certain feature (best price, discount) is through an SQL query that joins the general result table (‘Case1_all_items’) and the special fields table (‘discount’) into one table. This is how I filter the records to get the relevant ones.
The SQL query to join two result tables by ‘discount’ field:
SELECT * FROM Case1_all_items INNER JOIN discount ON Case1_all_items.discount = discount.discount
and the query run results for the two tables for discount prices (in red oval):
In my opinion, this scraper allows you to compose extraction projects with more visual ease, compared to the others.
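The join described above can be tried outside the scraper with Python's built-in sqlite3 module. This is only a sketch: the table names come from the query in the text, but the column layout and all rows are placeholders invented for illustration, not the scraper's actual result tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the article only names the tables and the 'discount' column.
conn.executescript("""
    CREATE TABLE Case1_all_items (name TEXT, price TEXT, discount TEXT);
    CREATE TABLE discount (discount TEXT);
""")

# Placeholder rows, invented for illustration only.
conn.executemany(
    "INSERT INTO Case1_all_items VALUES (?, ?, ?)",
    [
        ("item A", "$100.00", None),            # no discount text on this record
        ("item B", "$200.00", "discount 5%"),
        ("item C", "$300.00", "discount 10%"),
    ],
)
conn.executemany(
    "INSERT INTO discount VALUES (?)",
    [("discount 5%",), ("discount 10%",)],
)

# Same join condition as in the article; records without a discount drop out
# because NULL never equals anything in an INNER JOIN condition.
rows = conn.execute(
    "SELECT Case1_all_items.name, Case1_all_items.price, discount.discount "
    "FROM Case1_all_items "
    "INNER JOIN discount ON Case1_all_items.discount = discount.discount "
    "ORDER BY Case1_all_items.name"
).fetchall()

for row in rows:
    print(row)
```

Only the records whose discount value also appears in the special fields table survive the join, which is the filtering effect described above.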
Visual Web Ripper did very well, the support team having kindly presented us with projects for the blocks layout.
The scraper was especially good at separating the details text from the name, unlike the other scrapers, which return both name and details together as the description:
<div class='prod1'> <div class='name'>Samsung Chromebook (Wi-Fi, 11.6-Inch)</div> 1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome </div>
To do it, check the ‘Tag text’ radio button in the Options area for details extraction. Selecting the ‘Tag text’ option ensures I won’t get text from the inner HTML tags:
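To make the idea concrete, here is a rough Python sketch of what "only the tag's own text" means for the snippet above. It is not how Visual Web Ripper implements the option; the class name OwnTextParser is made up for this example, and the sketch assumes simple markup without void elements such as <br>.

```python
from html.parser import HTMLParser

HTML = (
    "<div class='prod1'> <div class='name'>Samsung Chromebook (Wi-Fi, 11.6-Inch)</div> "
    "1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome </div>"
)


class OwnTextParser(HTMLParser):
    """Separates text lying directly inside the target element from text of nested tags."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0           # 0 = outside target, 1 = directly inside, >1 = in a nested tag
        self.own_text = []       # text that belongs to the target element itself
        self.nested_text = []    # text that belongs to inner tags

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:
            self.own_text.append(data)
        elif self.depth > 1:
            self.nested_text.append(data)


parser = OwnTextParser("prod1")
parser.feed(HTML)
parser.close()

# Collapse runs of whitespace left over from the markup.
details = " ".join("".join(parser.own_text).split())
name = " ".join("".join(parser.nested_text).split())

print("name:   ", name)
print("details:", details)
```

The details string contains only the hardware description, while the product name from the inner 'name' div ends up in a separate value.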
Extracting the specific records took 2 steps, instead of the one step needed for extracting all items:
- Create a PageArea template to define target records
- Define content elements for extraction
To compose a PageArea template that defines the area with the specific data (best price, discount), you will need to explore the HTML and write an XPath query. The anchors will be the class names ‘disc’ and ‘best’. Just start a new PageArea template and, in the Options area, choose ‘Set XPath manually’.
Here are the 2 queries for ‘discount’ and for ‘BEST PRICE’ respectively:
For the second case the procedure is similar:
- Create a PageArea template to define target area
- Define content elements for extraction
But extracting one complete record is not that simple for this layout. Support stated: “There is no built-in support for this sort of table layout… and most normal users probably wouldn’t be able to do this on their own”. As a developer, I was glad to see this solution in action.
- The PageArea template is bound to the determining ‘Price’ fields through XPath. See below the XPaths for PageAreas that select (1) all the records, (2) only those with a discount price and (3) only those with the best (red) price:
- The ‘Title’ and ‘Description’ fields are outside the PageArea template, but we still define them as extraction elements through XPath. Here we make the selection manually, because we need to use the ‘@root-node-position’ attribute, which connects the current price field node with the rest of its record:
This will extract the Price fields and, corresponding to them (with the same position: @root-node-position), Title and Description.
You can read more about the non-standard XPath methods, attributes and axes supported by Visual Web Ripper here (only for registered users).
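The position-based pairing described above can be illustrated in plain Python on the Case 2 snippet from the beginning of this post. This is a sketch of the idea only, not Visual Web Ripper's XPath extension: the BlockParser class is hypothetical, and it assumes that the n-th product block and the n-th price block belong to the same record.

```python
from html.parser import HTMLParser

HTML = """
<div class="left">
  <div class='prod2'> <div class='name'>Dell Latitude D610-1.73 Laptop Wireless Computer </div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB </div>
  <div class='prod1'> <div class='name'>Samsung Chromebook (Wi-Fi, 11.6-Inch) </div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB </div>
  <div class='ads'>ADVERTISEMENT</div>
</div>
<div class="right">
  <div class='price2'>$239.95</div>
  <div class='price1 best'>$249.00</div>
  <div class='ads'></div>
</div>
"""


class BlockParser(HTMLParser):
    """Collects product blocks and price blocks separately, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []      # first class token of every currently open element
        self.products = []   # dicts with 'name' and 'description'
        self.prices = []     # price strings

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        first = classes[0] if classes else ""
        self.stack.append(first)
        if first.startswith("prod"):
            self.products.append({"name": "", "description": ""})
        elif first.startswith("price"):
            self.prices.append("")

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        current = self.stack[-1]
        parent = self.stack[-2] if len(self.stack) > 1 else ""
        if current == "name" and parent.startswith("prod"):
            self.products[-1]["name"] += text
        elif current.startswith("prod"):
            self.products[-1]["description"] += text
        elif current.startswith("price"):
            self.prices[-1] += text
        # anything else (e.g. the 'ads' blocks) is ignored


parser = BlockParser()
parser.feed(HTML)
parser.close()

# Pair by position: first product with first price, second with second, and so on.
records = [dict(product, price=price) for product, price in zip(parser.products, parser.prices)]

for record in records:
    print(record)
```

Because the 'ads' blocks are skipped on both sides, they do not shift the positions, so each price lands next to the name and description found at the same index.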
Content Grabber has again proved to be a highly visual scraper. Within minutes, I had used its point-and-click UI to fetch the Case 1 blocks layout. If you are not sure how to proceed, watch this 60-second Content Grabber video by Sequentum.
Case 2 turned out to be harder. I could point and click to select 2 sets, “Product & description” and “Prices”, yet in the resulting data set they did not match each other because of the poorly aligned HTML structure.
|Dell Latitude D610-1.73 Laptop Wireless Computer||2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional|
|Samsung Chromebook (Wi-Fi, 11.6-Inch)||1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome|
|Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)||2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion|
|Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)||2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit|
|HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)||2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit|
|ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)||1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit|
|$549.99 discount 7%|
Support quickly helped us: by default, data from multiple lists are added together, but you can choose to merge the data by setting the option “List Sub-Container Export Method” to “Merge”.
Here is how to set this option:
I appreciated such a visual yet all-inclusive scraper!
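The difference between the default "added together" export and the "Merge" setting can be pictured with two plain Python lists. This is an illustration under an assumption: that "Merge" combines the two lists row by row in order. The values are the two records from the Case 2 snippet at the top of this post.

```python
# Two lists selected separately, as in the point-and-click step described above.
products = [
    ("Dell Latitude D610-1.73 Laptop Wireless Computer",
     "2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB"),
    ("Samsung Chromebook (Wi-Fi, 11.6-Inch)",
     "1.7 GHz, 2 GB DDR3 SDRAM, 16 GB"),
]
prices = ["$239.95", "$249.00"]

# Default behaviour as described by support: rows of the second list are
# appended after the rows of the first, so prices never share a row with a name.
appended = [(name, desc, None) for name, desc in products] + \
           [(None, None, price) for price in prices]

# Assumed "Merge" behaviour: rows of both lists are combined by position.
merged = [(name, desc, price) for (name, desc), price in zip(products, prices)]

print(len(appended), "rows when appended")
print(len(merged), "rows when merged")
for row in merged:
    print(row)
```

In the appended form the prices trail after the product rows, which matches the mismatched result table shown earlier; in the merged form each record is complete.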
Within 3 minutes, I created an agent that extracts all the data on a page. As for the special records, I extracted the ‘discount’ text and the ‘best price’ value using XPath into additional custom fields:
XPaths for the best price values and discount text:
(For the XPath usage in Mozenda Agent builder, see a walkthrough.)
Then, using the Refine list action (right-click ‘Begin item list – <custom name>’ -> Refine list) like in the table scrape, I sifted records based on the presence/absence of certain values in those fields.
I tried to define the fields using XPath, with only partial success. These are the XPaths:
Price values are only loosely associated with Name and Description (they sit in different DOM containers). Therefore, I couldn’t define any custom XPath to grab them.
In my view, Mozenda does offer both XPath and Regex functionality, but not for such hard cases.
The general gathering (all records) went smoothly: I defined the fields in extraction patterns and used a transformation script to retrieve the Description field. As in the Mozenda test, I retrieved the values for best price and discount into additional fields and applied filtering based on their content. Filtering is available through Results->Filter->Edit or by pressing the button. See the image below:
According to support, Case 2 is not a real case, so it is not a subject to test for Web Content Extractor.
Screen Scraper forces you to write regex for extraction patterns and sub-patterns (similar to composing a scraper in OutWit). Each pattern is a small task in itself, and if the next case/page has a different HTML structure, I need to redo the patterns. For this case, it took me over an hour to figure out the match patterns. First, I fetched the whole record with this pattern:
Then, within this DATARECORD, I apply sub-patterns to extract Comp_name, Description and Price respectively:
</div>~@Description@~<span style='float: right'
Since Case 2 closely follows Case 1, some of its records were also fetched with these extraction patterns, but they can be omitted at the data export stage. Extracting the special records (discount, Best Price) is relatively easy, since all records of each type share the same HTML code. Best price pattern:
<div class='name'>~@Name@~</div> ~@Descrition@~<span style='float: right' class='best'>~@Best_Price@~</span>
Discount price pattern:
<div class='name'>~@Name@~</div> ~@Description@~<span style='float: right'>~@Dicsount_Price@~</span> <div class="disc">~@Discount_value@~</div>
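For readers more at home with ordinary regular expressions, the token patterns above map quite directly onto named groups. The following Python sketch shows that mapping for the best price pattern; it is not Screen Scraper's engine. The helper pattern_to_regex is hypothetical, token names are spelled out in full, and the sample markup is an assumption reconstructed from the pattern itself, since the Case 1 page source is not reproduced here.

```python
import re


def pattern_to_regex(pattern):
    """Turn literal HTML with ~@TOKEN@~ placeholders into a regex with named groups."""
    # re.split with a capturing group alternates: literal, token, literal, token, ...
    parts = re.split(r"~@(\w+)@~", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2:
            # Odd positions hold token names: capture lazily up to the next literal.
            pieces.append(f"(?P<{part}>.*?)")
        else:
            # Even positions hold literal HTML: escape it, but let any whitespace
            # in the pattern match any amount of whitespace in the page.
            pieces.append(r"\s*".join(re.escape(chunk) for chunk in re.split(r"\s+", part)))
    return re.compile("".join(pieces), re.DOTALL)


BEST_PRICE_PATTERN = (
    "<div class='name'>~@Name@~</div> ~@Description@~"
    "<span style='float: right' class='best'>~@Best_Price@~</span>"
)

# Assumed sample markup, shaped after the pattern above.
SAMPLE = (
    "<div class='name'>Samsung Chromebook (Wi-Fi, 11.6-Inch)</div> "
    "1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome"
    "<span style='float: right' class='best'>$249.00</span>"
)

regex = pattern_to_regex(BEST_PRICE_PATTERN)
match = regex.search(SAMPLE)
fields = match.groupdict() if match else {}
print(fields)
```

As with the original patterns, this only works while the literal HTML between the tokens stays exactly the same, which is why a page with a different structure needs its patterns redone.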
As for this loosely associated HTML layout, I failed to compose extraction patterns.
The Case 1 layout seems to be very natural to this scraper. Within minutes, I built the extraction project, including a transformation script to separate the description part from the title. The scraper has really lived up to its name as an easy extractor. For the special records extraction, I reassigned the first record to be the best price record in order to specify some additional record features. Additionally, I defined the “BEST PRICE” field. For this, I checked the corresponding box in the HTML DOM tree: see the screenshot at right. I also specified it as a required field with text containing ‘BEST’ on the ‘Other Extract Clue’ tab. The extraction went smoothly; the scraper automatically found similar links (crawling rules stage) and extracted data from all of them.
In this case, I really studied the DOM structure to detect and select the corresponding fields. Yet I failed at choosing the very first record, since the scraper gives no way to select, for one record, separate parts from 2 different DIV containers. See the picture. The scraper is very adept at extracting straightforwardly structured data, but not data that is only loosely associated.
This TEST DRIVE stage on the Blocks layout has shown the general ability of scrapers to work with this layout, blocks now being a common HTML layout for presenting structured data on websites. All of the scrapers did pretty well at selecting marked records from the data array. Composing a project/agent in the simple visual way took me a minimal amount of time, while the ‘big boys’ forced me into long trials and, in some cases, to turn to support for help. The complicated Case 2 revealed which scrapers have advanced capabilities: Content Grabber (an excellent, highly visual solution), Visual Web Ripper and WebSundew Extractor all passed this test. VWR turned out to have additional functionality that could be applied to this case. I believe the results for the Blocks layout test case are of use in choosing an off-the-shelf web scraping product. Good luck!