Helium Scraper is a visual data extraction tool that stands alongside other web scraping software. Its search algorithm associates the elements to be extracted by their HTML properties, which differs from the usual extraction methods of web scrapers. This works well in cases where the structural association between elements is weak: for example, when scraping search engine results, it's not easy to get the needed info using only XPath or regexes. The scraper also facilitates extraction and manipulation of more complex information with the aid of JavaScript and SQL scripts, and it's exceptionally good at visually inner-joining multi-level data structures.
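The idea of associating elements by their HTML properties (rather than by an XPath that pins down a position in the page) can be sketched in plain JavaScript. The data shapes below are purely illustrative assumptions, not Helium Scraper's internal representation:

```javascript
// A "kind" as a set of required properties: an element belongs to the
// kind when it has every property the kind requires (hypothetical shapes).
const kind = { tag: "a", class: "result-title" };

const elements = [
  { tag: "a", class: "result-title", href: "/item/1" },
  { tag: "a", class: "nav-link", href: "/page/2" },
  { tag: "span", class: "result-title" },
];

// An element matches if every kind property is present with the same value.
const matchesKind = (el, kind) =>
  Object.entries(kind).every(([key, value]) => el[key] === value);

const matched = elements.filter((el) => matchesKind(el, kind));
console.log(matched.length); // 1
```

Because matching is by shared properties rather than by position, the same kind keeps working even when the elements move around in the page layout.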
Overview
Helium Scraper will impress you with its simplicity and robust operation. One of its outstanding features is the easy-to-set-up multi-level extraction, whose data can then be turned into related tables (watch a video); this involves running SQL queries against the extracted data. It looks awesome! The only weak link of Helium Scraper is verifying that a "kind" is properly defined: after creating a kind, one often needs to check how it relates to the data targeted for extraction.
Workflow
The object model in Helium Scraper consists of:
- Kind – associated set of webpage items for extraction
- Action – the action to be performed upon pages or kinds
Work in Helium Scraper starts with browsing to a target page and defining one or more kinds. Activate the selection mode by clicking the Selection Mode button, then CTRL-click two or three of the target items to highlight them. Choose the Kinds tab in the bottom-left area, click Create Kind From Selection, and name the kind. The kind's dropdown list shows the properties common to the elements associated with this kind; these properties are what the scraper's search algorithm uses to find matching elements.
If you click the Select Kind in Browser button, every element on the current page that shares those properties will be selected.
Now, the important thing is to verify that the kind you've made really is the kind you intended. Uncheck Selection Mode and navigate to the next page (e.g. via the "next" link). There, click the Select Kind in Browser button to see what the program highlights. If the result fits your expectations, the kind is correct. Otherwise you've created a "something" kind: some of its properties belong exclusively to the element(s) on the original page and over-constrain the match. How to get out of the pit? Activate selection mode, select the equivalent element(s) on the current page, and click the Add Selection to This Kind button. The property list is then narrowed to the properties common to both sets of elements, so the scraper detects the kind more precisely. It may seem tiresome, but you need to check each kind for accuracy and fix any discrepancies. In this way you set up all the elements for the scrape, including the "Next" link used for navigation; check that one too.
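The narrowing step above — keeping only the properties common to both selections — can be sketched like this. Again, the data shapes are hypothetical; the real algorithm is internal to Helium Scraper:

```javascript
// Represent each element's properties as "key=value" strings (illustrative).
const propsOf = (el) =>
  new Set(Object.entries(el).map(([k, v]) => `${k}=${v}`));

// Keep only the properties shared by every selected element,
// mimicking what "Add Selection to This Kind" does to the kind.
function commonProperties(elements) {
  return elements
    .map(propsOf)
    .reduce((acc, props) => new Set([...acc].filter((p) => props.has(p))));
}

const firstPageSelection = [
  { tag: "a", class: "title", page: "1" },
  { tag: "a", class: "title", page: "1" },
];
const secondPageSelection = [{ tag: "a", class: "title", page: "2" }];

const narrowed = commonProperties([...firstPageSelection, ...secondPageSelection]);
console.log([...narrowed]); // [ 'tag=a', 'class=title' ]
```

The page-specific property (`page=1`) drops out of the intersection, which is exactly why adding a selection from a second page makes the kind generalize.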
Setting up Actions is simpler. Push the Actions button on the bottom-left panel and expand the Actions tree by clicking on it. Actions handle navigation, extraction, and other tasks, and they can be nested (an action may have child actions). Premade actions (online project templates) may be inserted into the tree (New Action – Execute Action Tree – More). Select New Action – Extract; you will be prompted to choose which kinds to extract, and the program creates a corresponding destination table for the output. To download pictures, set the kind's Property to "SrcAttribute" and check the Download option. For navigation to the next page, define a Navigation action using the "next" kind you defined earlier. The number of iterations is set manually, according to how many pages you want to extract: click the "Repeat … times" root action and change the iteration count to the required number.
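Conceptually, the tree you build in the UI is a nested structure with a "Repeat" root and Extract/Navigate children. A minimal sketch of how such a tree could be walked (illustrative names only, not Helium Scraper's project format):

```javascript
// State mutated by the actions (stand-ins for real extraction/navigation).
const rows = [];
let page = 1;

// A "Repeat 3 times" root with an Extract and a Navigate child action.
const actionTree = {
  type: "repeat",
  times: 3,
  children: [
    { type: "extract", run: () => rows.push(`row from page ${page}`) },
    { type: "navigate", run: () => { page += 1; } },
  ],
};

// Walk the tree: repeat nodes loop over their children, leaves just run.
function runAction(action) {
  if (action.type === "repeat") {
    for (let i = 0; i < action.times; i++) action.children.forEach(runAction);
  } else {
    action.run();
  }
}

runAction(actionTree);
console.log(rows.length); // 3
```

Each pass of the repeat extracts from the current page and then navigates, which is why the iteration count maps directly to the number of pages scraped.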
For advanced use, you may insert premade JavaScript actions (Actions – Execute Action Tree – More) or add your own scripts (Actions – Execute JavaScript) to the Action tree.
Now, run the project. On the Action bar there is a Quick Data View button. Drag any Extract action into the SQL script area and the program will generate an SQL script that selects from the corresponding table (see the picture above). The following picture shows the resulting table with kinds as column headers.
For data export, turn to the database panel (push the Database button on the bottom-left panel). Here you may export to CSV, MySQL, or a customized XML format, as well as create or append to an MS Access database.
You may run SQL queries against the scraped data, browse the tables, or make customized templates for export. The scraper provides a means of visually inner-joining multi-level data structures.
Multi-join is easy: after the data has been extracted into tables, go to the Queries tab in the database panel, write a query, and run it. The following picture shows how to run an SQL query that joins two extracted tables:
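What such an inner join does can be written out in plain JavaScript over two hypothetical extracted tables (the table and column names below are made up for illustration):

```javascript
// Two hypothetical tables as produced by two Extract actions.
const products = [
  { id: 1, name: "Lamp" },
  { id: 2, name: "Chair" },
];
const prices = [
  { productId: 1, price: 20 },
  { productId: 2, price: 55 },
  { productId: 3, price: 10 }, // no matching product: dropped by the inner join
];

// Equivalent in spirit to:
//   SELECT p.name, pr.price FROM products p
//   INNER JOIN prices pr ON pr.productId = p.id
const joined = prices.flatMap((pr) => {
  const product = products.find((p) => p.id === pr.productId);
  return product ? [{ name: product.name, price: pr.price }] : [];
});

console.log(joined.length); // 2
```

Rows without a match on the join key simply disappear from the result, which is the defining behavior of an inner join versus an outer one.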
Scheduling is possible by running Helium Scraper from the command line via the Windows Task Scheduler. Proxies are also supported, and you may use JavaScript in the SQL editor.
Summary
Helium Scraper is a well-developed data extractor suited to multiple purposes. It can handle less-structured data thanks to its property-association algorithm. Export to multiple formats is well arranged, and access to online pre-built templates is another remarkable feature.