WebSundew is a visual scraping tool for extracting structured data. The screen scraper is designed for high productivity and fast data extraction. The Enterprise edition lets a scrape run on a remote server and publish the extracted data over FTP.
Overview
WebSundew lets you handle web content without writing scripts. It is well suited to daily use and has a rich feature set covering most kinds of webpage scraping. The support team offers a Free Extraction Project: after you submit the details of the data you want scraped, they will contact you in 2 days with your project ready. WebSundew can be integrated with other programs by running an agent from the command line with parameters.
Workflow
Work with the screen scraper starts with creating a new project. A Project consists of Agents that crawl through webpages (states), Data Patterns that designate the data to scrape, [Next] Page Patterns that follow the links to the next page, and Captures that modify the captured data. A Data source defines the output data formats. The picture (at right) displays the typical layout for a project, with red arrows marking its basic components.
To create a new agent, navigate to the webpage you want to start extracting from and click ‘New Agent’. To visualize the crawl path, the Agent Diagram is shown at the left of the browser area:
Click the ‘Capture’ button to create a Data Extraction Pattern and complete the selection with the ‘Finish’ button:
Then define a Data Iterator Pattern. After clicking an item and adding it in the Wizard, choose among the patterns the program suggests (it shows the number of rows for each) the one that returns the data you want from this page:
If any Ajax/JS elements are present, you can set a load delay for them: right-click the element in the Agent Diagram and choose Add – Pause in the pop-up menu. If needed, you can pass input parameters to an Agent, including a file with parameters. The program reads the parameters from the file and runs the agent once for each set of parameters.
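The "one run per set of parameters" behaviour can be pictured with a short Python sketch. It is only an illustration of the idea, not WebSundew's file format or API: the CSV layout, the column names `keyword` and `city`, and the `run_agent` function are all hypothetical stand-ins.

```python
import csv
import io

# Hypothetical parameter file: one row per agent run. The column names
# are invented for this sketch and are not WebSundew's actual format.
PARAMS_CSV = """keyword,city
laptop,Boston
phone,Denver
"""


def run_agent(params):
    """Stand-in for a single agent run; a real run would scrape with these inputs."""
    return f"run with keyword={params['keyword']} city={params['city']}"


def run_for_each_parameter_set(text):
    """Read parameter sets from CSV text and run the agent once per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [run_agent(row) for row in reader]


runs = run_for_each_parameter_set(PARAMS_CSV)
for line in runs:
    print(line)
```

Each row in the file produces exactly one run, so adding a search term means adding a line to the file rather than editing the agent.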
Navigation across multiple pages is handled by the Paginator (in the Edit menu) through Next Page Patterns. Data is saved through the Data source, which would be better named ‘Data destination’.
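What a Next Page Pattern does can be sketched in a few lines of Python: collect the rows on a page, follow the "next" link, and stop when there is none. The `PAGES` dictionary below is a made-up, in-memory stand-in for a paginated site, and `crawl` is a hypothetical helper, not part of WebSundew.

```python
# Made-up site: each "page" holds its rows and the link to the next page.
PAGES = {
    "/list?page=1": {"rows": ["row 1", "row 2"], "next": "/list?page=2"},
    "/list?page=2": {"rows": ["row 3", "row 4"], "next": "/list?page=3"},
    "/list?page=3": {"rows": ["row 5"], "next": None},
}


def crawl(start, max_pages=10):
    """Follow next-page links from `start`, collecting rows along the way."""
    rows = []
    seen = set()
    url = start
    # Stop when there is no next link, when a link loops back to a page
    # already visited, or when the page limit is reached.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = PAGES[url]
        rows.extend(page["rows"])
        url = page["next"]
    return rows


all_rows = crawl("/list?page=1")
print(all_rows)
```

The `seen` set and the `max_pages` limit guard against a "next" link that points back to an earlier page, which would otherwise loop forever.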
Now just run the project. The developers recommend creating one agent per project. The trial version is limited to 100 rows per extraction. The scraping speed is not impressive, though it depends on the Internet connection at the time.
The scraper's features include NodePath (very similar to XPath), which is used to define data structures. Regular expressions can also be added to narrow down the data to be extracted.
Scheduling and Server mode are important features when regular and/or remote scraping is needed.
WebSundew gives the impression of a product still in development (testing). While I was learning to use it, some functions, such as the Paginator, worked with some difficulty.
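Since NodePath is described as very similar to XPath, the combination of a path expression and a regular expression can be illustrated with plain XPath in Python. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. It is not NodePath syntax, and the sample markup is invented.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup standing in for a product listing.
PAGE = """
<html><body>
  <div class="item"><span class="name">Widget</span><span class="price">Price: $19.99</span></div>
  <div class="item"><span class="name">Gadget</span><span class="price">Price: $5.00</span></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# The regex keeps only the numeric part of the captured price text.
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")

rows = []
# The path expression selects each repeating record (the "data structure").
for item in root.findall(".//div[@class='item']"):
    name = item.find("span[@class='name']").text
    raw_price = item.find("span[@class='price']").text
    match = PRICE_RE.search(raw_price)
    rows.append((name, float(match.group(1)) if match else None))

print(rows)
```

The path expression decides which nodes are captured, and the regular expression then trims each captured string down to the value of interest.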
Summary
On the whole, this data extractor is quick to learn, and it is a multi-functional tool that includes Server mode, scheduling, publishing of results over FTP and support for database formats. Command line mode (passing parameters to an Agent from a file) helps you integrate WebSundew with other software. An API is available for advanced use of the scraper. Text recognition (CAPTCHAs), ICQ messaging, SMS notifications and HTTP requests are available only in the Pro and Enterprise editions.