OutWit Hub is a software application that provides simple data extraction without requiring any programming skills or advanced technical knowledge. What impressed me about OutWit Hub is its general approach to data gathering: harvest everything (links, text, images, etc.) and then let the user choose what is needed (sifting it with scrapers). The program can automatically browse through the links on pages, which works well when scraping across chains of pages is required. UPDATE: OutWit Hub 4.0 is released! [box style="blue rounded info"]We have negotiated a good price for our readers![/box]
Simple Catch for Simple Web
OutWit Hub is a multi-purpose application under development. I won't focus on the developers' side of the software platform (read the overview here), but rather touch on points of practical use by a typical customer. The OutWit Hub interface looks like a web browser with additional data views, a side panel and automation tool buttons. As soon as it launches, the product starts grabbing stuff from the web, with the corresponding views being generated: Links, Documents (Pro version), Emails, Images and Data sub-views (tables, lists, even guess), etc., from which I could easily export data or selectively put items into the Catch. OutWit Hub incorporates:
- a web browser
- a data extractor – the hub dissects the page to automatically extract its information and images. Unlike other extractors, this one downloads the images themselves, instead of only copying their links.
- a scrape engine (automator) that automatically browses through a series of web pages, performing an extraction on each
The Log panel, Details panel and Catch collection basket are interesting features of this product. Both the Log and the Catch are hideable. In the Catch panel, Rating (R) and Priority (P) columns are available, and the corresponding values are user-defined (through sliders at the bottom) for custom use.
Grab & Export Web Content
OutWit is a handy tool for consistent scraping of documents, links, images and more. I put a target page link into the browser window, clicked through to the data -> tables view, and all the needed information appeared in table format. I then caught some of the rows, chose Export -> Excel in the Catch area, and voilà, the data were in an Excel table. Altogether, it took me less than a minute. In a similar way, I can easily extract links, images, email addresses, RSS news, data tables, etc. from a page or a series of pages. OutWit performs perfectly in automated navigation through series of related pages, especially "Next page" navigation (see 'Navigation overview' in OutWit Hub Help). Extracted data can be exported to CSV, HTML, Excel or SQL scripts for saving in databases, while images and documents are saved directly to my hard disk.
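For readers who prefer code, the table-to-spreadsheet flow above can be approximated in a few lines of Python. This is my own sketch using only the standard library, not anything OutWit ships: it pulls the cell text out of an HTML table and writes the rows as CSV, roughly what the tables view plus export does in one click.

```python
# A rough analogue of OutWit's tables view + CSV export (my own sketch,
# not OutWit code): collect <td>/<th> cell text row by row, then write CSV.
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect rows of cell text from the tables in an HTML page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

# A tiny stand-in page; in practice the HTML would come from the browser view.
html_page = ("<table><tr><th>Name</th><th>Price</th></tr>"
             "<tr><td>Widget</td><td>9.99</td></tr></table>")
parser = TableExtractor()
parser.feed(html_page)

buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

The point is only to show how little machinery the "grab a table, export it" workflow really involves; OutWit's value is doing this (and the Excel/SQL variants) without writing any of it.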
Light and Pro Versions
The Light version is limited to 100 links/records and cannot scrape documents. Yet it allows an unlimited number of pictures for extraction. The Pro version unlocks all the latest features of OutWit, such as:
- Get documents to download: .doc, .pdf, .xls, .rtf, .ppt
- Get the recurrent words and groups of words extracted from web pages
- Grab emails, RSS feeds, etc.
- Set up multiple scrapers
- Macro automation
- Scheduling through Jobs
For all the OutWit Hub features, refer here. If you are solely focused on image or document extraction, I recommend the single-purpose releases, OutWit Images and OutWit Docs. Those releases are still in beta and are available only as Firefox extensions, rather than standalone apps.
Scrape Up
An OutWit scraper is just a template instructing the Hub how to extract data from a page. It's suited for unstructured data on pages that do not fit into the table or list categories. A scraper is defined as a list of fields of the records I want to extract from a page. For each field, I specify the name of the field, the strings (markers) surrounding the data to extract and, if needed, the data format (regex). It's intuitive and simple: to enter Scraper Editing mode, go to the automators -> scrapers view and click the 'New' button to create a new instance. There I manually identify the markers and delimiters (works only in Pro), based on the source code of the current page: just copy and paste the HTML elements that bracket the target data. It feels tedious compared to modern software UIs. If an advanced scraper is needed, the tutorial (or Tools -> Tutorials) will guide you through all the steps of creating it. If the requested data are in an HTML table, building the scraper is easier. If you want to make your scraper dynamic, go to the 'Source type' options in the scraper view and choose "Dynamic" from the drop-down list. Such a scraper will extract dynamic Ajax-page contents, but not in 'Fast scrape' mode; there you are limited to browsing pages manually or using the Browse or Dig buttons from the Navigation menu.
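The marker-and-format idea behind an OutWit scraper is easy to picture in code. Here is my own minimal Python illustration (the function name and sample markup are mine, not OutWit's): each field is defined by the strings that bracket the target data, plus an optional regex that the extracted value must match.

```python
# Marker-based extraction in the spirit of an OutWit scraper field (my own
# illustration): 'before' and 'after' are the bracketing markers, and 'fmt'
# plays the role of the optional format regex.
import re

def scrape_field(html, before, after, fmt=None):
    """Return every string found between the `before` and `after` markers."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    values = re.findall(pattern, html, flags=re.DOTALL)
    if fmt:  # keep only values matching the format regex, if one is given
        values = [v for v in values if re.fullmatch(fmt, v)]
    return values

# Hypothetical page fragment with two bracketed prices:
page = '<span class="price">19.90</span> ... <span class="price">7.50</span>'
prices = scrape_field(page, '<span class="price">', "</span>", r"\d+\.\d{2}")
print(prices)  # ['19.90', '7.50']
```

Copy-pasting the bracketing HTML into the scraper editor is exactly the step of choosing `before` and `after` here, which is why the process works on unstructured pages where no table or list view applies.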
Fast Scrape Mode for Casual Scraping of Multiple Links
Select the URLs to scrape (they should all be in the same column). Then right-click on them and, in the contextual menu, select 'Auto-Explore Selected Links' -> 'Fast Scrape' and choose the scraper to apply. In this mode, the program runs an XML HTTP request for each URL but doesn't actually load the pages (ignoring images, etc.), relieving the engine load. 'Fast Scrape' is simply faster than browsing through the URLs and scraping page after page. A problem may appear, since dynamic page elements (JavaScript, for example) are ignored, so scraper behavior may go wild or simply hang, as happened on my first trial. If this happens, the developers recommend quitting this mode and browsing the pages manually to scrape them.
Scraping ethics and the threat of IP banning both argue against overloading target servers, so, especially in 'Fast Scrape' mode, you need to set a time delay (temporization) between requests. Go to the Tools -> Preferences -> Time Settings tab to set it up. I think a 2-second delay is reasonable if you want to imitate human behavior.
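The fetch-scrape-pause loop behind 'Fast Scrape' with temporization can be sketched in a few lines. This is my own analogue, not OutWit internals; the stand-in `fetch` and scraper function are hypothetical so the example runs without a network connection.

```python
# A minimal sketch of 'Fast Scrape' with temporization (my own analogue):
# fetch each URL in turn, apply a scraper function, and sleep between
# requests so the target server isn't hammered.
import time

def fast_scrape(urls, fetch, scraper, delay=2.0):
    """Fetch and scrape each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            time.sleep(delay)
        results.append(scraper(fetch(url)))
    return results

# Stand-in fetch and scraper so the sketch runs offline; in practice `fetch`
# would be something like urllib.request.urlopen(url).read().decode().
fake_pages = {"http://example.com/1": "<b>A</b>", "http://example.com/2": "<b>B</b>"}
out = fast_scrape(list(fake_pages), fake_pages.get,
                  lambda html: html.strip("<b></b>"), delay=0.1)
print(out)  # ['A', 'B']
```

The `delay` parameter is exactly what the Time Settings tab controls; with the 2-second default above, scraping n pages takes roughly 2(n-1) seconds of deliberate waiting, which is the price of staying polite.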
Exporting scrape results is great, almost interactive. Extracted data can be exported to CSV (TSV), HTML or Excel, or turned into SQL scripts for saving in databases.
The multi-threading feature I asked about (a request the developers replied to quickly) seems of little value, since the crawl/scrape engine is always on, unceasingly doing the tough job of searching and downloading. A comprehensive help center is also available; the documentation covers all of OutWit Hub's present features, though many views and functions described there are only available in the Pro version.
The Jobs configuration supports invoking macros on a schedule (only in Hub Pro), a macro being a scraper or any other view wrapper that adds output functionality. In the current version (3.0), a macro is the only type of action that can be programmed in Jobs. To schedule the data extraction process, you need to:
- Create a scraper (see the Scrape Up section)
- Set up a macro to automate it (automators -> macros view, and click the 'New' button). Check-mark 'scraped' to extract according to a certain scraper, and use the drop-down arrow to choose one scraper from the set; see the bottom of the picture:
- Define a job that will schedule the macro execution with flexible time boundaries (automators -> jobs view, and click the 'New' button):
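The three steps above boil down to "run a macro repeatedly inside a time window". As a miniature illustration (my own sketch, nothing to do with OutWit's actual scheduler), a job can be thought of as a loop like this, where the "macro" is just a function:

```python
# The Jobs idea in miniature (my own sketch, not OutWit's scheduler):
# invoke a "macro" (here, any function) every few seconds while the
# current time stays within the [start, end] window.
import datetime
import time

def run_job(macro, start, end, every_seconds):
    """Call `macro` every `every_seconds` while now is within [start, end]."""
    while True:
        now = datetime.datetime.now()
        if now >= end:
            break
        if now >= start:
            macro()
        time.sleep(every_seconds)

# Tiny demo window so the sketch finishes quickly: run for ~0.3 seconds.
calls = []
now = datetime.datetime.now()
run_job(lambda: calls.append(1), now,
        now + datetime.timedelta(seconds=0.3), every_seconds=0.1)
print(f"macro ran {len(calls)} times")
```

In Hub Pro the macro would be the scraper-wrapping automation defined in step two, and the window and period come from the job's time-boundary settings rather than hard-coded values.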
More to Develop
The developers encourage advanced users and programmers to contribute to the program's features, hoping to cover a wide spectrum of demands. So far, only basic web data extraction features are built into OutWit Hub. The technology has already allowed the building of simple, single-function applications such as the most recent releases, OutWit Images and OutWit Docs, now in beta testing.
This product impresses with its simplicity and straightforwardness, but so far it's not for high-difficulty scraping such as automated submission of web forms, IP rotation (proxying), bypassing CAPTCHAs, etc. The developers promise to add more, so we anticipate the expansion of its capabilities.