Web Content Extractor (WCE) is a simple user-oriented application that scrapes web pages and parses data from them. This program is easy to use, but it gives you the ability to save every single project for future (daily) use. The trial version will work with only 150 records per scrape project. As far as exporting and putting data into different formats, Web Content Extractor is excellent for grouping data into Excel, text, HTML formats, MS Access DB, SQL Script File, MySQL Script File, XML file, HTTP submit form and ODBC Data source. Being a simple and user friendly application, it steadily grows in practical functionality for complex scrape cases.
Let’s see how to scrape data from londonstockexchange.com using Web Content Extractor. First, you need to open the starting page in the internal browser:
Then, you need to define “crawling rules” in order to iterate through all the records in the stock table:
Also, as you need to process all the pages, set the scraper to follow the “Next” link on every page:
After that, drill down into each stock table row and extract information from the “Summary” section. This is done by defining an “Extraction pattern” for getting data fields:
And finally, when you’re done with all the rules and patterns, run the web scraping session. You may track the scraped data at the bottom:
As soon as you get all the web data scraped, export it into the desired destination:
Dynamic Elements Extraction
Scraping of dynamic web page elements (like popups and ajax-driven web snippets) is not an easy task. Originally, we didn’t consider Web Content Extractor to be able to break through here, but with transformation URL script, it’s possible. Thanks to the Newprosoft support center, we got help with crawling over popups on a certain web page.
Go to Project->Properties->Crawling Rules->URL Transformation Script, where you may compose a script that will change casual crawl behavior into a customized one:
Here is an example of such an URL transformation script for scraping popups from http://www.fox.com/schedule/ (the script was composed by a Newprosoft specialist, so in difficult cases you’ll need to turn to them):
Multi-Threading and More
As far as multi-threading, the Web Content Extractor sends several server requests at the same time (up to 20), but remember that each session runs with only one extraction pattern. Filtering helps with sifting through the results.
The Web Content Extractor is a tool to get the data you need in “5 clicks” (the example task we completed within 15 minutes). It works well if you scrape simple pages with minimum complications for your private or small enterprise purposes.