Data Scraping Studio (DSS) is a new, free, multithreaded tool for data extraction. It consists of two parts: (1) a Google Chrome extension with a point-and-click interface for setting up a web scraping agent and (2) a desktop app for executing scraping agents.
We had seen lots of scraping tools on the market when we needed one for a shopping comparison website. The need was to scrape eCommerce prices from multiple sites and aggregate them. We couldn’t find a tool that made creating a scraping robot as easy as “point and click”. The other criterion was scalability: the tool had to be able to process a million pages per hour in parallel…
Data Scraping Studio was designed with all of the above requirements in mind. Here’s a first look at the product.
Create an agent in the browser extension
Install the point-and-click Chrome extension to create web scraping agents. Creating a robot is easy:
- Navigate to the target website
- Use the point-&-click UI to select HTML elements you want to extract
- The agent generates CSS selectors for those elements
- Deselect any elements (highlighted in red) that you don’t want to extract
- The CSS selectors are adjusted automatically to refine the selection
The extract preview displays matching results and additional options for detailed extraction:
- TEXT: extracts the text inside an HTML tag
- ATTR: extracts an attribute of an HTML tag (e.g. data-id, src, href, or the content of a description meta tag)
- HTML: extracts the HTML markup itself
- REGEX: extracts whatever matches a regular expression pattern
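To make the four options more concrete, here is a minimal Python sketch of what each one would return for a single element. It is not how DSS works internally; the HTML snippet, the variable names, and the use of the standard library's `xml.etree.ElementTree` (which only handles well-formed markup) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product snippet used only for illustration.
SNIPPET = (
    '<div class="product" data-id="42">'
    '<a href="/item/42">Blue <b>Mug</b></a>'
    '<span class="price">Price: $12.50</span>'
    '</div>'
)

root = ET.fromstring(SNIPPET)
link = root.find("a")
price = root.find("span[@class='price']")

# TEXT: the text inside the tag, with nested markup stripped.
text_value = "".join(link.itertext())

# ATTR: the value of one attribute on the tag.
attr_value = link.get("href")

# HTML: the markup of the selected element itself.
html_value = ET.tostring(link, encoding="unicode", method="html")

# REGEX: a pattern applied to the element's text.
match = re.search(r"\d+\.\d{2}", "".join(price.itertext()))
regex_value = match.group(0) if match else None

for name, value in [("TEXT", text_value), ("ATTR", attr_value),
                    ("HTML", html_value), ("REGEX", regex_value)]:
    print(f"{name}: {value}")
```

Real pages are rarely well-formed XML, so a proper HTML parser would be needed outside of a toy example like this one.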
While you are still building the agent, its immediate output can be downloaded from the Chrome extension in CSV, TSV or JSON format.
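For readers unsure how the three formats differ, the sketch below serializes the same rows each way using Python's standard library. The rows and field names are invented for illustration, and the exact layout DSS produces may differ.

```python
import csv
import io
import json

# Hypothetical extracted rows; the field names are illustrative only.
ROWS = [
    {"title": "Blue Mug", "price": "12.50"},
    {"title": "Red Mug, large", "price": "14.00"},
]


def to_delimited(rows, delimiter):
    """Serialize rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(ROWS, ",")   # values containing commas get quoted
tsv_text = to_delimited(ROWS, "\t")  # tab-separated, so commas need no quoting
json_text = json.dumps(ROWS, indent=2)

print(csv_text)
print(tsv_text)
print(json_text)
```

TSV is often the safer choice for scraped product titles, since they tend to contain commas far more often than tabs.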
Once the agent is set up, save it by clicking the “Done” button. The resulting *.scraping file is saved to your local download directory.
Now you can execute this scraping file with Data Scraping Studio (the desktop app). The studio provides plenty of advanced features, such as:
- Batch URL crawling
- Large data extraction (hundreds to thousands of web pages)
- Agent run scheduling
- Input loading from CSV, TSV, JSON or from your web API
- Posting output to a server using webhooks: you just enter the URL of your webhook, and DSS sends the data in an HTTP POST request each time the scraping job completes. You can use Zapier.com for this.
- Proxying, crawling of password-protected websites, dynamic input loading, and other features
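If you would rather receive the webhook on your own server than through Zapier, the receiving end can be quite small. The following is a minimal sketch using only Python's standard library. It assumes the POST body is JSON, which the description above does not confirm, and the names `handle_body`, `WebhookHandler` and `make_server` are made up for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed payloads, kept in memory for this sketch


def handle_body(body):
    """Parse one webhook body and return the HTTP status to reply with."""
    try:
        payload = json.loads(body)  # assumption: the body is JSON
    except ValueError:
        return 400  # malformed body
    received.append(payload)
    return 204  # accepted, nothing to send back


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = handle_body(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=8000):
    """Build a server bound to localhost; call serve_forever() to start it."""
    return HTTPServer(("127.0.0.1", port), WebhookHandler)
```

Calling `make_server(8000).serve_forever()` would start listening on port 8000. To be reachable by the desktop app from another machine, the server would need to bind to a public address instead of `127.0.0.1`.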
Watch the videos to see all the DSS capabilities in action.
DSS is free software and does not include any proxies of its own, but you can add external proxies and configure them for automatic rotation.
Data Scraping Studio lets users scrape websites anonymously with the help of configurable, rotating HTTP proxy servers. To configure this feature, open a scraping agent for editing, go to the “Advance Setting” tab, and then click the “Proxy” tab. You may add a single proxy address or import a list of proxy addresses in bulk. Read here for more details.
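As a rough idea of what rotation means in practice, here is a minimal round-robin sketch in Python. It is not DSS's implementation: the rotation strategy, the one-address-per-line import format, the `ProxyRotator` class, and the placeholder addresses (taken from a reserved documentation range) are all assumptions.

```python
import itertools
import urllib.request

# Placeholder addresses from a documentation-only range; replace with real proxies.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def parse_proxy_list(text):
    """Parse a bulk list: one proxy address per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


class ProxyRotator:
    """Hand out proxies in round-robin order, one per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_opener(self):
        """Return the next proxy and a urllib opener that routes through it."""
        proxy = next(self._cycle)
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

Each request would then call `proxy, opener = rotator.next_opener()` followed by `opener.open(url, timeout=10)`, so consecutive requests leave through different addresses.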
The Studio can apply all of these features while running multiple web scraping jobs in parallel.
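For a feel of what running jobs in parallel involves, the sketch below fans a batch of URLs out over a thread pool in Python. The `run_job` function is a stand-in that does no network access, the URLs are invented, and nothing here reflects how the Studio schedules its own threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(url):
    """Stand-in for one scraping job; a real job would fetch and parse the page."""
    return {"url": url, "status": "done"}


def run_parallel(urls, max_workers=4):
    """Run one job per URL on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in the same order as the inputs.
        return list(pool.map(run_job, urls))


if __name__ == "__main__":
    batch = [f"https://example.com/page/{n}" for n in range(1, 6)]
    for result in run_parallel(batch):
        print(result["url"], result["status"])
```

Threads suit this kind of work because scraping jobs spend most of their time waiting on the network rather than on the CPU.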
Data Scraping Studio is new and impressive, though still under development. Advanced features (proxying, post-processing, etc.) still require some programming.
You may download it from here and find the docs and video tutorials here.
[box style=’info blue’]The Data Scraping Studio web platform (cloud) and its REST API will be released soon. Keep an eye out for the updates![/box]