Recently we encountered a new service that helps users scrape the modern Web 2.0. It’s a simple, approachable, easy-to-learn service – https://dataflowkit.com
Let’s first highlight some of its outstanding features:
- Visual online scraper tool: point, click, and extract.
- Open-source back-end
- Scraping websites behind a login form
- Web page interactions: Input, Click, Wait, Scroll, etc.
- Proxy support, including geo-targeted proxying
- Scraper API
- Respects robots.txt directives
- Export of results to Google Drive, Dropbox, MS OneDrive.
Ajax and JS evaluation
Once a task is created, new processes can be spawned from it. Click the Start/Stop buttons to launch or stop a process.
Detailed information is shown in blocks with the current status, start/finish time, requests/responses, and cost. The balance (remaining credits) is shown at the top right of the dashboard.
Proxies are included in the service, but there is no option to plug in external proxies.
Select the target country from 100+ supported global locations to send your web/SERP scraping API requests from.
Specify a proxy by appending a country ISO code to the country- value to send requests through a proxy in that country.
Use country-any to route requests through random geo-targets.
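The country- convention above can be wrapped in a small helper. This is a minimal sketch: the "proxy" field name in the payload is an illustrative assumption, not the official Dataflow Kit schema (see its OpenAPI spec for the real one).

```python
# Sketch of selecting a geo-targeted proxy for a scrape request.
# The payload field names are assumptions for illustration only.

def proxy_country(iso_code: str = "any") -> str:
    """Build a 'country-<ISO code>' proxy location value, e.g. 'country-de'."""
    return f"country-{iso_code.lower()}"

# Route the request through a German proxy:
payload = {"url": "https://example.com", "proxy": proxy_country("de")}

# Or through a random geo-target:
random_payload = {"url": "https://example.com", "proxy": proxy_country()}
```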
Import & export scrape projects
Saved projects can be reused: go to https://dataflowkit.com/collections
The service uses MongoDB as a central (intermediary) storage.
Users may choose one of the following formats to export data: CSV, JSON (Lines), Excel, XML. The service also allows converting the resulting JSON to the most popular RDBMS or cloud databases using the JSON2SQL tool – https://dbconvert.com/json-sql
After completing a scraping project, data may be downloaded from S3-compatible storage.
Optionally, users can upload scraped data to one of the most popular cloud storage providers, including Google Drive, Dropbox, and Microsoft OneDrive.
Some of the service’s extras
The service can internally scale the number of scraping/parsing requests to a website up and down.
Log in to website
Target website log-in has been implemented, but it is not publicly available yet.
Page into PDF
A URL-to-PDF service is available at https://dataflowkit.com/url-to-pdf
A URL-to-screenshot option is available at https://dataflowkit.com/url-to-screenshot
Here is a list of available actions:
“Input” action – performs search queries or fills in forms.
“Click” action – clicks an element on a web page.
Pagination & infinite scroll
Pagination is supported. The “Scroll” action automatically scrolls a page down to load more content.
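The Input, Click, and Scroll actions above can be composed into a single task description. This is a hypothetical sketch: the action and field names are illustrative assumptions, not the official Dataflow Kit schema.

```python
# Hedged sketch of composing page-interaction actions into a scrape task.
# Field names ("type", "selector", "times", etc.) are assumed for illustration.

def make_actions() -> list[dict]:
    return [
        # Fill a search form, then submit it:
        {"type": "input", "selector": "#search", "value": "web scraping"},
        {"type": "click", "selector": "button[type=submit]"},
        # Scroll down three times to trigger infinite-scroll loading:
        {"type": "scroll", "times": 3},
    ]

task = {"url": "https://example.com", "actions": make_actions()}
```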
The DFK API is easily integrated with web applications using your favorite framework or language, including cURL, Go, Node.js, Python, and PHP. The Dataflow Kit OpenAPI (Swagger) specification is available at https://dataflowkit.com/open-api
It only takes a few minutes to start using the API at scale with the visual code generators available on the service pages.
You can generate ready-to-run code for your preferred language in no time.
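For orientation, a call from Python might look like the sketch below. The endpoint path, query parameter, and payload fields are assumptions made for illustration; consult the OpenAPI spec at https://dataflowkit.com/open-api for the actual schema.

```python
# Minimal sketch of calling a Dataflow Kit style HTTP API from Python.
# The endpoint URL and payload shape below are ASSUMED, not taken from the
# official documentation.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # issued after sign-up

def build_fetch_request(url: str) -> urllib.request.Request:
    """Build a POST request asking the service to fetch a target page."""
    body = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        f"https://api.dataflowkit.com/v1/fetch?api_key={API_KEY}",  # assumed endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it would then be:
#   with urllib.request.urlopen(build_fetch_request("https://example.com")) as r:
#       page = r.read()
```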
CAPTCHA solving is not supported, and neither are scheduling or multiple users. Image and file scraping are not supported. Iterating drop-down options and OCR are not supported either, though few services support them.
Multiple inputs are partially supported.
Requests to web pages for data extraction are measured in CREDITS. After sign-up, you are granted 1,000 free credits, equal to €5, for evaluation and testing. See the following table to estimate the cost by scrape type:
| Successful request | Without proxy | Using proxy | Notes |
|---|---|---|---|
| 1 regular page | 1 credit | 2 credits | Regular pages are fetched "as is" using basic HTTP requests |
| 1 SERP page | - | 3 credits | Headless Chrome and a proxy are always used for search engine data requests |
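The table above makes cost estimation simple arithmetic: since 1,000 credits cost €5, one credit is €0.005. A small estimator:

```python
# Credit cost estimator based on the pricing table above:
# regular page = 1 credit (2 with proxy); SERP page = 3 credits (proxy always used).

def estimate_credits(regular_pages: int = 0, serp_pages: int = 0,
                     use_proxy: bool = False) -> int:
    per_regular = 2 if use_proxy else 1
    return regular_pages * per_regular + serp_pages * 3

credits = estimate_credits(regular_pages=400, serp_pages=100, use_proxy=True)
euros = credits * 5 / 1000
# 400 * 2 + 100 * 3 = 1100 credits, i.e. €5.50
```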
We find the service an easy online way to learn to use a scraping tool, with many of the basic functions of a scraping suite. The service’s cost is calibrated to the kind of web pages a user wants to scrape: static web pages, JS-rendered pages, and SERPs.
Watch a short video of a DataFlowKit scraper creation: