Categories
Development, Web Scraping Software

My experience choosing a web scraping platform for a company-critical data feed

Recently we engaged with an online e-commerce startup that needed government tenders/RFPs scraped. Since the project size is immense, we had to switch from hand-made scripted extractors to an enterprise-grade scraping platform. Below I share my experience with these scraping platforms as a feature table.

| Service | Residential | Cost/month | Traffic/month | $ per GB | Rotating | IP whitelisting | Performance and more | Notes |
|---|---|---|---|---|---|---|---|---|
| MarsProxies | | N/A | N/A | 3.5 | yes | yes | 500K+ IPs, 190+ locations; SOCKS5 supported; test results | Proxy grey-zone restrictions |
| Oxylabs.io | | N/A | 25 GB | 9-12 ("pay-as-you-go": 15) | yes | yes | 100M+ IPs, 192 countries; 30K requests, 1.3 GB of data, 5K pages crawled | Does not allow scraping some grey-zone targets, incl. LinkedIn |
| Smartproxy | | N/A (link to the price page) | N/A | 5.2-7 ("pay-as-you-go": 8.5) | yes | yes | 65M+ IPs, 195+ countries; free trial | Does not allow scraping some grey-zone targets, incl. LinkedIn |
| Infatica.io | | N/A | N/A | 3-6.5 ("pay-as-you-go": 8) | yes | yes | Over 95% success rate; Cloudflare bans are also few, less than 5% | Blacklist of sites the proxies do not work with; 1000 ports per proxy list; up to 20 proxy lists at a time; usable via API tool; ISP-level targeting; rotation time selection |
| Mango Proxy | | N/A | 1-50 GB | 3-8 ("pay-as-you-go": 8) | yes | yes | 90M+ IPs, 240+ countries | |
| IPRoyal | | N/A | N/A | 4.55 | yes | yes | 32M+ IPs, 195 countries | Does not allow scraping some grey-zone targets, incl. Facebook; publishes a list of blocked sites |
| Rainproxy.io | yes | $4 | from 1 GB | 4 | yes | | | |
| BrightData | yes | | | 15 | | | | |
| ScrapeOps Proxy Aggregator | yes | API credits per month | N/A | N/A | yes | | All-in-one proxy API that lets you use 20+ proxy providers from a single API | Allows multithreading (min. 5 threads, depending on the subscription); provides cloud browsers on its servers that can be driven from a local machine |
| Lunaproxy.com | yes | from $15 | x GB per 90 days | 0.85-5 | | | | Each plan allows a certain traffic amount within a 90-day limit |
| LiveProxies.io | yes | from $45 | 4-50 GB | 5-12 | yes | yes | | E.g. 200 IPs with 4 GB for $70.00, 30-day limit |
| Charity Engine (docs) | yes | - | - | starting from 3.6 | | | CPU computing from $0.01 per avg CPU core-hour; from $0.10 per GPU-hour (source) | Failed to connect so far |
| proxy-sale.com | yes | from $17 | N/A | 3-6 ("pay-as-you-go": 7) | yes | yes | 10M+ IPs, 210+ countries | 30-day limit for a single proxy batch |
| Tabproxy.com | yes | from $15 | N/A | 0.8-3 (lowest price is for a 1000 GB chunk) | yes | yes | 200M+ IPs, 195 countries | 30-180-day limit for a single proxy batch (e.g. 5 GB) |
| proxy-seller.com | yes | N/A | N/A | 4.5-6 ("pay-as-you-go": 7) | yes | yes | 15M+ IPs, 220 countries | Up to 1000 proxy ports generated per proxy list; HTTP/SOCKS5 support; an unlimited number of proxies can be generated by assigning unique parameters to each list |
Categories
Development

Backconnect Proxy Service with authorization in JAVA

Working with a Backconnect proxy service (Oxylab.io) we spent a long time looking for a way to authorize it. Originally we used JSoup to get the web pages’ content. The proxy() method can be used there when setting up the connection, yet it only accepts the host and port, no authentication is possible. One of the options that we found, was the following:
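A minimal sketch of the usual java.net.Authenticator route: register the proxy credentials globally and keep JSoup's proxy(host, port) call unchanged. It assumes the jsoup library on the classpath; the host pr.oxylabs.io:7777, the credentials and the target URL are placeholders to replace with your own.

```java
import java.io.IOException;
import java.net.Authenticator;
import java.net.PasswordAuthentication;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BackconnectProxyExample {

    public static void main(String[] args) throws IOException {
        // Placeholder proxy coordinates and credentials; substitute the values
        // from your backconnect proxy provider's dashboard.
        String proxyHost = "pr.oxylabs.io";
        int proxyPort = 7777;
        String proxyUser = "USERNAME";
        String proxyPass = "PASSWORD";

        // Since JDK 8u111, Basic auth for HTTPS tunneling is disabled by default;
        // clearing this property re-enables it for the proxy CONNECT request.
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

        // Supply proxy credentials globally via the JDK's Authenticator,
        // because JSoup's proxy(host, port) has no authentication parameters.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(proxyUser, proxyPass.toCharArray());
                }
                return null;
            }
        });

        Document doc = Jsoup.connect("https://example.com/")
                .proxy(proxyHost, proxyPort)   // host and port only, no credentials here
                .userAgent("Mozilla/5.0")
                .get();

        System.out.println(doc.title());
    }
}
```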

 

Categories
Review

DataFlowKit review

Recently we encountered a new service that helps users scrape the modern Web 2.0. It’s a simple, comfortable, easy-to-learn service – https://dataflowkit.com
Let’s first highlight some of its outstanding features:

  1. Visual online scraper tool: point, click and extract.
  2. JavaScript rendering: any interactive site is scraped by headless Chrome running in the cloud
  3. Open-source back-end
  4. Scrape a website behind a login form
  5. Web page interactions: Input, Click, Wait, Scroll, etc.
  6. Proxy support, incl. Geo-target proxying
  7. Scraper API
  8. Follows the directives of robots.txt
  9. Export of results to Google Drive, Dropbox, MS OneDrive.
Categories
Review

Oxylabs.io at a glance

Oxylabs.io is an experienced player in the proxy market. In the past few years, they have significantly expanded their proxy pool.

Right now they have a residential proxy pool with over 60M IPs and over 2M datacenter proxies. Their residential proxies cover every country in the world (!) and offer city-level targeting. Oxylabs datacenter proxies come from 82 locations and feature 7850 subnets.

Oxylabs is mainly focused on businesses, and this is reflected in their product subscription plans. But recently they introduced a Fast-Checkout feature that lets customers purchase residential proxies in a few clicks. Together with a recently added smaller plan ($300/month for 20 GB of traffic), this makes Oxylabs much more attractive for smaller customers as well.

Categories
Review

Choosing affordable residential proxies for web scraping

Proxies are an integral part of most major web scraping and data mining projects. Without them, data collection becomes sloppy and biased. This is why it’s essential to know how to find the best affordable proxies for any web scraping project.

One of the best proxy types you could use for scraping is residential proxies. In this post, you’ll learn what they are, how they are priced and what to look for before committing your project’s budget.

Categories
SaaS

Web Page Change Tracking

Often, you want to detect changes in some eBay offerings or get notified of the latest items of interest from Craigslist in your area. Or you want to monitor updates on a website (your competitor’s, for example) where no RSS feed is available. How would you do it, by visiting it over and over again? No, there are now handy tools for website change monitoring. We’ve evaluated some of them and would like to recommend the most useful ones that will make your monitoring job easy. These tools nicely complement web scraping software, services and plugins.
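To make the idea concrete, here is a minimal do-it-yourself sketch (Java 17+) of what such tools automate: fetch the page, fingerprint it with a hash, and compare against the previous run. The URL and state-file path are placeholders; real monitoring services also normalize the page or watch only selected regions to avoid false alarms from ads and timestamps.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class PageChangeWatcher {

    public static void main(String[] args) throws Exception {
        // Placeholder URL and state file; point them at the page you watch.
        String url = "https://example.com/listing";
        Path stateFile = Path.of("last-hash.txt");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Hash the page body so only a short fingerprint is stored, not the HTML.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(body.getBytes(StandardCharsets.UTF_8));
        String currentHash = HexFormat.of().formatHex(digest);

        String previousHash = Files.exists(stateFile) ? Files.readString(stateFile).trim() : "";
        if (!currentHash.equals(previousHash)) {
            System.out.println("Page changed: " + url);
            Files.writeString(stateFile, currentHash);   // remember the new state
        } else {
            System.out.println("No change detected.");
        }
    }
}
```

Run it from cron or the Windows Task Scheduler and replace the println with an email or chat notification of your choice.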

Categories
Uncategorized

Death By Captcha new feature Recaptcha v3 support

After a great deal of work, the Death By Captcha developers have finally released their new feature to the world: Recaptcha v3 support.

As you may already know, the Recaptcha v3 API is quite similar in many ways to the previous token-based one (Recaptcha v2). In Recaptcha v3, the system evaluates and scores each user to determine whether it is a bot or a human, then uses that score to decide whether to accept requests from that user; lower scores are identified as bots. Check this link for the API documentation and client-based sample code.
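To illustrate the scoring flow the paragraph describes (this is the site-side check, not Death By Captcha's solving API), here is a sketch of how a site verifies a v3 token: the token from the page is posted to Google's siteverify endpoint, which returns JSON with a score, and the site accepts or rejects the request against a threshold (0.5 below is just a common default). The secret key and token are placeholders.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RecaptchaV3Check {

    public static void main(String[] args) throws Exception {
        String secret = "YOUR_SECRET_KEY";   // placeholder site secret
        String token = "TOKEN_FROM_CLIENT";  // g-recaptcha-response produced in the browser

        String form = "secret=" + URLEncoder.encode(secret, StandardCharsets.UTF_8)
                + "&response=" + URLEncoder.encode(token, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://www.google.com/recaptcha/api/siteverify"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        String json = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
        System.out.println(json);   // e.g. {"success": true, "score": 0.9, "action": "submit", ...}

        // Crude score extraction to avoid a JSON library; real code should parse properly.
        double score = 0.0;
        int i = json.indexOf("\"score\"");
        if (i >= 0) {
            String tail = json.substring(json.indexOf(':', i) + 1);
            score = Double.parseDouble(tail.split("[,}]")[0].trim());
        }
        System.out.println(score >= 0.5 ? "Treated as human" : "Treated as a bot");
    }
}
```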

With very competitive pricing, Death By Captcha is at the cutting edge of solving tools in the market. Check it out: you can receive free credit for testing from this LINK; ping the service with the promo code below to receive your captchas.

Use the promo code “Scrapepro” and you’ll get 3k Captchas credit for free.

P. S. See the ReCaptcha v2 test results.

Categories
Miscellaneous

Endcaptcha now solving Recaptcha V2!

So far, the latest developments from the captcha providers (Google, NuCaptcha, etc.) are no match for the captcha bypassers, and Endcaptcha is living proof of it.
The Endcaptcha developers have been working hard to make this new feature possible, and they’re finally releasing Recaptcha V2 support!

Categories
Uncategorized

Smartproxy Review

Getting precise and localized data is becoming difficult. Advanced proxy networks are the only thing keeping some companies’ intense data-gathering operations running.

Residential proxies are in extremely high demand, and there are only a few networks available that can offer millions of IP addresses around the world. 

Smartproxy is one of those networks, rapidly growing to offer the best product in residential and data center proxies.

Categories
Development

Make crawling easy with Real Time Crawler of Oxylabs.io

Nowadays, it’s hard to imagine our life without search systems. “If you don’t know something, google it!” is one of the most popular maxims of our time. But how many people use Google in an optimal way? A lot of developers use Google search operators to get the answers they need as fast as possible.

Even this is not enough today! Large and small companies need terabytes of data to make their businesses profitable. It’s necessary to automate the search process and make it reliable so that users are supplied with fresh news, updates and posts. In today’s article we will consider a very helpful tool for collecting fresh data: Oxylabs’ Real-Time Crawler (RTC). Let’s start!
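As a first taste of what the article covers, here is a hedged sketch of calling RTC over HTTPS: a JSON job description is POSTed with Basic authentication and the crawled result comes back synchronously. The endpoint realtime.oxylabs.io/v1/queries, the source and query fields, and the credentials reflect Oxylabs' public documentation at the time and should be treated as assumptions to verify against the current docs.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RealTimeCrawlerDemo {

    public static void main(String[] args) throws Exception {
        // Placeholder credentials for an Oxylabs RTC account.
        String user = "RTC_USERNAME";
        String pass = "RTC_PASSWORD";
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));

        // One job definition: ask RTC to fetch a Google search results page.
        // Endpoint and field names follow Oxylabs' public docs and may change.
        String payload = "{\"source\": \"google_search\", \"query\": \"web scraping\"}";

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://realtime.oxylabs.io/v1/queries"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // RTC answers synchronously with a JSON envelope containing the page content.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```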