Tag: web scraping

My experience of manual, no-code scrape of a bot-protected site

Post author By admin
Post date April 5, 2024
No Comments on My experience of manual, no-code scrape of a bot-protected site

Recently we discovered a highly protected site — govets.com. Since the number of target brand items of the site was not big (under 3K), I decided to get target data using the handy tools for a fast manual scrape.

Tags JSON, scrape protection, web scraping

Development

Crawling web pages with Netpeak Spider in conjunction with MarsProxies, NetNut and IPRoyal proxies

Post author By admin
Post date March 1, 2024
No Comments on Crawling web pages with Netpeak Spider in conjunction with MarsProxies, NetNut and IPRoyal proxies

Agree, it’s hard to overestimate the importance of information – “Master of information, master of situation”. Nowadays, we have everything to become a “master of situation”. We have all needed tools like spiders and parsers that could scrape various data from websites. Today we will consider scraping the Amazon with a web spider equipped with proxy services.

Tags crawling, proxy, service, web scraping

Development

Choosing the Best Proxies for Web Scraping

Post author By admin
Post date September 21, 2023
No Comments on Choosing the Best Proxies for Web Scraping

From eCommerce and market research to competitive analysis and more, web scraping has become an integral part of data collection. And for some, it’s the secret sauce for success.

But with great scraping power comes great responsibility.

Web scraping can result in IP bans and other harsh restrictions. To avoid these issues, many turn to proxies, which act as intermediaries between your requests and the target website. In this article, we’ll explore the top 3 proxy types for web scraping and focus on the key benefits of each proxy. Let’s go!

Tags proxy, service, web scraping

Challenge Development

Python, Selenium for custom browser automation scraper

Post author By admin
Post date April 10, 2023
No Comments on Python, Selenium for custom browser automation scraper

Recently we’ve got the tricky website, its data being of dynamic nature. Yet we’ve applied the modern day scraping tools to fetch data. We’ve develop an effective Python scraper using Selenium library for browser automation.

About the project

We were asked to have a look at a retailer website.

And our task was to gather data on 210 products’ availability in 945 shops. The scrape resulted in about 200K data entries in a CSV format. Moreover, every line contained information about name, link, brand, store and the availability of a product. Below you can familiarise yourself with a small data sample we were able to gather.

Tags browser-automation, Python, Selenium, web scraping

Challenge

Web Scraping: 5 pros and cons

Post author By admin
Post date March 17, 2023
No Comments on Web Scraping: 5 pros and cons

Web scraping, also known as data mining or web harvesting, is the process of extracting data from websites automatically. The extracted data can be used for various purposes, such as market research, price monitoring, sentiment analysis, and many more. However, web scraping has both advantages and disadvantages. In this article, we will discuss the five main pros and cons of web scraping.

Tags web scraping

Challenge Development

AirTable scrape challenge

The problem is that data are loaded highly dynamically. HTML contains only the information that you currently see on the browser screen.

If there are a lot of records then it is difficult to collect such a table. One of the possible ways is to calculate the size of the screen and rows in the table. Then using the browser automation and use to make a script that will scroll through it bit by bit and collect data.

Is there any other feasible way to get data of a table? For example there is a HTTP requests coding way to get dynamic data.

JS infinite scroll does not work for AirTable either.

Please comment down here if having some tips, hints.

Tags web scraping

Development

Google Sheets or MS Excel to scrape business directories ?

Post author By admin
Post date September 27, 2022
No Comments on Google Sheets or MS Excel to scrape business directories ?

We’ve already stated some Tips and Tricks of scraping business directories or data aggregators sites. Yet recently someone has asked us to do aggregators’ scraping in the context of Google Sheets and/or MS Excel.

Tags business directory, web scraping

Challenge Development

Yelp scraping for high quality B2B leads

Post author By mihaschenko
Post date September 16, 2022
No Comments on Yelp scraping for high quality B2B leads

Recently we’ve performed the Yelp business directory scrape for acquiring high quality B2B leads (company + CEO info). This forced us to apply many techniques like proxying, external company site scrape, email verification and more.

Tags business directory, JAVA, web scraping

Development Web Scraping Software

My experience of choosing web scraping platform for company critical data feed

Post author By admin
Post date February 23, 2022
No Comments on My experience of choosing web scraping platform for company critical data feed

Service	Residential	Cost/month	Traffic/month	$ per GB	Rotating	IP whitelisting	Performance and more	Notes
MarsProxies		N/A	N/A	3.5	yes	yes	500K+ IPs, 190+ locations Test results	SOCKS5 supported Proxy grey zone restrictions
Oxylabs.io		N/A	25 GB	9 - 12 "pay-as-you-go" - 15	yes	yes	100M+ IPs, 192 countries - 30K requests - 1.3 GB of data - 5K pages crawled	Not allowing to scrape some of grey zone targets, incl. Linkedin.
Smartproxy		Link to the price page	N/A	5.2 - 7 "pay-as-you-go" - 8.5	yes	yes	65M+ IPs, 195+ countries	Free Trial Not allowing to scrape some of grey zone targets, incl. Linkedin.
Infatica.io		N/A	N/A	3 - 6.5 "pay-as-you-go" - 8	yes	yes	Over 95% success *Bans from Cloudflare are also few, less than 5%.	Black list of sites —> proxies do not work with those. 1000 ports for one Proxy List Up to 20 Proxy Lists at a time Using via API Tool ISP-level targeting Rotation time selection
Mango Proxy		N/A	1-50 GB	3-8"pay-as-you-go" - 8	yes	yes	90M+ IPs, 240+ countries
IPRoyal		N/A	N/A	$4.55	yes	yes	32M+ IPs, 195 countries	Not allowing to scrape some of grey zone targets, incl. Facebook. List of bloked sites.
Rainproxy.io	yes	$ 4	from 1 GB	4	yes
BrightData	yes			15
ScrapeOps Proxy Aggregator	yes	API Credits per month	N/A	N/A		yes	Allows multithreading, the service provides browsers at its servers. It allows to run N [cloud] browsers from a local machine. The number of threads depends on the subscription: min 5 threads.	The All-In-One Proxy API that allows to use over 20+ proxy providers from a single API
Lunaproxy.com	yes	from $15	x Gb per 90 days	0.85 - 5				Each plan allows certain traffic amount for 90 days limit.
LiveProxies.io	yes	from $45	4-50 GB	5 - 12	yes	yes		Eg. 200 IPs with 4 GB for $70.00, for 30 days limit.
Charity Engine -docs	yes	-	-	starting from 3.6 Additionally: CPU computing - from $0.01 per avg CPU core-hour - from $0.10 per GPU-hour - source.			failed to connect so far
proxy-sale.com	yes	from $17	N/A	3 - 6 "pay-as-you-go" - 7	yes	yes	10M+ IPs, 210+ countries	30 days limit for a single proxy batch
Tabproxy.com	yes	from $15	N/A	0.8 - 3 (lowest price is for a chunk of 1000 GB)	yes	yes	200M+ IPs, 195 countries	,30-180 days limit for a single proxy batch (eg. 5 GB)
proxy-seller.com	yes	N/A	N/A	4.5 - 6 "pay-as-you-go" - 7	yes	yes	15M+ IPs, 220 countries	- Generation up to 1000 proxy ports in each proxy list - HTTP / Socks5 support - One will be able to generate an infinite number of proxies by assigning unique parameters to each list

Tags service, web scraping

Challenge SaaS

Web Scraper IDE to scrape tough websites

Post author By admin
Post date November 10, 2021
No Comments on Web Scraper IDE to scrape tough websites

Recently we encountered a new powerful scraping service called Web Scraper IDE [of Bright Data]. The life-test and thorough drill-in are coming soon. Yet now we want to highlight its main features that has badly (in positive sense, strongly) impressed us.

Tags business directory, CloudFlare, proxy, web scraping