Categories
Challenge

What are the best online resources to acquire data?

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scrape are data aggregators. Why are they top in data extraction?
They are top because they provide the fullest, most comprehensive data [sets]. The data in them are highly categorized. Therefore you do not need to crawl and fetch other resources and then combine multiple-resource data.

Those sites fall into 2 categories:

  1. Goods and services aggregators. Eg. AliExpress, Amazon, Craiglist.
  2. Personal data and companies data aggregators. Eg. Linkedin, Xing, YellowPages. For such aggregators another name is business directories.

The first category of sites and services is quite wide-spread. These sites and services promote their goods with the goal of being well-known online, to have as many backlinks as possible to them.

The second category, the business directories, does not tend to reveal its data to the public. These directories rather promote their brand and give scraping bots minimum opportunity for data acquiring*.

Consider the following picture where a company’s data aggregator gives to the user only 2 input fields: what and where.

You can find more of how to scrape data aggregators in this post.

————–
*You have to adhere to the ToS of each particular website/web service when you perform its data scraping.

Categories
Challenge Development

How do I get pass dynamic “load more” btn?

Recently I’ve got a question:

How do I get pass the dynamic “load more” button using a Python web scraper?

Categories
Challenge Development Web Scraping Software

Brigth Data residential proxy for extracting from a data aggregator

In this post I’d like to share my experience with scraping data aggregator/business directory using the residential proxy of the Bright Data proxy provider in conjuction with its proxy manager.

Categories
Challenge

Prevent automated services from solving captcha?

Question: Is there any way to include captcha on the site and at the same time prevent services like 2captcha from resolving it?

Categories
Challenge

How to insert and configure reCAPTCHA v2 code in php

We’ve already introduced you to the theory behind the new NO CAPTCHA reCAPTCHA v2, but now we come to the practical integration part. Here we’ll share how to insert and configure “NO CAPTCHA reCAPTCHA” into a web page.

Categories
Challenge

No CAPTCHA reCaptcha challenge

Sooner or later a new generation of spam protection methods will emerge to block all unwanted site visitors. The recently launched Google “No CAPTCHA reCaptcha” or ReCaptcha v2.0 could just be such a method.

This new behaviour analysis tool is getting more and more attention both from the site owners and from scraping engines who are trying to break it. Since Google does not reveal any secrets of its operation, we want to share with you the techniques used in this new smart analysis CAPTCHA that determines between bot and human. Let s look inside.

Categories
Challenge Development

Tips & Tricks for Scraping Business Directories

business directoryRecently I received a question in my mail box about scraping data aggregate sites (aka yellow pages) or business directories.
I replied to him directly, but our conversation on business directories was an interesting one that I thought you guys would find useful. 

Here’s the question:

I am interested in scraping the database in such a website www.1881.no. My guess is that I would need a webdriver, like Selenium to do the job. I am very newbie to this field, but I believe if given some pointers, I can get some data out.

Could you please provide me with pointers on how to extract data from this website.
Sandeep

As a generic answer, I’ll provide you with some basics of scraping those business (and private life) directories.

Categories
Challenge

Q&A with ScrapeHero

In this post we’d like to share an interview with a young service called ScrapeHero. We’ve interviewed Tony Paul (marketing head) and this is what he had to say.

Categories
Challenge Development

CloudFlare – a limited feature anti-content-duplicate tool

Here we come to the next anti-scrape tool, called CloudFlare, former ScrapeShield.

CloudFlare

The CloudFlare app has been developed by CloudFlare to guard a site’s content. Its features are limited number, but it’s still an interesting tool to look at for anyone interested in web scraping.

Categories
Challenge Review

BotDefender Analysis

Here I’d like you to get familiar with an online scraping protection service called BotDefender. It’s interesting both to know how to use it (in case you want to protect your data) and to understand how it works in case you ever come across it while collecting data.