Category: Challenge

Business directory simple scraper (python) at pythonanywhere

Post author By admin
Post date July 3, 2020

My goal was to retrieve data from a web business directory.

Since scraping business directories is among the most challenging tasks (besides SERP scraping), there were some basic questions I had to answer first:

Is there any scrape protection set at that site?
How much data is in that web business directory?
What kind of queries can I run to find all the directory’s items?

Tags business directory

Challenge

Scraping a Javascript-dependent website with puppeteer

Post author By admin
Post date June 25, 2020


In today’s Web 2.0, many business websites use JavaScript to protect their content from web scraping and other undesired bot visits. In this article we share both the theory and the practice of scraping JS-dependent/JS-protected websites.

Tags Javascript, Node.js, scrape protection

Challenge Development

Scrape with Google Apps Script

Post author By admin
Post date January 16, 2020

In this post I want to show you how I managed to complete the challenge of scraping a site with Google Apps Script (GAS).

Tags Google

Challenge

Is there any way to skip CAPTCHA?

Post author By admin
Post date December 15, 2019

JavaScript powered CAPTCHA

Most answers to this question on internet forums come from services that solve CAPTCHAs automatically. They offer to solve the CAPTCHA rather than to skip it entirely.

Tags captcha, free, Recaptcha

Challenge

What are the best online resources to acquire data?

Post author By admin
Post date July 6, 2019

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scraping are data aggregators. Why are they the top choice for data extraction?
Because they provide the fullest, most comprehensive data [sets], and the data in them are highly categorized. You therefore do not need to crawl and fetch other resources and then combine data from multiple sources.

Those sites fall into 2 categories:

Goods and services aggregators, e.g. AliExpress, Amazon, Craigslist.
Personal and company data aggregators, e.g. LinkedIn, Xing, YellowPages. Such aggregators are also known as business directories.

The first category of sites and services is quite widespread. These sites promote their goods with the goal of being well known online and of having as many backlinks to them as possible.

The second category, the business directories, tends not to reveal its data to the public. These directories instead promote their brand and give scraping bots minimal opportunity to acquire data*.

Consider the following picture, where a company data aggregator gives the user only 2 input fields: what and where.
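With only two input fields, covering a directory usually means enumerating many "what" and "where" combinations. The snippet below is a minimal sketch of that idea, not a method taken from the linked post. The base URL, the `what`/`where` parameter names, and the `build_search_urls` helper are all hypothetical, and a real directory will use its own.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the directory's real search URL.
BASE_URL = "https://directory.example.com/search"


def build_search_urls(whats, wheres):
    """Return one search URL per (what, where) combination."""
    return [
        f"{BASE_URL}?{urlencode({'what': what, 'where': where})}"
        for what, where in product(whats, wheres)
    ]


# Illustrative inputs only; real lists would come from category and city lists.
urls = build_search_urls(["plumber", "dentist"], ["Boston", "New York"])
for url in urls:
    print(url)
```

The number of queries grows as the product of the two lists, so it helps to estimate the total before sending any requests.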

You can find more on how to scrape data aggregators in this post.

————–
*You have to adhere to the ToS of each particular website/web service when you scrape its data.

Tags business directory

Challenge Development

How do I get past a dynamic “load more” button?

Post author By admin
Post date January 6, 2019

Recently I got a question:

How do I get past the dynamic “load more” button using a Python web scraper?
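One common way around such a button, not necessarily the approach taken in the post, is to skip the button and request the paginated data it loads. The sketch below assumes the button is backed by a page-numbered endpoint. `collect_all_items` and `fetch_page` are hypothetical names, and the fake fetcher stands in for a real HTTP call or a browser-driven click.

```python
def collect_all_items(fetch_page, max_pages=100):
    """Request successive pages until one comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page means there is nothing more to load
            break
        items.extend(batch)
    return items


def fake_fetch_page(page):
    """Stand-in for a real request; returns canned data for two pages."""
    data = {1: ["a", "b"], 2: ["c"]}
    return data.get(page, [])


all_items = collect_all_items(fake_fetch_page)
print(all_items)
```

The `max_pages` cap guards against an endpoint that never returns an empty page.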

Tags Javascript, Selenium

Challenge Development Web Scraping Software

Bright Data residential proxy for extracting from a data aggregator

Post author By admin
Post date August 11, 2018

In this post I’d like to share my experience of scraping a data aggregator/business directory using the residential proxies of the Bright Data proxy provider in conjunction with its proxy manager.

Tags business directory, proxy

Challenge

Prevent automated services from solving captcha?

Post author By admin
Post date November 30, 2017

Question: Is there any way to include a CAPTCHA on a site and at the same time prevent services like 2captcha from solving it?

Tags captcha

Challenge

How to insert and configure reCAPTCHA v2 code in php

Post author By admin
Post date August 12, 2015

We’ve already introduced you to the theory behind the new NO CAPTCHA reCAPTCHA v2, but now we come to the practical integration part. Here we’ll share how to insert “NO CAPTCHA reCAPTCHA” into a web page and configure it.

Tags captcha, PHP, Recaptcha