Categories
Challenge Development

Simple business directory scraper (Python) on PythonAnywhere

My goal was to retrieve data from a web business directory.

Since scraping business directories is the most challenging scraping task (besides SERP scraping), there are some basic questions I need to answer first:

  1. Is there any scrape protection set at that site?
  2. How much data is in that web business directory?
  3. What kind of queries can I run to find all the directory’s items?
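
For the third question, one common approach is to enumerate short search prefixes and merge the results. The sketch below assumes the directory's search matches on name prefixes; `search` is a hypothetical callable standing in for whatever request the real site needs, and the two-letter prefix length is only an illustrative choice.

```python
from itertools import product
from string import ascii_lowercase


def search_prefixes(length=2, alphabet=ascii_lowercase):
    """Yield every possible search prefix of the given length."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)


def collect_items(search, length=2):
    """Run every prefix query and merge the results, dropping duplicates.

    `search` is a hypothetical callable (an assumption, not a real API):
    it takes a query string and returns (item_id, item) pairs.
    """
    seen = {}
    for prefix in search_prefixes(length):
        for item_id, item in search(prefix):
            # Keep the first copy of each item; later duplicates are ignored.
            seen.setdefault(item_id, item)
    return seen
```

If a single prefix returns more results than the site is willing to show, a longer prefix length may be needed for that branch.
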

Categories
Challenge

Most popular web scraping targets and how to scrape them

  1. Online marketplaces
    Marketplaces are where people offer their products for sale, similar to garage sales but online (e.g. eCrater, www.1188.no).
    They are easy to scrape since they are usually free and do not tend to protect their data.
  2. Business directories
    These are typically huge online directories aimed at a general audience (e.g. Yellow Pages). They do protect their data to avoid duplication and loss of audience. See some posts on this.

Categories
Challenge Development

Scraping a JavaScript-dependent website with Puppeteer

On today’s web, many business websites use JavaScript to protect their content from web scraping and other undesired bot visits. In this article we share both the theory and the practical implementation of scraping JS-dependent/JS-protected websites.

Categories
Challenge

Is there any way to skip CAPTCHA?

Most answers to this question on internet forums come from services that solve CAPTCHAs automatically. They offer to solve the CAPTCHA rather than to skip it entirely.

Categories
Challenge Development

How do I get past a dynamic “load more” button?

Recently I received a question:

How do I get past the dynamic “load more” button using a Python web scraper?
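
One common way to approach this, sketched here under assumptions, is to skip the button itself and request the paginated data it loads, stopping when a page comes back empty. `fetch_page` is a hypothetical callable standing in for the site-specific request; the page numbering and the `max_pages` safety cap are also illustrative choices, not something the question specifies.

```python
def fetch_all(fetch_page, max_pages=1000):
    """Collect items page by page until a page returns nothing.

    `fetch_page` is a hypothetical callable (an assumption): given a
    1-based page number, it returns the list of items that one
    "load more" click would add to the page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            # An empty batch is taken to mean there is nothing left to load.
            break
        items.extend(batch)
    return items
```

In practice `fetch_page` would wrap the request that the browser sends when the button is clicked, which can usually be found in the network tab of the browser's developer tools.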

Categories
Challenge

How to insert and configure reCAPTCHA v2 code in PHP

We’ve already introduced you to the theory behind the new NO CAPTCHA reCAPTCHA v2, but now we come to the practical integration part. Here we’ll share how to insert and configure “NO CAPTCHA reCAPTCHA” into a web page.

Categories
Challenge

No CAPTCHA reCAPTCHA challenge

Sooner or later a new generation of spam protection methods will emerge to block all unwanted site visitors. The recently launched Google “No CAPTCHA reCAPTCHA”, or reCAPTCHA v2.0, could be just such a method.

This new behaviour analysis tool is getting more and more attention, both from site owners and from scraping engines that are trying to break it. Since Google does not reveal any secrets of its operation, we want to share with you the techniques used in this new smart analysis CAPTCHA that distinguishes between bots and humans. Let’s look inside.

Categories
Challenge

Q&A with ScrapeHero

In this post we’d like to share an interview with a young service called ScrapeHero. We’ve interviewed Tony Paul (marketing head) and this is what he had to say.

Categories
Challenge Review

BotDefender Analysis

Here I’d like you to get familiar with an online scraping protection service called BotDefender. It’s interesting both to know how to use it (in case you want to protect your data) and to understand how it works in case you ever come across it while collecting data.