Categories
Development Guest posting

Captcha solving with Java and why you should avoid it

In this blog post we show how you can solve [Re]captcha with Java and some third-party APIs, and why you should probably avoid captchas in the first place. For the Python code (+ captcha API) see that post. The post author is Kevin Sahin from ScrapingNinja.co. Captcha solving “Completely Automated Public Turing test to tell Computers and […]

Categories
Challenge Development

How do I get past a dynamic “load more” button?

Recently I got a question: how do I get past the dynamic “load more” button using a Python web scraper?
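One common way around such a button, assumed here rather than taken from the post, is to skip the click entirely and request the paginated endpoint that the button calls behind the scenes. The sketch below shows only the pagination loop; `fetch_page` and the fake data are hypothetical stand-ins for a real HTTP call.

```python
from typing import Callable, Iterator, List


def iter_load_more(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield items from successive pages until an empty page is returned.

    fetch_page is a hypothetical callable: given a 1-based page number it
    returns that page's items as a list (for example by requesting the
    JSON endpoint that the button calls and decoding the response).
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            # An empty page is assumed to mean there is nothing left to load.
            break
        yield from items


# Stand-in for a real endpoint: two pages of data, then nothing.
FAKE_PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch(page: int) -> List[dict]:
    return FAKE_PAGES.get(page, [])


if __name__ == "__main__":
    print([item["id"] for item in iter_load_more(fake_fetch)])
```

The `max_pages` cap is a safety net for endpoints that never return an empty page. If the site offers no such endpoint, a browser automation tool that clicks the button repeatedly is the usual fallback.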

Categories
Development

Selenium using proxy gateway, how?

I am developing a web scraping project using Selenium. Since the project needs rotating proxies [in mass quantities], I’ve turned to proxy gateways (nohodo.com, charityengine.com and some others). The problem is how to incorporate those proxy gateways into Selenium for surfing the web.
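A minimal sketch of one possible answer, assuming Chrome is the browser and the gateway exposes a plain host and port: Chrome accepts a `--proxy-server` command-line switch, which Selenium can pass through `ChromeOptions`. The host `gateway.example.com` and port `8080` are placeholders, not real gateway details, and `make_chrome_driver` is a hypothetical helper name.

```python
def proxy_argument(host: str, port: int, scheme: str = "http") -> str:
    """Build the Chrome command-line switch that routes traffic via a proxy."""
    return f"--proxy-server={scheme}://{host}:{port}"


def make_chrome_driver(host: str, port: int):
    """Start Chrome through Selenium with all traffic sent to the gateway.

    Selenium is imported lazily so the helper above can be used and
    tested on machines where Selenium is not installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    # Placeholder gateway address; substitute the one your provider gives you.
    print(proxy_argument("gateway.example.com", 8080))
```

Since the gateway rotates the exit IP on its side, the browser only ever needs this single address. Gateways that require a username and password usually need a different mechanism, because this switch is not designed to carry credentials.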

Categories
Development

Headless browser python scraper at pythonanywhere

Recently I decided to use pythonanywhere.com for running Python scripts against JavaScript-heavy websites. Originally I tried to leverage the dryscrape library, but it did not work, and the helpful support team explained why: “…unfortunately dryscrape depends on WebKit, and WebKit doesn’t work with our virtualisation system.”

Categories
Miscellaneous SaaS

CloudScrape to transform into Dexi.io

We have already written some posts on CloudScrape, a web scraping service startup based in Copenhagen, Denmark. The service now has a new look and new features for data extraction and business intelligence, along with a new name: Dexi.io.

Categories
Development

Extract browser’s Local Storage with Python

Some of you may be wondering whether it’s possible to extract a web browser’s local storage by web scraping.
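As a rough illustration, and assuming a Selenium-style driver is what does the scraping, local storage can be read by running a small piece of JavaScript inside the page and returning the result to Python. The helper name `read_local_storage` is hypothetical; `execute_script` is the standard WebDriver method for running JavaScript in the current page.

```python
# JavaScript that copies every key/value pair of window.localStorage
# into a plain object, which WebDriver hands back to Python as a dict.
LOCAL_STORAGE_SCRIPT = """
var out = {};
for (var i = 0; i < window.localStorage.length; i++) {
    var key = window.localStorage.key(i);
    out[key] = window.localStorage.getItem(key);
}
return out;
"""


def read_local_storage(driver) -> dict:
    """Return the current page's local storage as a dict of strings.

    `driver` is any object with a WebDriver-style execute_script method,
    such as a Selenium browser instance that has already loaded the page.
    """
    return dict(driver.execute_script(LOCAL_STORAGE_SCRIPT))
```

Local storage is scoped per origin, so the driver has to load a page from the target site first; only that site's entries are visible to the script.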

Categories
Development

Solve ReCaptcha with Selenium (python)

I’ve already written about how the new No CAPTCHA ReCaptcha works, and even had some success breaking it with iMacros browser automation. But the latest scraping tools are, for the most part, driven by Python, so now I want to try the same experiment with Selenium + Python.

Categories
Uncategorized

My site is being scraped, how can I prevent being scraped?

As anyone who has spent any time in the scraping field will know, there are plenty of anti-scraping techniques on the market. And since I regularly get asked what the best way to prevent someone from scraping a site is, I thought I’d do a post rounding up some of the most popular methods. If you […]

Categories
Challenge Development

Tips & Tricks for Scraping Business Directories

Recently I received a question in my mailbox about scraping data aggregator sites (aka yellow pages) or business directories. I replied to the sender directly, but our conversation on business directories was an interesting one that I thought you guys would find useful. Here’s the question: I am interested in scraping the database in such a […]

Categories
Web Scraping Software

Scraping software and services landscape

After almost 3 years of running this scraping blog and reviewing dozens of products, in this small post I’d like to categorise the web scraping tools and methods available to end users. Here are typical examples of scrapers in each category.