Categories
Development

How to parse messy encoded HTML

Let’s suppose you want to extract a price with a currency sign from a web page (e.g. £220.00), but its HTML code is this:

<div>cost: &#163;220.00</div>

which is obviously encoded HTML: the pound sign is written as the numeric character reference &#163;.
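One possible way to handle the snippet above is sketched here, assuming Python and only its standard library (not necessarily the approach the full post takes). `html.unescape` decodes the character reference, and a regular expression then picks out the price. The pattern and the variable names are illustrative assumptions, not part of the original example.

```python
import html
import re

# The encoded fragment from the example above.
raw = "<div>cost: &#163;220.00</div>"

# Decode HTML character references (&#163; becomes £).
decoded = html.unescape(raw)

# Assumed pattern: a currency sign followed by digits and optional two decimals.
match = re.search(r"[£$€]\d+(?:\.\d{2})?", decoded)
price = match.group(0) if match else None

print(price)  # £220.00
```

Decoding before matching keeps the pattern simple, since it only has to deal with the literal currency sign and not with every way the sign could be encoded.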

Categories
Development Web Scraping Software

Dexi.io REST API in PHP (example)

In this post, I’d like to demonstrate how to leverage the Dexi.io (CloudScrape) API along with its PHP client library (also available in Ruby and C#).

Categories
Miscellaneous

Content Grabber self-contained (standalone) agent

As web scraping is becoming easier to use, more and more people are able to leverage the world’s web resources. As this trend grows, structured data from the web empowers businesses and enables a wave of new business ideas to become a reality. Now there is a new technology on the market, called “self-contained agents”, that might just turn this wave into a tsunami!

Categories
Development

Extract browser’s Local Storage with Python

Some of you may be wondering whether it’s possible to extract a web browser’s local storage by web scraping.

Categories
Web Scraping Software

A tool to extract phone numbers from a list of URLs

Today I got a question from one of my readers asking if there is a good out-of-the-box solution for crawling multiple websites for contact information. 

Categories
Development

Solve ReCaptcha with Selenium (Python)

I’ve already written about how the new No CAPTCHA ReCaptcha works, and even had some success breaking it with iMacros browser automation. But the latest scraping tools are, for the most part, driven by Python, so now I want to try the same experiment with Selenium + Python.

Categories
Development SEO and Growth Hacking

ReCaptcha to be solved with iMacros

Recently I’ve been getting requests for a tutorial showing how to solve Google’s No CAPTCHA ReCaptcha. I’ve introduced it before and promised to work out a script to automate solving it. And here’s what I’ve come up with.

Categories
SEO and Growth Hacking

Automate your social marketing by bulk tweeting all your blog posts

A good social presence is important for any successful blogger. But running a full-time blog and keeping up your tweet volume is incredibly time-consuming. It would be so much more convenient if you could set up bulk tweets for all your posts. Recently, as I was doing some reCaptcha automation, I came up with an idea to use the iMacros browser plugin to automate just such a task. Here’s how I did it…

Categories
Development Web Scraping Software

Content Grabber with free proxy account integration for scraping business directories

Professional data extraction requires adequate proxying to keep scraping robots anonymous. When attempting to extract large data sets (over 1M records, e.g. business directories), a reliable and fast proxy service is needed.

Sequentum has released the Nohodo proxy service integration for Content Grabber. Nohodo provides a free account for Content Grabber users (up to 5000 requests per month). The feature is available to both trial users and regular customers. Here’s how it works…

Categories
Featured Web Scraping Software

Dexi.io Review

Dexi.io is a powerful scraping suite. This cloud scraping service provides development, hosting and scheduling tools. The suite might be compared with Mozenda for building web scraping projects and running them in the cloud for user convenience. Yet it also includes an API, with each scraper being a JSON definition, similar to other services like import.io, Kimono Labs and ParseHub.