Tips & Tricks for Scraping Business Directories

Recently I received a question in my mailbox about scraping data aggregator sites (aka yellow pages), or business directories.
I replied to the reader directly, but our conversation on business directories was an interesting one that I thought you guys would find useful.

Here’s the question:

I am interested in scraping the database in such a website www.1881.no. My guess is that I would need a webdriver, like Selenium to do the job. I am very newbie to this field, but I believe if given some pointers, I can get some data out.

Could you please provide me with pointers on how to extract data from this website.
Sandeep

As a generic answer, I’ll provide you with some basics of scraping those business (and private life) directories.

Aggregators’ features

First of all, you need to be clear about the characteristics of data aggregators:

These kinds of services aggregate a huge amount of data, and the total is often hard to estimate. Most likely you will need to develop and run a special script just to estimate how much data the site holds.
You need to query the site to fetch the data, since there are no predefined pages to crawl. An example query is http://www.1881.no/?query=car. So you should prepare a proper list of search terms, e.g. [car, pizza, home etc…], to query against the site. Study up on GET and POST HTTP requests.
The data in those aggregators changes over time, so you should set up a scraper/crawler to detect outdated info. That usually involves a special algorithm and thus is much harder than a straightforward scrape.
These kinds of sites are especially vigilant about anti-scraping measures meant to avoid data leaks. So be ready for unexpected pitfalls and hard-to-break firewalls. You might want to read some of my previous posts about anti-scraping tools to get a better understanding of some of them.
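
To make the querying point concrete, here is a minimal sketch of turning a list of search terms into GET query URLs. It assumes the site accepts a single `query` parameter, as in the example URL above; the function name `build_query_urls` and the term list are just for illustration, and a real directory may need extra parameters (paging, region and so on) or a POST request instead.

```python
from urllib.parse import urlencode

# Base URL and parameter name taken from the example query in this post.
BASE_URL = "http://www.1881.no/"


def build_query_urls(terms, base_url=BASE_URL, param="query"):
    """Return one GET URL per search term, with the term URL-encoded."""
    return [f"{base_url}?{urlencode({param: term})}" for term in terms]


if __name__ == "__main__":
    # Illustrative search terms; in practice this list would be much longer.
    for url in build_query_urls(["car", "pizza", "home"]):
        print(url)
```

Using `urlencode` rather than plain string concatenation keeps terms with spaces or non-ASCII characters (common in a Norwegian directory) valid in the URL.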

Scraping tips

Because the amount of data on these sites is so huge, you’ll want to store it in an appropriate database. Setting up a database will make sure your data is easily accessible later.
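
As a rough illustration, the sketch below stores scraped listings in SQLite and keeps a hash of each record, so that a later re-scrape can tell whether a listing is new, changed or unchanged. This is only one simple way to spot outdated info, not a full change-detection algorithm. The table layout, the `save_listing` function and the idea of a per-listing `key` (for example the listing's URL) are my assumptions for the example.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the listings table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               key TEXT PRIMARY KEY,
               data TEXT NOT NULL,
               content_hash TEXT NOT NULL,
               last_seen TEXT NOT NULL
           )"""
    )
    return conn


def save_listing(conn, key, record):
    """Store a record and report whether it is 'new', 'changed' or 'unchanged'."""
    # Serialize with sorted keys so the same record always gives the same hash.
    data = json.dumps(record, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(data.encode("utf-8")).hexdigest()
    now = datetime.now(timezone.utc).isoformat()

    row = conn.execute(
        "SELECT content_hash FROM listings WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        status = "new"
    elif row[0] != digest:
        status = "changed"
    else:
        status = "unchanged"

    # Insert or overwrite the row; last_seen is refreshed on every scrape.
    conn.execute(
        "INSERT OR REPLACE INTO listings (key, data, content_hash, last_seen) "
        "VALUES (?, ?, ?, ?)",
        (key, data, digest, now),
    )
    conn.commit()
    return status
```

Pass a file path to `open_db` to keep the data between runs; the default in-memory database is only handy for trying things out.
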
To remain undetected (unbanned) by such aggregators you’ll need to adopt these two scraping methods:
- IP-proxying. See some posts on using IP-proxying with scraping software, especially “Reliable rotating proxies for business directories scrape” or “Using residential proxies with a spider software”.
- Imitating human behaviour by using some browser automation tools (Selenium, iMacros and others).
There are some off-the-shelf scraping tools that are pretty well suited for such fine and tedious tasks. You might look at an example of such scraping software accommodating a free proxy network account for a business directories scrape.
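
If you would rather script it yourself, here is a minimal sketch of the two ideas above: rotating requests through a pool of proxies, and pausing a random amount of time between requests. Note that the random pause is only a crude stand-in for real browser automation such as Selenium. The proxy addresses are hypothetical placeholders, and the function names and delay range are my own assumptions, not settings recommended by any particular site.

```python
import itertools
import random
import time
import urllib.request

# Hypothetical placeholder proxies; replace with addresses from your provider.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]


def random_delay(min_s=2.0, max_s=6.0):
    """Pick a random pause length (seconds) so requests are not evenly spaced."""
    return random.uniform(min_s, max_s)


def fetch(url, proxy, timeout=30):
    """Download one URL through the given proxy and return the raw bytes."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, proxies, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL through the next proxy in the pool, pausing in between."""
    pool = itertools.cycle(proxies)  # round-robin over the proxy list
    pages = []
    for url in urls:
        proxy = next(pool)
        pages.append(fetcher(url, proxy))
        sleep(random_delay())
    return pages
```

The `fetcher` and `sleep` parameters are there so you can dry-run the rotation logic without touching the network, for example `crawl(urls, PROXIES, fetcher=lambda u, p: (u, p), sleep=lambda s: None)`.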

The Nohodo anonymization network provides a special offer for webscraping.pro customers.

3 replies on “Tips & Tricks for Scraping Business Directories”

Hello

Thanks for the great information here. I’m looking for a web scraper that will allow me to extract info from popup links, the ones that do not open in another tab, but can only be seen on the original page when you click on them. I have been using import.io originally for basic scraping of directories, but now on a different website it does not recognize the info.

I’d really appreciate it if you could suggest a scraper that may allow me to do this.

Thanks

Thomas

Thomas, I have two options in mind for you: VWR and Content Grabber. Read their reviews thoroughly and download them for free to make trial projects. They perform well on popups and AJAX loads.

Nice article. It really helped me a lot to learn about business directory scraping. Doing it manually was really hard for me. I have done some research on this & I think yellowscraper.com will also help some of the users who are looking for a good data scraping solution. Keep up the good work & I would like to get another post regarding scraping data from social media websites. Thanks.
