Tips & Tricks for Scraping Business Directories

Recently I received a question in my mailbox about scraping data aggregator sites (aka yellow pages), or business directories.
I replied to the sender directly, but our conversation about business directories was an interesting one, and I thought you guys would find it useful.

Here’s the question:

I am interested in scraping the database in such a website www.1881.no. My guess is that I would need a webdriver, like Selenium to do the job. I am very newbie to this field, but I believe if given some pointers, I can get some data out.

Could you please provide me with pointers on how to extract data from this website.
Sandeep

As a generic answer, I’ll provide you with some basics of scraping those business (and private life) directories.

Aggregators’ features

First of all, you need to be clear about the characteristics of data aggregators:

  • These services aggregate a huge amount of data, and the total is often hard to estimate. Most likely you will need to develop and run a dedicated script just to estimate how much data the site holds.
  • There are no predefined pages to crawl, so you have to query the site to fetch the data. An example query is http://www.1881.no/?query=car. You should therefore prepare a list of search terms, e.g. [car, pizza, home, etc.], to query the site with. Study up on GET and POST HTTP requests.
  • The data in these aggregators changes over time, so you should set up the scraper/crawler to detect outdated info. That usually involves a special algorithm and is thus much harder than a straightforward scrape.
  • These sites are especially vigilant about anti-scraping measures to avoid data leaks, so be ready for unexpected pitfalls and unbreakable firewalls. You might want to read some of my previous posts about anti-scraping tools to get a better understanding of them.
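
To make the querying point concrete, here is a minimal sketch of turning a list of search terms into GET URLs. It assumes the site accepts the term in a `query` parameter, as in the example URL above; the function name `build_query_urls` is my own, and a real site may well use different parameters or POST requests instead.

```python
from urllib.parse import urlencode

# Base URL and parameter name are taken from the example query above;
# treat both as assumptions for any other directory site.
BASE_URL = "http://www.1881.no/"


def build_query_urls(terms, base_url=BASE_URL, param="query"):
    """Return one GET URL per search term, with the term URL-encoded."""
    return [f"{base_url}?{urlencode({param: term})}" for term in terms]


if __name__ == "__main__":
    for url in build_query_urls(["car", "pizza", "home"]):
        print(url)
```

Encoding the term with `urlencode` keeps multi-word or non-ASCII search terms valid in the URL. The sketch only builds URLs; fetching them politely is a separate step.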

Scraping tips

  1. Because the amount of data on these sites is so huge, you’ll want to store it in an appropriate database. Setting up a DB ensures your data is easily accessible later.
  2. To remain undetected (unbanned) by such aggregators you’ll need to adopt these two scraping methods:
  3. Some off-the-shelf scraping tools are well suited to such fine and tedious tasks. You can look at an example of such a tool, one that comes with a free proxy network account for scraping business directories.
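
As an illustration of the database tip, here is a minimal sketch that stores scraped listings in SQLite (from Python's standard library) and keeps a content hash for each one, so a later re-scrape can tell whether a record has changed. The table name, column names and record fields are hypothetical, and hash comparison is just one simple way to spot outdated entries, not a full change-detection algorithm.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the listings table; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               listing_key TEXT PRIMARY KEY,
               payload     TEXT NOT NULL,
               fingerprint TEXT NOT NULL
           )"""
    )
    return conn


def fingerprint(record):
    """Hash a record in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_listing(conn, key, record):
    """Insert or update a listing; return 'new', 'changed' or 'unchanged'."""
    new_fp = fingerprint(record)
    row = conn.execute(
        "SELECT fingerprint FROM listings WHERE listing_key = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == new_fp:
        return "unchanged"
    conn.execute(
        "INSERT OR REPLACE INTO listings (listing_key, payload, fingerprint) "
        "VALUES (?, ?, ?)",
        (key, json.dumps(record, sort_keys=True, ensure_ascii=False), new_fp),
    )
    conn.commit()
    return "new" if row is None else "changed"


if __name__ == "__main__":
    conn = open_store()
    # Hypothetical record; real fields depend on the directory being scraped.
    print(save_listing(conn, "acme-pizza", {"name": "Acme Pizza", "phone": "000"}))
    print(save_listing(conn, "acme-pizza", {"name": "Acme Pizza", "phone": "111"}))
```

For a large scrape you would pass a file path to `open_store` instead of the in-memory default, or swap SQLite for a server database once the volume calls for it.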

The Nohodo anonymization network provides a special offer for webscraping.pro customers.

3 replies on “Tips & Tricks for Scraping Business Directories”

Hello

Thanks for the great information here. I’m looking for a web scraper that will allow me to extract info from popup links, the ones that do not open in another tab, but can only be seen on the original page when you click on them. I have been using import.io originally for basic scraping of directories, but now on a different website it does not recognize the info.

I’d really appreciate it if you could suggest a scraper that may allow me to do this.

Thanks

Thomas

Nice article. It really helped me a lot to learn about business directory scraping. Doing it manually was really hard for me. I have done some research on this & I think yellowscraper.com will also help some of the users who are looking for a good data scraping solution. Keep up the good work & I would like to get another post regarding scraping data from social media websites. Thanks.
