Most popular web scraping targets and how to scrape them

  1. Online marketplaces
    Sites where people offer their products for sale, much like garage sales but online (e.g. eCrater, www.1188.no).
    They are easy to scrape, since they are usually free to use and tend not to protect their data.
  2. Business directories
    Large online directories aimed at a general audience (e.g. Yellow Pages). They do protect their data, to prevent duplication and loss of audience. See some posts on this.

  3. Competitor companies
    They usually do not provide API access to their product data. If you, as a business owner, want to do competitor analysis, scraping their data is vital; it keeps you up to date with the market situation.
  4. Review sites (e.g. TripAdvisor). Heavily protected against scraping, although TripAdvisor does provide an API for limited-scale use. See how one leverages Node.js and Puppeteer to scrape TripAdvisor.
  5. Crowdfunding (e.g. Patreon). Requires a login to access the published projects. Consider the PHP code for logging in through a web form.
  6. Classifieds (e.g. Craigslist). Using proxies is necessary to stay undetected and avoid being banned when scraping these massive data directories.
  7. Job boards and online auctions (e.g. indeed.com, eBay). These sites keep a close eye on cookies and proxies and rely heavily on JavaScript. They require filling out forms, sending POST requests, etc. Read how to scrape JavaScript-protected content (JS-heavy sites).
  8. Search engines (e.g. Google, Bing). The hardest cases, since they are very wary of repeated automated use of their search results. Even though they crawl and scrape on a regular basis themselves, they are not willing to be scraped in turn. My suggestion is to turn to web services that provide SERP (search engine results page) scraping: DataFlowKit and Oxylabs.
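
Several of the targets above come down to the same mechanics: log in through a web form, keep the session cookies, send POST requests, and optionally route traffic through a proxy. Below is a minimal sketch of those steps using only the Python standard library (not the PHP code mentioned above). The field names `email` and `password`, the helper names, and any URL you pass in are assumptions for illustration; a real site will use its own field names and may add JavaScript checks that this sketch does not handle.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
import urllib.request


class FormFieldParser(HTMLParser):
    """Collect name/value pairs of every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_page_html, username, password,
                        user_field="email", pass_field="password"):
    """Build a URL-encoded POST body from the login page's own inputs.

    Hidden inputs (for example a CSRF token) are carried over unchanged.
    The default field names are hypothetical; inspect the real form.
    """
    parser = FormFieldParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)
    payload[user_field] = username
    payload[pass_field] = password
    return urlencode(payload).encode("utf-8")


def make_opener(proxy_url=None):
    """Create an opener that stores cookies and optionally uses a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


def login(opener, login_url, username, password):
    """Fetch the login page, then POST the filled-in form back to it.

    Assumes the form submits to the same URL it is served from.
    """
    with opener.open(login_url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    body = build_login_payload(html, username, password)
    # Supplying data makes urllib send a POST request.
    request = urllib.request.Request(login_url, data=body)
    return opener.open(request, timeout=30)
```

After a successful `login(...)`, later calls to `opener.open(...)` send the stored session cookies, so pages behind the login can be fetched with the same opener. Passing a different `proxy_url` to `make_opener` for each batch of requests is one simple way to spread traffic across proxies.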
