
My experience of a manual, no-code scrape of a bot-protected site

Recently we encountered a highly protected site — govets.com. Since the number of target brand items on the site was not big (under 3K), I decided to get the target data using handy tools for a quick manual scrape.

Test for anti-bot techniques

First of all, we tested the site for anti-bot techniques via a Discord server. The test was successful, and the following anti-bot measures were detected (a quick way to check for such signals yourself is sketched right after the list):

1. Cloudflare

  • Headers: cf-chl-gen, cf-ray, cf-mitigated
  • Server header: cloudflare

Detected on 116 URLs

2. Recaptcha

  • Script loaded: recaptcha/api.js
  • JavaScript Properties: window.grecaptcha, window.recaptcha
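For reference, here is a minimal Python sketch of how these signals could be spotted in a single response. It only checks for the headers and script names listed above; Cloudflare may well block a plain HTTP client, in which case the challenge response itself is the signal.

    # Sketch: check one page for the Cloudflare / reCAPTCHA signals listed above.
    import requests

    URL = "https://www.govets.com/"  # any page of the target site

    resp = requests.get(URL, timeout=30)

    # Cloudflare leaves its headers even on challenge (403/503) responses.
    cf_headers = [h for h in ("cf-chl-gen", "cf-ray", "cf-mitigated") if h in resp.headers]
    if cf_headers or resp.headers.get("server", "").lower() == "cloudflare":
        print("Cloudflare detected:", cf_headers, resp.headers.get("server"))

    body = resp.text.lower()
    if "recaptcha/api.js" in body or "grecaptcha" in body:
        print("reCAPTCHA detected")

    print("HTTP status:", resp.status_code)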

Tools for manual scrape

  1. Browser: Edge or Chrome. All the scraping is done using browser extensions. Note: Chrome extensions work perfectly in Edge.
  2. Bulk URL Opener addon for Chrome or Edge.
  3. Free VPN — Microsoft Edge Addon — a browser VPN for surfing with a USA IP.
  4. Easy Scraper — a free web scraping extension.
  5. Notepad++ — a free editing environment.
  6. Windows command line. Type cmd into the Windows Search Box and it will open [for joining the pool of scraped files into one].
  7. Online JSON to CSV converter.

1. Turn on VPN addon

Turn on the VPN browser addon with a US IP to be able to browse the US-restricted site. This let us open the govets.com site. Note: if you are using the Chrome browser, choose any free VPN addon alternative from a list.

2. Identify URLs pool

First we run an initial on-site search and get the number of results (2674). We scroll down a search results page and set the maximum Results per Page value. Besides, by hovering over a pagination link, e.g. 2, we find out the basic URL structure. See the figure below:

Now the URL pool (27 items) is the following:
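If you prefer to generate such a pool rather than copy it by hand, here is a minimal Python sketch. The base URL and the parameter names below (p, product_list_limit) are placeholders standing in for whatever the pagination links of the real site show; 2674 results at 100 per page gives the 27 pages.

    # Sketch: generate the paginated search-result URLs for Bulk URL Opener.
    # BASE_URL and the parameter names are assumptions -- use the structure
    # seen while hovering over the pagination links on the actual site.
    import math

    TOTAL_RESULTS = 2674
    PER_PAGE = 100
    BASE_URL = "https://www.govets.com/catalogsearch/result/?q=BRAND"  # placeholder

    pages = math.ceil(TOTAL_RESULTS / PER_PAGE)  # 27
    urls = [f"{BASE_URL}&p={page}&product_list_limit={PER_PAGE}"
            for page in range(1, pages + 1)]

    # Paste the list into Bulk URL Opener, one URL per line:
    print("\n".join(urls))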

3. Open the URL set in bulk

We open the Bulk URL Opener extension in the browser and load the 27 target URLs into it:

The gear button opens the list management functionality.

4. Scrape each page with the “Easy Scraper” addon

As we switch to each tab (with results), we run the “Easy Scraper” addon with a button click:

Now the extension is on, and it automatically identifies and gathers data for the 100 items present:

Note that the govets.com search results provide quite a good set of features for each product, incl. image, SKU, price and discount price; see the image to the left. If the resulting product data are not enough, we should first gather product URLs and then use the Easy Scraper addon in the “Scrape Details” mode. Otherwise we would have to develop a code scraper to get data off each product page. See the figure below:

For each Easy Scraper instance [that contains 100 results] we download the results either as CSV (US-style, comma separated) or JSON. Since my PC is set to a semicolon as the field separator (Europe mode), I preferred to download the data as [multiple] JSON files for later joining:

5. Join multiple [result] files into one

Is there one command that would join multiple files into one?

Linux users would definitely answer positively, yet Windows or Mac users would not be so sure. Luckily we’ve done it before and even made a post about it: Merge files in Windows CMD & in PowerShell. The line below easily joined the files:

cmd /c 'copy /y /b *.json govets-results.json'

But because of the JSON structure, the joined file contained specific characters (adjacent closing/opening brackets) that must be replaced with commas; see line 4 of the snippet below:

    "product-item-link": "Partition 10-15/16 In x 36-1/8 In MPN:P300-54",
    "price": "$58.49"
  }
][
  {
    "product href": "https://www.govets.com/lista-310-16974073.html",
    "product-image-photo src": "

Notepad++ did an excellent job replacing ][ with a comma (,), yielding perfect JSON.
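For those who would rather skip the concatenate-and-replace step, the same merge can be scripted. This is a minimal Python sketch, assuming each downloaded file is a JSON array (as the Easy Scraper exports above are) and sits in the current folder; the output filename is just an example.

    # Sketch: merge all Easy Scraper JSON exports into one valid JSON array,
    # avoiding the ][ artifact produced by byte-wise concatenation.
    import glob
    import json

    OUTPUT = "govets-results-merged.json"

    merged = []
    for path in sorted(glob.glob("*.json")):
        if path == OUTPUT:
            continue  # don't re-read our own output on a second run
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))  # each file is a list of product dicts

    with open(OUTPUT, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)

    print(f"{len(merged)} records merged")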

6. Turn JSON into CSV and remove duplicates

There are plenty of tools, incl. online ones, that convert JSON into CSV. I used one (convertcsv.com) that allows customising the field separator to the European-style semicolon.

Since we did an extensive scrape, we had better check for duplicates once the file is in Excel: Data —> Remove Duplicates. Voilà. The data are at hand.
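If you prefer to skip the online converter, the conversion and the duplicate check can be scripted as well. A minimal sketch, assuming the merged JSON array from the previous step; it writes a semicolon-separated CSV whose columns mirror whatever keys Easy Scraper exported.

    # Sketch: convert the merged JSON array to a semicolon-separated CSV
    # and drop exact duplicate rows along the way.
    import csv
    import json

    with open("govets-results-merged.json", encoding="utf-8") as f:
        records = json.load(f)

    # Collect every key that occurs in any record, preserving first-seen order.
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)

    seen = set()
    # utf-8-sig adds a BOM so Excel opens the file with the right encoding.
    with open("govets-results.csv", "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=";")
        writer.writeheader()
        for rec in records:
            row = tuple(rec.get(k, "") for k in fieldnames)
            if row in seen:
                continue  # skip exact duplicates
            seen.add(row)
            writer.writerow(dict(zip(fieldnames, row)))

    print(f"{len(seen)} unique rows written")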

Note: There might be need for some data modifications (eg. split, symbols removal, etc.) to be done inside of Excel.

Conclusion

The whole scrape process took me less than an hour, and I was quite happy with the results. With that in view, I consider this particular manual scrape case to have had a very good quality-to-price ratio.


There might be some obstacles or nuances that only scrape / data experts are able to overcome. So feel free to contact me for help or consultation: igor [dot] savinkin [at] gmail [dot] com

Telegram: igorsavinkin
