How to check if a target page loads data through XHR (Ajax)

When scraping the web, I first need to evaluate a site’s difficulty level: how hard will it be to scrape? In particular, do its pages make extra XHR (Ajax) calls? Based on that, I choose between (1) a request scraper (e.g. with Cheerio) and (2) a browser automation scraper (e.g. with Puppeteer).

For this I use the Apify Web Page Analyzer, a free scraper agent that analyzes a target site and returns comprehensive JSON data about the target web page. Whether the page issues XHR (Ajax) requests helps me decide what type of crawler to use for scraping that website.

Web Page Analyzer

Performs analysis of a webpage to figure out the best way to scrape its data. On input, it takes a URL and an array of strings to search for; on output, it returns a definition of a crawler.

The page analyzer searches the page for the strings given in its input. For a single (product) page analysis, though, I don’t need meaningful search terms: I only supply the product URL and get the output I need.

1. Clone the actor (Apify agent)

git clone https://github.com/apify/actor-page-analyzer.git

Note: you need Node.js and the Apify package installed on your PC, Mac, or Linux machine to run a Node.js Apify agent (actor).

2. Add input data

Inside the project folder go to src\apify_storage\key_value_stores\default

In that folder, create and save a file named INPUT.json. Supply the URL of an actual product page so that the scraper returns the analysis data. Example input:

{
    "url": "https://www.ozon.com/Sanimaster_PE80.html",
    "searchFor": [ "test" ]
}

Both parameters are required. The value of the second one, searchFor, is just a placeholder here.
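If you would rather generate INPUT.json from a script than type it by hand, something like the following should work. It is only a sketch: buildInput and writeInput are helper names I made up, not part of the actor, and the check that searchFor is an array of strings is my own assumption about what the actor expects.

```javascript
const fs = require("fs");
const path = require("path");

// Build the analyzer input. Both fields are required by the actor;
// the ["test"] default mirrors the placeholder used in the example above.
function buildInput(url, searchFor = ["test"]) {
  // new URL() throws on a malformed URL, which catches typos early.
  const parsed = new URL(url);
  // Assumption: searchFor should be an array of strings.
  if (!Array.isArray(searchFor) || !searchFor.every((s) => typeof s === "string")) {
    throw new Error("searchFor must be an array of strings");
  }
  return { url: parsed.href, searchFor };
}

// Hypothetical helper: write INPUT.json into the given key-value store folder.
function writeInput(storeDir, input) {
  fs.mkdirSync(storeDir, { recursive: true });
  const file = path.join(storeDir, "INPUT.json");
  fs.writeFileSync(file, JSON.stringify(input, null, 4));
  return file;
}

const input = buildInput("https://www.ozon.com/Sanimaster_PE80.html");
console.log(JSON.stringify(input, null, 4));
```

To save the file, call writeInput with the folder from step 2 as the first argument and the input object as the second.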

3. Run the actor (I did it locally)

Move to the folder src and run node index.js

After the scraper does its job, you’ll find an OUTPUT.json file in the same directory as INPUT.json.
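A quick way to peek at the result without opening the file in an editor is to load it with a few lines of Node.js. This is a minimal sketch: loadOutput is a hypothetical helper, and the relative folder assumes you run it from src, as in step 3.

```javascript
const fs = require("fs");
const path = require("path");

// Read OUTPUT.json from the store folder.
// Returns null if the run has not produced the file yet.
function loadOutput(storeDir) {
  const file = path.join(storeDir, "OUTPUT.json");
  if (!fs.existsSync(file)) return null;
  return JSON.parse(fs.readFileSync(file, "utf8"));
}

// Folder from step 2, relative to src; adjust it to your checkout.
const storeDir = path.join("apify_storage", "key_value_stores", "default");
const output = loadOutput(storeDir);

// Print the top-level keys so you can see which sections the analyzer produced.
console.log(output ? Object.keys(output) : "OUTPUT.json not found in " + storeDir);
```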

4. Analyze the OUTPUT.json

Scraper run log, shortened:

INFO: System info {"apifyVersion":"0.14.15","apifyClientVersion":"0.5.26","osType":"Windows_NT","nodeVersion":"v11.10.0"}
2021-01-22T11:59:31.378Z '0s' 'Loading data from input'
INFO: Launching Puppeteer {"stealth":true,"headless":true,"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"],"defaultViewport":{"width":1366,"height":768},"pipe":true}
================================
https://www.ozon.com/Sanimaster_PE80.html
================================
2021-01-22T11:59:32.283Z '0.91s' 'analysisStarted'
2021-01-22T11:59:32.472Z '0.19s' 'scrapping started'
2021-01-22T11:59:33.302Z '0.83s' 'initial response'
2021-01-22T11:59:33.305Z '0s' 'start of html: <!DOCTYPE html><html>...
2021-01-22T11:59:33.477Z '0.17s' 'metadata searched'
2021-01-22T11:59:33.483Z '0.01s' 'json-ld searched'
2021-01-22T11:59:33.562Z '0.08s' 'schema org searched'
2021-01-22T11:59:33.575Z '0.01s' 'initial html searched'
2021-01-22T11:59:37.292Z '3.72s' 'loaded'
2021-01-22T11:59:37.293Z '0s' 'requests'
2021-01-22T11:59:37.384Z '0.09s' 'xhrRequests searched'
2021-01-22T11:59:47.418Z '10.03s' 'html'
2021-01-22T11:59:47.497Z '0.08s' 'html searched'
2021-01-22T11:59:47.514Z '0.02s' 'window properties'
2021-01-22T11:59:47.516Z '0s' 'window properties searched'
done
2021-01-22T11:59:47.517Z '0s' 'scrapping finished'
2021-01-22T11:59:47.518Z '0s' 'Force write of output with await'
2021-01-22T11:59:47.538Z '0.02s' 'Analyzer finished'

OUTPUT.json
By reviewing that file I’ve discovered an XHR that calls the server to retrieve the product delivery time:

"xhrRequests": [
    ... 
    {
      "url": "https://www.ozon.com/availability3.jsp?ID=415-homa9806755&displayType=deliveryTime",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {
        "date": "Fri, 22 Jan 2021 11:59:33 GMT",
        "server": "Apache",
        "expires": "Thu, 01 Dec 1994 16:00:00 GMT",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "vary": "Accept-Encoding",
        "content-encoding": "gzip",
        "p3p": "CP=\"Legal: http://www.ozon.com/pages/legal-privacy/privacy.html\"",
        "content-length": "256",
        "content-type": "text/html;charset=UTF-8",
        "set-cookie": "sid3=159.148.84.159.1611316773006064; path=/; expires=Sun, 22-Jan-23 11:59:33 GMT\nJSESSIONID=836DF1B6B81A85564B57A8B7117CF573; Path=/; Secure; HttpOnly",
        "via": "1.1 www.ozon.com",
        "keep-alive": "timeout=10, max=97",
        "connection": "Keep-Alive"
      },
      "responseBody": "<!DOCTYPE html><html><head><title>www.ozon.com</title></head><body><table border=0 cellspacing=0 ><tr><td><span> 11 days</span></td></tr></table></body></html>"
    }
]
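When the xhrRequests array is long, a compact summary is easier to scan than the raw JSON. The sketch below assumes only the fields visible in the entry above (url, method, responseStatus, responseHeaders, responseBody); summarizeXhr is a name I made up, and the sample object is a trimmed copy of that entry, not a full OUTPUT.json.

```javascript
// Trimmed sample shaped like the xhrRequests entry shown above.
const sampleOutput = {
  xhrRequests: [
    {
      url: "https://www.ozon.com/availability3.jsp?ID=415-homa9806755&displayType=deliveryTime",
      method: "GET",
      responseStatus: 200,
      responseHeaders: { "content-type": "text/html;charset=UTF-8" },
      responseBody:
        "<!DOCTYPE html><html><head><title>www.ozon.com</title></head><body>" +
        "<table border=0 cellspacing=0 ><tr><td><span> 11 days</span></td></tr></table></body></html>",
    },
  ],
};

// Reduce each XHR entry to the facts that matter when judging a page.
function summarizeXhr(output) {
  return (output.xhrRequests || []).map((req) => {
    const { pathname, searchParams } = new URL(req.url);
    const query = {};
    searchParams.forEach((value, key) => {
      query[key] = value;
    });
    return {
      method: req.method,
      status: req.responseStatus,
      path: pathname,
      query,
      contentType: (req.responseHeaders || {})["content-type"] || null,
      bodyLength: (req.responseBody || "").length,
    };
  });
}

console.log(JSON.stringify(summarizeXhr(sampleOutput), null, 2));
```

The query parameters are often the most telling part: here displayType=deliveryTime already hints at what the call returns.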

5. Conclusion

At the bottom of the XHR entry there is a responseBody parameter. It contains an HTML table with the delivery value, 11 days. If I need that data while scraping the product pages, I’d surely need a browser automation scraper for the job.
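The same reasoning can be written down as a small check. This is a rough sketch, not part of the analyzer: firstSpanText and chooseScraper are hypothetical helpers, the initial HTML string is a stand-in for the page's first response, and the regex only suits a tiny, known fragment like this one.

```javascript
// Pull the text of the first <span> out of an HTML fragment.
// A regex is enough for this small response; use a real parser for anything larger.
function firstSpanText(html) {
  const match = /<span[^>]*>([\s\S]*?)<\/span>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Hypothetical helper: decide which scraper type a wanted value calls for.
// "request" if the value is already in the initial HTML,
// "browser" if it only shows up in an XHR response body,
// "unknown" if it was not found in either place.
function chooseScraper(initialHtml, xhrRequests, wanted) {
  if (initialHtml.includes(wanted)) return "request";
  const viaXhr = xhrRequests.some((req) => (req.responseBody || "").includes(wanted));
  return viaXhr ? "browser" : "unknown";
}

// responseBody copied from the XHR entry in OUTPUT.json.
const responseBody =
  "<!DOCTYPE html><html><head><title>www.ozon.com</title></head><body>" +
  "<table border=0 cellspacing=0 ><tr><td><span> 11 days</span></td></tr></table></body></html>";

// Stand-in for the page's initial HTML (assumption: it lacks the delivery time).
const initialHtml = "<html><body>product page</body></html>";

const delivery = firstSpanText(responseBody);
console.log(delivery); // 11 days
console.log(chooseScraper(initialHtml, [{ responseBody }], delivery)); // browser
```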
