When performing web scraping, I first need to evaluate a site’s difficulty level: how hard will it be to scrape? Do its pages make extra XHR (AJAX) calls? Based on that, I choose whether to use (1) a request scraper (e.g. Cheerio) or (2) a browser automation scraper (e.g. Puppeteer).
So, I’ve discovered the Apify Web Page Analyzer, a free scraper agent that analyzes a target site and returns comprehensive JSON data about the target web page. The presence (or absence) of XHR (AJAX) calls helps me decide what type of crawler to use for scraping that website.
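That decision can be sketched as a tiny helper. This is an illustrative snippet only: `chooseScraper` is a hypothetical function, and it assumes the analyzer’s output JSON exposes an `xhrRequests` array (as shown later in this post).

```javascript
// Hypothetical helper: if the analyzer reports XHR (AJAX) calls,
// the page likely renders data dynamically, so a browser automation
// scraper (e.g. Puppeteer) is needed; otherwise a plain request
// scraper (e.g. Cheerio) is enough.
function chooseScraper(analysis) {
  const xhr = analysis.xhrRequests || [];
  return xhr.length > 0
    ? 'browser automation (Puppeteer)'
    : 'request scraper (Cheerio)';
}

console.log(chooseScraper({ xhrRequests: [{ url: '/availability' }] }));
// browser automation (Puppeteer)
console.log(chooseScraper({ xhrRequests: [] }));
// request scraper (Cheerio)
```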
Web Page Analyzer
The page analyzer runs searches based on its input data, but for a single [product] page analysis I don’t need to provide search terms. I only supply the product URL and get the needed output.
1. Clone the actor (Apify agent)
git clone https://github.com/apify/actor-page-analyzer.git
Note: you need to install Node.js and the Apify package on your PC/Mac/Linux machine to be able to run a Node.js Apify agent/actor.
2. Add input data
Inside the project folder, go to src\apify_storage\key_value_stores\default
In that folder, create and save a file named INPUT.json
. Supply the URL of the actual product so that the scraper returns the analytic data. Example input data:
{
  "url": "https://www.ozon.com/Sanimaster_PE80.html",
  "searchFor": [ "test" ]
}
Both parameters are required. The second one, searchFor, is given a value here just for illustration.
3. Run the actor (I did it locally)
Move to the folder src and run node index.js
After the scraper does its job, you’ll find an OUTPUT.json file in the same directory as INPUT.json.
4. Analyze the OUTPUT.json
Scraper run log, shortened:
INFO: System info {"apifyVersion":"0.14.15","apifyClientVersion":"0.5.26","osType":"Windows_NT","nodeVersion":"v11.10.0"}
2021-01-22T11:59:31.378Z '0s' 'Loading data from input'
INFO: Launching Puppeteer {"stealth":true,"headless":true,"args":["--no-sandbox","--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"],"defaultViewport":{"width":1366,"height":768},"pipe":true}
================================
https://www.ozon.com/Sanimaster_PE80.html
================================
2021-01-22T11:59:32.283Z '0.91s' 'analysisStarted'
2021-01-22T11:59:32.472Z '0.19s' 'scrapping started'
2021-01-22T11:59:33.302Z '0.83s' 'initial response'
2021-01-22T11:59:33.305Z '0s' 'start of html: <!DOCTYPE html><html>...
2021-01-22T11:59:33.477Z '0.17s' 'metadata searched'
2021-01-22T11:59:33.483Z '0.01s' 'json-ld searched'
2021-01-22T11:59:33.562Z '0.08s' 'schema org searched'
2021-01-22T11:59:33.575Z '0.01s' 'initial html searched'
2021-01-22T11:59:37.292Z '3.72s' 'loaded'
2021-01-22T11:59:37.293Z '0s' 'requests'
2021-01-22T11:59:37.384Z '0.09s' 'xhrRequests searched'
2021-01-22T11:59:47.418Z '10.03s' 'html'
2021-01-22T11:59:47.497Z '0.08s' 'html searched'
2021-01-22T11:59:47.514Z '0.02s' 'window properties'
2021-01-22T11:59:47.516Z '0s' 'window properties searched'
done
2021-01-22T11:59:47.517Z '0s' 'scrapping finished'
2021-01-22T11:59:47.518Z '0s' 'Force write of output with await'
2021-01-22T11:59:47.538Z '0.02s' 'Analyzer finished'
OUTPUT.json
By reviewing that file, I discovered an XHR that makes a call to the server to retrieve the product delivery time:
"xhrRequests": [
...
{
"url": "https://www.ozon.com/availability3.jsp?ID=415-homa9806755&displayType=deliveryTime",
"method": "GET",
"responseStatus": 200,
"responseHeaders": {
"date": "Fri, 22 Jan 2021 11:59:33 GMT",
"server": "Apache",
"expires": "Thu, 01 Dec 1994 16:00:00 GMT",
"pragma": "no-cache",
"cache-control": "no-cache",
"vary": "Accept-Encoding",
"content-encoding": "gzip",
"p3p": "CP=\"Legal: http://www.ozon.com/pages/legal-privacy/privacy.html\"",
"content-length": "256",
"content-type": "text/html;charset=UTF-8",
"set-cookie": "sid3=159.148.84.159.1611316773006064; path=/; expires=Sun, 22-Jan-23 11:59:33 GMT\nJSESSIONID=836DF1B6B81A85564B57A8B7117CF573; Path=/; Secure; HttpOnly",
"via": "1.1 www.ozon.com",
"keep-alive": "timeout=10, max=97",
"connection": "Keep-Alive"
},
"responseBody": "<!DOCTYPE html><html><head><title>www.ozon.com</title></head><body><table border=0 cellspacing=0 ><tr><td><span> 11 days</span></td></tr></table></body></html>"
}
]
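Once that responseBody is in hand, extracting the value is straightforward. A minimal sketch, using a simple regex on the `<span>` content for illustration (a real scraper might use Cheerio here instead):

```javascript
// Sketch: pull the delivery time out of the XHR's responseBody,
// which is a small HTML fragment containing a single <span>.
function extractDeliveryTime(responseBody) {
  const match = responseBody.match(/<span>\s*([^<]+?)\s*<\/span>/);
  return match ? match[1] : null;
}

// The responseBody from the OUTPUT.json above:
const body =
  '<!DOCTYPE html><html><head><title>www.ozon.com</title></head>' +
  '<body><table border=0 cellspacing=0 ><tr><td><span> 11 days</span>' +
  '</td></tr></table></body></html>';
console.log(extractDeliveryTime(body)); // 11 days
```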
5. Conclusion
At the bottom of the XHR entry there is a responseBody parameter. It contains an HTML table with the 11 days
delivery value. If I need that data while scraping the product pages, I’d surely need a browser automation scraper to work on that.