
Puppeteer async scraper with the number of browsers tuned to CPU capacity

Recently we got a tricky, dynamic-content website to scrape. The data are loaded through XHRs into different parts of the DOM (HTML markup). So, the task was to develop an effective scraper that works asynchronously while using reasonable CPU resources.

Given

  • Over 160K parcel IDs for querying the government data website https://loraincountyauditor.com/gis/report/Report.aspx?pin=0400014105002 (use US proxies to view it).

What to get

  • Tax info from the dynamic Taxes area of the report page, for each input parcel.

RAM & CPU challenge in browser automation

The website is a dynamic one, requiring browser automation to scrape it. Yet the challenge is often how to handle CPU and RAM in order to have a successful scrape. So, as you launch the code, monitor the CPU and RAM usage using Task Manager.

The script is designed so that we regulate the number of browsers manually (by the 2nd launch parameter, see the next section) and thus tune the load to the machine's CPU capacity.
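
The gist of the approach is to split the parcel list into as many chunks as there are browsers and scrape the chunks concurrently. Below is a minimal sketch of that idea, not the exact repo code; the function and variable names here are illustrative:

const puppeteer = require("puppeteer");

// Scrape one chunk of parcel IDs in its own browser instance.
async function scrapeChunk(parcelIds, headless) {
  const browser = await puppeteer.launch({ headless });
  const page = await browser.newPage();
  for (const id of parcelIds) {
    // ... open the report page for `id` and extract the Taxes table ...
  }
  await browser.close();
}

// Split the parcel list into `browsersCount` chunks and run them concurrently.
async function run(parcelIds, browsersCount, headless) {
  const size = Math.ceil(parcelIds.length / browsersCount);
  const chunks = [];
  for (let i = 0; i < parcelIds.length; i += size) {
    chunks.push(parcelIds.slice(i, i + size));
  }
  await Promise.all(chunks.map((chunk) => scrapeChunk(chunk, headless)));
}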

The launch parameters, service file and directory

node browsers-async.js 0 2 20000

The script launch is straightforward; we just pass the parameters (a sketch of how they might be parsed follows the list):

  • 0 – headless browser; any value other than 0 – headful browser, i.e. one that opens a full-sized browser GUI.
  • 2 – number of browsers for asynchronous work.
  • 20000 – timeout value, the maximum waiting time (in milliseconds) to find an HTML table data selector.
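
Under the hood the parameters might be read roughly like this (a sketch; only $selector_timeout appears verbatim in the snippet further below, the other names are illustrative):

const args = process.argv.slice(2); // everything after "node browsers-async.js"

const headless = args.length > 0 ? args[0] === "0" : true;                  // 0 => headless, anything else => headful
const browsers_count = args.length > 1 ? parseInt(args[1], 10) : 2;         // number of parallel browsers
const $selector_timeout = args.length > 2 ? parseInt(args[2], 10) : 20000;  // max selector wait, ms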

Note, the input data (parcel IDs) are taken from an input file, parsels.txt (a sample is attached below). The output result files are written into a separate directory named data; you need to create that folder in the project root.
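
For reference, reading the IDs and checking the output folder could look like this (a sketch, assuming one parcel ID per line in parsels.txt):

const fs = require("fs");

// one parcel ID per line in parsels.txt
const parcels = fs.readFileSync("parsels.txt", "utf8")
  .split(/\r?\n/)
  .map((line) => line.trim())
  .filter(Boolean);

// the per-parcel results go into ./data, which must exist in the project root
if (!fs.existsSync("data")) {
  console.error("Please create the data/ folder in the project root.");
  process.exit(1);
}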

Error handling

The website uses dynamic HTML markup, so the main cause of errors is that the data are absent, even though we set a timeout for waiting and the browser scrolls the page down.

For the wait we use the $selector_wait flag together with the timeout. If the server still doesn't provide any data within the timeout, we need to handle the resulting Timeout error. $selector_timeout contains the maximum waiting time, in milliseconds, to find the HTML table data selector. One may set it through the script's console launch parameters.

const $selector_timeout = args.length > 2 ? args[2] : 20000; // default "selector wait" timeout, ms
const tax_selector_check = "tbody[data-tableid=\"Taxes0\"] > tr[class=\"report-row\"] > td";
const tax_selector = "tbody[data-tableid=\"Taxes0\"]";
...
let $selector_wait; // stays undefined on success, set to 0 on a failed wait
await page.waitForSelector(tax_selector_check, { timeout: $selector_timeout })
  .then(() => {
    //console.log("Success with selector.");
  })
  .catch((err) => {
    console.log("Failure to get tax info HTML element. ERR:", err);
    failed_pool.push(chunk[i]); // remember the failed parcel ID
    dump(failed_pool);          // persist the failed list
    $selector_wait = 0;
  });

if ($selector_wait === 0) {
  continue; // skip this parcel and move on to the next one
}
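
Once the selector is present, the rows of the Taxes table can be read; the extraction might look roughly like this (a sketch, the exact field mapping in the repo may differ):

// collect the cell texts of every report row in the Taxes table
const taxRows = await page.$$eval(
  "tbody[data-tableid=\"Taxes0\"] > tr[class=\"report-row\"]",
  (rows) => rows.map((row) =>
    Array.from(row.querySelectorAll("td")).map((td) => td.textContent.trim())
  )
);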

Proxying

Since the website is restricted to US IPs, we use the ProtonVPN service to make the requests appear to originate from the US. The ProtonVPN application is activated at the OS level (Windows in our case), so no code changes are required.
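
If an OS-level VPN were not an option, a US proxy could instead be passed straight to Puppeteer at launch. A minimal sketch of that alternative (the proxy address is a placeholder; this is not what the setup described above uses):

const puppeteer = require("puppeteer");

async function launchWithProxy() {
  // route all browser traffic through a US proxy (placeholder address)
  return puppeteer.launch({
    args: ["--proxy-server=http://us-proxy.example.com:8080"],
  });
}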

Load timeout

The data load timeout makes the browser wait, more or less, until the needed tax info data are loaded. It's the third launch parameter, see above. We've played around with that value to reach the best performance: depending on the load timeout, the scraper's success rate varied. See the table below:

Dynamic data load timeout, seconds | Performance rate (successful vs. all requests)
10 | 80%
20 | 95%
30 | 97%
Note, the figures pertain to this specific website.

Data post-processing

Since we scrape each parcel into a separate JSON file, we need to concatenate the data. The easiest way is to use a Windows PowerShell one-liner:

Get-Content *.json | Set-Content result.json

We then append a comma to the end of each line (except the last one) and wrap the whole content in [ ], using Notepad++ for that. Thus we get valid JSON:

[
{ "Full Year Tax":"145,26$"… },
{ "Full Year Tax":"139,00$"… },
{ "Full Year Tax":"100,00$"… }
]
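
Alternatively, the merge could be done with a short Node.js script instead of manual editing; a minimal sketch, assuming the per-parcel files sit in the data directory:

const fs = require("fs");
const path = require("path");

// read every per-parcel *.json file from ./data and write one valid JSON array
const dir = "./data";
const records = fs.readdirSync(dir)
  .filter((name) => name.endsWith(".json"))
  .map((name) => JSON.parse(fs.readFileSync(path.join(dir, name), "utf8")));

fs.writeFileSync("result.json", JSON.stringify(records, null, 2));
console.log("Merged " + records.length + " files into result.json");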

The scraped data issues

Sometimes the website returns an empty data table, with no tax rows at all.

So parcels with missing data, as well as parcels that hit the Timeout error, are discarded from the output but saved into the failed-parcels.txt file.
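
The dump() call in the error handler above could be as simple as rewriting that file on every failure; a possible shape (an assumption, not necessarily the repo's exact code):

const fs = require("fs");

// keep failed-parcels.txt in sync with the in-memory list of failed parcel IDs
function dump(failed_pool) {
  fs.writeFileSync("failed-parcels.txt", failed_pool.join("\n"), "utf8");
}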

Whole code

The code is in a public GitHub repo; please clone or fork it to give it a try.
