Cheerio.js, get items from html table into object

Suppose there is a table like below (1 info row only):

Minute (BPM)
Speed (RPM) Power, PSI Flow, PSI
Tool Sys
0-2500 0-250 1.8 HP 2.6-13.2 GPM SDS Max

How to scrape it using cheerio.js as a parser?

Case 1 (1 row only)


Redirect Node.js console output into file

node.exe index.js > scrape.log 2>&1

When executing file index.js we redirect all the console.log() output from console into a file scrape.log .


Remove empty html tags recursively

Sometimes we have the code with html tags that contain nothing but whitespace characters. Often those tags are nested. See a code below:


What regex might be used to find and remove those tags?

Obvious solution is <div>\s*?<\/div> .

\s stands for “whitespace character”. It includes [ \t\n\x0B\f\r]. That is: \s matches a space( ) or a tab (\t) or a line(\n) break or a vertical tab (\x0B) sometimes referred as (\v) or a form feed (\f) or a carriage return (\r) .

General case

In general case, we use the following regex:
<(?<tag>[a-z]+?)( [^>]+?|)>\s*?<\/(\k<tag>)>

where <tag> is a named match group: [a-z]+?

JAVA code

When applying it recursively we might use the following code, JAVA:

public static String removeEmptyTags(String html) {
        boolean compareFound = true;
        Pattern pattern = Pattern.compile("<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>", Pattern.MULTILINE | Pattern.DOTALL);
        while (compareFound) {
            compareFound = false;
            Matcher matcher = pattern.matcher(html);
            if(matcher.find()) {
                compareFound = true;
                html = matcher.replaceAll("");
        return html;

Simple JAVA scraper that handles user-agent, headers and cookies

How to handle cookie, user-agent, headers when scraping with JAVA? We’ll use for this a static class ScrapeHelper that easily handles all of this. The class uses Jsoup library methods to fetch from data from server and parse html into DOM document.


Map(), lambda() functions for 2-d arrays

Suppose we’ve a following array:

arr = [[ 5.60241616e+02,  1.01946349e+03,  8.61527813e+01],
 [ 4.10969632e+02 , 9.77019409e+02 , -5.34489688e+01],
 [ 6.10031512e+02, 9.10689615e+01, 1.45066095e+02 ]]

How to print it with rounded elements using map() and lamba() functions?

l = list(map(lambda i: list(map(lambda j: round(j, 2), i)), arr))

The result will be the following:

[[560.24, 1019.46, 86.15], 
 [410.97, 977.02, -53.45], 
 [610.03, 91.07, 145.07]]
Development Featured Review Web Scraping Software

Sequentum Enterprise review

Sequentum Enterprise is a powerful, multi-featured enterprise data pipeline platform and web data extraction solution. Sequentum’s CEO Sarah Mckenna doesn’t like to call it web scraping because, in its description, the web scraping refers to many different types of unmanaged and non-compliant techniques for obtaining web-based datasets. 


How to print out requestQueue info (Apify) at run time

The docs on requestQueue.getInfo().

After some unsuccessful tries I could have managed to get the requestQueue info output. Note, we run the function inside the Apify runtime environment:

Apify.main(async () => { ... }

Solution 1

We make the function async and add await to the getInfo() Promise call:

async function printRequestQueue (requestQueue){
   var { totalRequestCount, handledRequestCount, pendingRequestCount } = await requestQueue.getInfo();
   console.log(`Request Queue info:` );
   console.log(' - handled :', handledRequestCount);
   console.log(' - pending :', pendingRequestCount);
   console.log(' - total:'  , totalRequestCount); 

with the following result:

Request Queue info:
 - handled : 479
 - pending : 312
 - total: 791

Solution 2, using then/catch

In this case we do not need to make our function async since we catch the the getInfo() promise result thru .then(response).

function printRequestQueue (requestQueue){ 
  requestQueue.getInfo().then((response)=> { 
    console.log('total:', response.totalRequestCount); 
    console.log('handled:', response.handledRequestCount);
    console.log('pending:', response.pendingRequestCount);  
    console.log('\nFull response:\n', response); })
 .catch( (error) => console.log(error)); 

with the following result:

total: 791
handled: 479
pending: 312

Full response:
 { id: 'queue-name',
  name: 'queue-name',
  userId: null,
  createdAt: 2021-02-26T11:57:00.453Z,
  modifiedAt: 2021-02-26T11:58:47.988Z,
  accessedAt: 2021-02-26T11:58:47.989Z,
  totalRequestCount: 791,
  handledRequestCount: 479,
  pendingRequestCount: 312 

Node.js Cheerio scraper, replace element

let table = $('table');
if ($(table).has('br')) {  				     
    $("br").replaceWith(" ");

DOM selector excluding certain elements

Often we need to select certain html DOM elements excluding ones with certain names/ attributes/ attribute values. Let’s show how to do that.

Data Mining Development

Linear models, Sklearn.linear_model, Classification

In this post we’ll show how to build classification linear models using the sklearn.linear.model module.

The code as an IPython notebook