Categories
Development

Simple Java scraper that handles user-agent, headers and cookies

How do you handle cookies, the user-agent, and headers when scraping with Java? For this we’ll use a static class, ScrapeHelper, that takes care of all of them. The class uses Jsoup library methods to fetch data from the server and parse the HTML into a DOM document.

Categories
Challenge Data Science

Linear regression in example: overfitting and regularization

In this post we will set up a linear model to predict the number of bike rentals from the calendar characteristics of the day and the weather conditions. We will choose the feature weights so that the model captures the linear dependencies in the data while ignoring superfluous features. This way the model will not overfit and will make fairly accurate predictions on new data.

We’ll also interpret the linear dependencies we find, that is, check whether each discovered pattern agrees with common sense. The main purpose of the task is to show and explain by example what causes overfitting and how to overcome it.
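To make the idea of regularization concrete, here is a minimal toy sketch in plain Python. It is not the code from the post's notebook: the data, the function names, and the simplifying assumption that the two feature columns are orthogonal (so each weight can be solved independently) are all illustrative. The L2 (ridge) penalty shrinks every weight, while the L1 (lasso) penalty sets the weight of the nearly useless feature to exactly zero.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return 0.0 if |value| <= alpha."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def fit_orthogonal(features, y, alpha=0.0, penalty="none"):
    """Fit one weight per feature column (no intercept).

    Assumes the columns are mutually orthogonal, so each weight has a
    closed-form solution that does not depend on the others.
    Objective: 0.5 * sum((y - Xw)^2) + penalty term, where the penalty is
    alpha * sum(|w|) for "l1" and 0.5 * alpha * sum(w^2) for "l2".
    """
    weights = []
    for column in features:
        sxy = sum(x * t for x, t in zip(column, y))
        sxx = sum(x * x for x in column)
        if penalty == "l1":
            weights.append(soft_threshold(sxy, alpha) / sxx)
        elif penalty == "l2":
            weights.append(sxy / (sxx + alpha))
        else:
            weights.append(sxy / sxx)
    return weights


# Hypothetical toy data: the target is 3 * signal + 0.1 * extra.
signal = [1.0, -1.0, 1.0, -1.0]
extra = [1.0, 1.0, -1.0, -1.0]
y = [3.1, -2.9, 2.9, -3.1]

ols = fit_orthogonal([signal, extra], y)
ridge = fit_orthogonal([signal, extra], y, alpha=1.0, penalty="l2")
lasso = fit_orthogonal([signal, extra], y, alpha=1.0, penalty="l1")

print("no penalty:", ols)
print("L2 penalty:", ridge)
print("L1 penalty:", lasso)
```

The penalty strength `alpha` here is defined by the objective in the docstring and is not necessarily scaled the same way as the `alpha` parameter of a particular library.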

The code as an IPython notebook

Categories
Development

map() and lambda functions for 2-D arrays

Suppose we have the following array:

arr = [[ 5.60241616e+02,  1.01946349e+03,  8.61527813e+01],
 [ 4.10969632e+02 , 9.77019409e+02 , -5.34489688e+01],
 [ 6.10031512e+02, 9.10689615e+01, 1.45066095e+02 ]]

How do we print it with rounded elements using map() and lambda functions?

l = list(map(lambda i: list(map(lambda j: round(j, 2), i)), arr))
print(l)

The result will be the following:

[[560.24, 1019.46, 86.15], 
 [410.97, 977.02, -53.45], 
 [610.03, 91.07, 145.07]]
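For comparison, the same rounding can be written as a nested list comprehension, which many readers find easier to follow than nested map() calls. This is a small sketch; the helper name `round_rows` is made up for illustration.

```python
arr = [[5.60241616e+02, 1.01946349e+03, 8.61527813e+01],
       [4.10969632e+02, 9.77019409e+02, -5.34489688e+01],
       [6.10031512e+02, 9.10689615e+01, 1.45066095e+02]]


def round_rows(rows, ndigits=2):
    """Return a new 2-D list with every element rounded to ndigits places."""
    return [[round(value, ndigits) for value in row] for row in rows]


rounded = round_rows(arr)
# Print one row per line, matching the layout shown above.
for row in rounded:
    print(row)
```

Both versions build a new list and leave `arr` unchanged.
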
Categories
Development Featured Review Web Scraping Software

Sequentum Enterprise review

Sequentum Enterprise is a powerful, multi-featured enterprise data pipeline platform and web data extraction solution. Sequentum’s CEO, Sarah McKenna, prefers not to call it web scraping because, as she describes it, that term covers many different unmanaged and non-compliant techniques for obtaining web-based datasets.

Categories
Data Science

Sklearn, Classification and Regression metrics

In this post we will review a number of metrics for evaluating classification and regression models, using functions from the sklearn library. We’ll learn how to generate model data, how to train linear models, and how to evaluate their quality.
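As a quick reminder of what such metrics measure, the sketch below computes several of them by hand in plain Python. The sklearn.metrics module offers ready-made functions for the same quantities (accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error, r2_score); which of them the notebook uses is an assumption here, and the labels and predictions are made-up toy values.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error and R^2."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy data, chosen only for illustration.
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(clf)
print(reg)
```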

The code as an IPython notebook

Categories
Data Science

Linear models, Sklearn.linear_model, Regression

In this post we’ll show how to build linear regression models using the sklearn.linear_model module.

See also the post on linear classification models using the sklearn.linear_model module.

The code as an IPython notebook

Categories
Development

How to print out requestQueue info (Apify) at run time

The docs on requestQueue.getInfo().

After several unsuccessful attempts I managed to get the requestQueue info output. Note that we run the function inside the Apify runtime environment:

Apify.main(async () => { ... });

Solution 1

We make the function async and add await to the getInfo() Promise call:

async function printRequestQueue(requestQueue) {
    // getInfo() returns a Promise, so we await it before destructuring the counters
    const { totalRequestCount, handledRequestCount, pendingRequestCount } = await requestQueue.getInfo();
    console.log('Request Queue info:');
    console.log(' - handled :', handledRequestCount);
    console.log(' - pending :', pendingRequestCount);
    console.log(' - total:', totalRequestCount);
}

with the following result:

Request Queue info:
 - handled : 479
 - pending : 312
 - total: 791

Solution 2, using then/catch

In this case we do not need to make our function async, since we handle the result of the getInfo() promise through .then(response).

function printRequestQueue(requestQueue) {
    requestQueue.getInfo()
        .then((response) => {
            console.log('total:', response.totalRequestCount);
            console.log('handled:', response.handledRequestCount);
            console.log('pending:', response.pendingRequestCount);
            console.log('\nFull response:\n', response);
        })
        .catch((error) => console.log(error));
}

with the following result:

total: 791
handled: 479
pending: 312

Full response:
 { id: 'queue-name',
  name: 'queue-name',
  userId: null,
  createdAt: 2021-02-26T11:57:00.453Z,
  modifiedAt: 2021-02-26T11:58:47.988Z,
  accessedAt: 2021-02-26T11:58:47.989Z,
  totalRequestCount: 791,
  handledRequestCount: 479,
  pendingRequestCount: 312 
}
Categories
Development

Node.js Cheerio scraper, replace element

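let table = $('table');
// .has() returns a Cheerio object, which is always truthy, so test its length
if (table.has('br').length) {
    // replace only the <br> elements inside the table with a space
    table.find('br').replaceWith(' ');
}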
Categories
Development

DOM selector excluding certain elements

Often we need to select certain HTML DOM elements while excluding those with certain names, attributes, or attribute values. Let’s show how to do that.
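As a rough illustration of the three kinds of exclusion, here is a sketch using Python's standard xml.etree.ElementTree on a small well-formed fragment. The markup and variable names are invented for the example, and the post itself may take a different approach; the comments show the CSS :not() selector that expresses the same filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this example.
html = """<div>
  <a href="/home" class="nav">Home</a>
  <a href="/ads" class="ad">Sponsored</a>
  <a name="top">Anchor</a>
  <a href="/post" class="nav" rel="nofollow">Post</a>
  <span class="nav">Not a link</span>
</div>"""

root = ET.fromstring(html)
links = root.findall("a")  # direct <a> children of the <div>

# Exclude by element name; CSS equivalent: div > :not(a)
not_links = [el for el in root if el.tag != "a"]

# Exclude elements that carry a given attribute; CSS: a:not([rel])
without_rel = [a for a in links if "rel" not in a.attrib]

# Exclude elements with a given attribute value; CSS: a:not([class="ad"])
not_ads = [a for a in links if a.get("class") != "ad"]

print([el.text for el in not_links])
print([a.text for a in without_rel])
print([a.text for a in not_ads])
```

Note that ElementTree needs well-formed XML, so real-world HTML usually has to go through a tolerant HTML parser before it can be filtered this way.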

Categories
Data Science Development

Linear models, Sklearn.linear_model, Classification

In this post we’ll show how to build linear classification models using the sklearn.linear_model module.

The code as an IPython notebook