Let’s answer the question: how do a model’s parameters affect its quality, and how can we select the optimal parameters for the task at hand? We will look at the grid_search module in the sklearn library and learn how to select model parameters from a grid.
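Before turning to sklearn, the idea behind grid search can be sketched in pure Python. The scoring function below is a hypothetical stand-in, purely for illustration; sklearn’s grid search performs the same exhaustive search using cross-validated model quality instead:

```python
import itertools

# Toy illustration of grid search: try every parameter combination
# and keep the one with the best score.
param_grid = {"alpha": [0.1, 1.0, 10.0], "degree": [1, 2, 3]}

def score(params):
    # Hypothetical stand-in for cross-validated model quality;
    # peaks at alpha=1.0, degree=2.
    return -(params["alpha"] - 1.0) ** 2 - (params["degree"] - 2) ** 2

# Build every combination of the parameter values.
combos = [dict(zip(param_grid, values))
          for values in itertools.product(*param_grid.values())]

best = max(combos, key=score)
print(best)  # {'alpha': 1.0, 'degree': 2}
```

The nested loop over all combinations is exactly what makes grid search expensive: the number of candidates is the product of the grid sizes.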
Month: March 2021
“Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution.”
Let’s find the maximum likelihood estimate of the parameter p from such a sample.
1) Likelihood function:
L(Xn, p) = ∏ p^[Xi=1] * (1−p)^[Xi=0] = p^4 * (1−p)^11, where [·] denotes the indicator of the event.
2) We find the maximum likelihood estimate for the parameter p.
Taking the logarithm of L(Xn, p), we get:
ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)
3) Now we take its derivative and set it to zero to find p:
[4*ln(p) + 11*ln(1−p)]′ = 4*(ln(p))′ + 11*(ln(1−p))′ = 4/p − 11/(1−p) = 0
Hence: 4/p = 11/(1−p) => 4(1−p) = 11p => 15p = 4 => p = 4/15 ≈ 0.2667.
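As a sanity check, we can maximize the log-likelihood numerically over a grid of candidate p values; this is a small sketch using only the standard library:

```python
import math

# Log-likelihood for 4 successes out of 15 Bernoulli trials:
# ln L(p) = 4*ln(p) + 11*ln(1-p)
def log_likelihood(p):
    return 4 * math.log(p) + 11 * math.log(1 - p)

# Evaluate on a fine grid over (0, 1) and take the argmax.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)
print(round(p_hat, 4))  # close to 4/15 ≈ 0.2667
```

The numerical argmax agrees with the analytic answer p = 4/15.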
User-Agents by browsers
We attach here a link to the User-Agents used by the most popular browsers. In total there are over 1600 of them.
- Internet Explorer
- Firefox
- Chrome
- Safari
- Opera
node.exe index.js > scrape.log 2>&1
When executing index.js, this redirects stdout (where console.log() writes) into the file scrape.log, and 2>&1 sends stderr there as well.
Remove empty html tags recursively
Sometimes we have code with HTML tags that contain nothing but whitespace characters. Often those tags are nested. See the code below:
<div>
<div>
<div></div>
</div>
</div>
What regex might be used to find and remove those tags?
The obvious solution is <div>\s*?<\/div>.
Here \s stands for “whitespace character” and includes [ \t\n\x0B\f\r]. That is, \s matches a space, a tab (\t), a line break (\n), a vertical tab (\x0B, sometimes referred to as \v), a form feed (\f), or a carriage return (\r).
General case
In the general case, we use the following regex: <(?<tag>[a-z]+?)( [^>]+?|)>\s*?<\/(\k<tag>)>
where (?<tag>...) is a named capture group matching [a-z]+?, and \k<tag> is a backreference to it, so the closing tag must match the opening one.
Java code
When applying it recursively we might use the following Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeEmptyTags(String html) {
    boolean compareFound = true;
    // DOTALL lets \s*? span line breaks between the tags.
    Pattern pattern = Pattern.compile(
            "<(?<tag>[a-z]+?)( [^>]+?|)>\\s*?</(\\k<tag>)>",
            Pattern.MULTILINE | Pattern.DOTALL);
    while (compareFound) {
        compareFound = false;
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            compareFound = true;
            html = matcher.replaceAll("");
        }
    }
    return html;
}
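For comparison, the same recursive removal can be sketched in Python. Note that Python’s re module writes named groups as (?P<tag>...) and backreferences as (?P=tag) rather than Java’s (?<tag>...) and \k<tag>:

```python
import re

# Same pattern as the Java version, in Python's named-group syntax.
# DOTALL lets \s*? span line breaks between the tags.
pattern = re.compile(r"<(?P<tag>[a-z]+?)( [^>]+?|)>\s*?</(?P=tag)>",
                     re.MULTILINE | re.DOTALL)

def remove_empty_tags(html):
    # Repeat until a pass removes nothing, so that tags which become
    # empty after their children are removed get cleaned up too.
    while True:
        new_html = pattern.sub("", html)
        if new_html == html:
            return html
        html = new_html

cleaned = remove_empty_tags("<div>\n  <div>\n    <div></div>\n  </div>\n</div>")
print(repr(cleaned))  # ''
```

Each pass strips the innermost empty tags; the loop terminates once the string stops changing.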
How do we handle cookies, the user-agent, and headers when scraping with Java? We’ll use a static class ScrapeHelper
that easily handles all of this. The class uses Jsoup library methods to fetch data from the server and parse the HTML into a DOM document.
We’ll also interpret the linear dependencies we find, that is, check whether the discovered pattern corresponds to common sense. The main purpose of the task is to show and explain, by example, what causes overfitting and how to overcome it.
The code as an IPython notebook
Suppose we have the following array:
arr = [[ 5.60241616e+02, 1.01946349e+03,  8.61527813e+01],
       [ 4.10969632e+02, 9.77019409e+02, -5.34489688e+01],
       [ 6.10031512e+02, 9.10689615e+01,  1.45066095e+02]]
How do we print it with rounded elements using the map() function and lambda expressions?
l = list(map(lambda i: list(map(lambda j: round(j, 2), i)), arr))
print(l)
The result will be the following:
[[560.24, 1019.46, 86.15],
[410.97, 977.02, -53.45],
[610.03, 91.07, 145.07]]
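An equivalent and arguably more readable alternative (not from the original post, just a suggestion) is a nested list comprehension:

```python
arr = [[ 5.60241616e+02, 1.01946349e+03,  8.61527813e+01],
       [ 4.10969632e+02, 9.77019409e+02, -5.34489688e+01],
       [ 6.10031512e+02, 9.10689615e+01,  1.45066095e+02]]

# Round every element of every row to 2 decimal places.
rounded = [[round(x, 2) for x in row] for row in arr]
print(rounded)  # same result as the map/lambda version
```

The comprehension avoids the two levels of lambda nesting while producing the same nested list.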

Sequentum Enterprise is a powerful, multi-featured enterprise data pipeline platform and web data extraction solution. Sequentum’s CEO Sarah Mckenna doesn’t like to call it web scraping because, in her description, web scraping refers to many different types of unmanaged and non-compliant techniques for obtaining web-based datasets.
In this post we review a number of metrics for evaluating classification and regression models, using functions from the sklearn library. We’ll learn how to generate model data, how to train linear models, and how to evaluate their quality.
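As a taste of what’s ahead, two of the simplest such metrics can be computed by hand. This is a minimal sketch; sklearn provides accuracy_score and mean_squared_error with the same semantics:

```python
# Hand-rolled versions of two common evaluation metrics.
def accuracy(y_true, y_pred):
    # Fraction of objects classified correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean squared error for regression.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(mse([3.0, 2.0], [2.5, 2.0]))           # 0.125
```

Accuracy counts matches, while MSE penalizes large regression errors quadratically; the post looks at these and several more refined metrics.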