Categories
Data Mining, Development

AI code-writing assistant that understands data content and generates code

Recently I encountered a client who predicts that "in 6 months, AI will be able to do much of the coding instead of humans".

…in years you'll be able to, on the fly, ask the AI to purchase a server or create a website with X website builder… and basically, I bet it will write code on demand, connecting to these tools' APIs to really make things happen. It could do this now for some easy stuff, but it's unreliable and will mess up.

Now we've encountered an interesting public repo called Sketch. It's an AI code-writing assistant for Pandas (Python) users.
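For a flavor of what that looks like in practice, here is a minimal usage sketch based on the project's README (sales.csv is a hypothetical file, and the questions are illustrative):

```python
# Importing sketch registers a ``.sketch`` accessor on every DataFrame.
import pandas as pd
import sketch

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Ask a question about the data in plain English; the answer comes back as text.
print(df.sketch.ask("Which columns contain personally identifiable information?"))

# Ask for code: howto() returns a generated pandas snippet as a suggestion.
print(df.sketch.howto("Plot monthly revenue as a bar chart"))
```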

Categories
Data Mining

Random Forest vs Gradient boosting

The objective of the task is to build a model that relates, as well as this data allows, molecular information to an actual biological response.

We have shared the data in the comma-separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing the actual biological response: the molecule was seen to elicit the response (1) or not (0). The remaining columns represent molecular descriptors (D1 through D1776): calculated properties that capture some characteristics of the molecule, such as size, shape, or elemental constitution. The descriptor matrix has been normalized.
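As a hedged sketch of the comparison (the file name bioresponse.csv and the parameter values are assumptions, not from the original post), the two ensembles can be scored side by side with cross-validation:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Assumed file name; first column is the 0/1 response, the rest are D1..D1776.
data = pd.read_csv("bioresponse.csv")
y = data.iloc[:, 0]
X = data.iloc[:, 1:]

for model in (RandomForestClassifier(n_estimators=200, random_state=42),
              GradientBoostingClassifier(n_estimators=200, random_state=42)):
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(type(model).__name__, round(scores.mean(), 3))
```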

Categories
Data Mining

Bagging and Random Forest

In this post we work through several tasks using the Bagging and Random Forest classifiers.

We gradually develop a bagging classifier over randomized trees that, in its final form, matches the Random Forest algorithm.

We'll also build the RandomForestClassifier from sklearn.ensemble and study how its quality depends on (1) the number of trees, (2) the number of features considered at each tree node, and (3) the maximum tree depth.

Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset, then combines the predictions from all models.

Random forest is an extension of bagging that also randomly selects the subset of features considered at each split of each tree.
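A minimal sketch of that progression (dataset and parameter values here are illustrative, not from the original post): plain bagging, bagging over trees that subsample features at each split, and the equivalent RandomForestClassifier.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Plain bagging: each tree sees a bootstrap sample but considers all features.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

# Bagging on randomized trees: each split considers sqrt(n_features) features,
# which is exactly the extra randomness a random forest adds.
randomized = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                               n_estimators=100)

forest = RandomForestClassifier(n_estimators=100)

for name, clf in [("bagging", bagging), ("bagging+sqrt", randomized), ("forest", forest)]:
    print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```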

Categories
Data Mining

Sklearn, Random Forest

The objective of the task is to build a model that relates, as well as this data allows, molecular information to an actual biological response.

We have shared the data in the comma-separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing the actual biological response: the molecule was seen to elicit the response (1) or not (0). The remaining columns represent molecular descriptors (D1 through D1776): calculated properties that capture some characteristics of the molecule, such as size, shape, or elemental constitution. The descriptor matrix has been normalized.

Source.
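A hedged sketch of the basic workflow (the file name bioresponse.csv and parameter values are assumptions): fit a forest, score it on a held-out split, and inspect which descriptors matter most.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("bioresponse.csv")   # assumed file name
y, X = data.iloc[:, 0], data.iloc[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# The ten descriptors that contribute most to the splits:
top = pd.Series(forest.feature_importances_, index=X.columns).nlargest(10)
print(top)
```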

Categories
Data Mining

Sklearn Decision trees

We show how to work with decision trees in the Sklearn library.

Sklearn.tree
Sklearn tree examples
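As a minimal sketch of the kind of example covered there (the dataset and depth limit are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limiting depth keeps the tree readable and reduces overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the learned decision rules as plain text.
print(export_text(tree, feature_names=load_iris().feature_names))
```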

Categories
Data Mining

Bike Sharing Demand Problem, part 2 – Sklearn SGD regression model, scaling, transformation chain and Random Forest nonlinear model

The Bike Sharing Demand problem requires using historical data on weather conditions and bicycle rentals to predict the number of bikes rented in a given hour of a given day.

In the original problem statement, there are 11 features available. The feature set contains real-valued, categorical, and binary data. For the demonstration, a training sample, bike_sharing_demand.csv, drawn from the original data is used.
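A hedged sketch of the two models compared in the post, assuming the column names of the Kaggle data ("count" as the target) and keeping only numeric features for brevity:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("bike_sharing_demand.csv")

# "count" is assumed to be the rental target; keep only numeric features here.
X = data.select_dtypes("number").drop(columns="count")
y = data["count"]

# SGD is sensitive to feature scale, so scaling is chained in front of it.
linear = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
forest = RandomForestRegressor(n_estimators=100, random_state=0)

for name, model in [("scaled SGD", linear), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(name, round(-scores.mean(), 2))
```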

See Bike Sharing Demand, part 1, where we performed the initial problem analysis.

Categories
Data Mining

Bike Sharing Demand Problem, part 1 – Sklearn regression model, scaling, transformation chain

The problem is taken from kaggle.com: based on historical data on bicycle rentals and weather conditions, we need to estimate the demand for bike rentals.

In the original problem statement, there are 11 features available. The feature set contains real-valued, categorical, and binary data. For the demonstration, a training sample, bike_sharing_demand.csv, drawn from the original data is used.
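Because the features mix types, each group needs its own transformation before a linear model. A minimal sketch of such a chain, assuming the Kaggle column names (the grouping below is an assumption for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.read_csv("bike_sharing_demand.csv")
y = data["count"]  # assumed target column

# Assumed grouping of the features by type:
real_cols = ["temp", "atemp", "humidity", "windspeed"]
cat_cols = ["season", "weather"]
bin_cols = ["holiday", "workingday"]

# Each feature group gets its own transformation in one chain.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), real_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("keep", "passthrough", bin_cols),
])

model = make_pipeline(preprocess, Ridge())
model.fit(data[real_cols + cat_cols + bin_cols], y)
print("training R^2:", model.score(data[real_cols + cat_cols + bin_cols], y))
```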

See Bike Sharing Demand, part 2, where we perform more advanced analysis.

Categories
Data Mining

Finding Classifier parameters on the grid, Sklearn.grid_search

Let's answer two questions: how do a model's parameters affect its quality, and how can we select the optimal parameters for the task at hand? We will look at the grid_search module in the sklearn library and learn how to select model parameters over a grid.
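Note that sklearn.grid_search is the older module name; in current scikit-learn releases the same estimator lives in sklearn.model_selection. A minimal sketch of the idea, with an illustrative dataset and parameter grid:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases

X, y = load_digits(return_X_y=True)

# Every combination in the grid is fitted and scored by cross-validation.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```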

Categories
Challenge, Data Mining

Finding maximum likelihood estimate for the Bernoulli distribution parameter

"Out of the 15 bank customers to whom the manager offered to connect autopayments, four agreed. Service activation is a binary feature that can be described by the Bernoulli distribution."

Let’s find the maximum likelihood estimate for the parameter p out of such a sample.

1) Likelihood function:

L(X_n, p) = ∏ p^[X_i = 1] * (1 - p)^[X_i = 0] = p^4 * (1 - p)^11

2) To find the maximum likelihood estimate for the parameter p, we take the logarithm of L(X_n, p) and get the following:

ln(p^4 * (1-p)^11) = 4*ln(p) + 11*ln(1-p)

3) Now we take the derivative and set it to zero to find p:

[4*ln(p) + 11*ln(1-p)]' = 4/p - 11/(1-p) = 0

It follows that 4/p = 11/(1-p) => 4*(1-p) = 11*p => 15*p = 4 => p = 4/15 ≈ 0.26667.
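The result is easy to verify numerically: a grid search over the log-likelihood should peak at the same point.

```python
import numpy as np

# Grid-search the log-likelihood 4*ln(p) + 11*ln(1-p) over the open interval (0, 1).
p = np.linspace(0.001, 0.999, 9999)
log_likelihood = 4 * np.log(p) + 11 * np.log(1 - p)

# The maximum should land at p = 4/15 ≈ 0.26667.
print(p[log_likelihood.argmax()])
```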

Categories
Challenge, Data Mining

Linear regression in example: overfitting and regularization

In this post we will set up a linear model to predict the number of bike rentals depending on the calendar characteristics of the day and the weather conditions. We will choose the feature weights so as to capture all the linear dependencies in the data while not taking the extra features into account. This way the model will not overfit and will make fairly accurate predictions on new data.

We'll also interpret the linear dependencies we find, checking whether the discovered patterns correspond to common sense. The main purpose of the task is to show and explain by example what causes overfitting and how to overcome it.
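A self-contained toy illustration of the effect (synthetic data, not the bike-rental set, and Lasso standing in for whichever regularizer the post uses): an L1 penalty suppresses the weights of irrelevant features that plain least squares would fit to noise.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# Only the first three features actually drive the target; the other 27 are noise.
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

for model in (LinearRegression(), Lasso(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, round(-scores.mean(), 3))

# The L1 penalty drives the weights of the noise features to exactly zero.
print(int((Lasso(alpha=0.1).fit(X, y).coef_ != 0).sum()), "non-zero weights")
```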

The code as an IPython notebook