Categories
Data Mining Development

Work with inbuilt datasets of Sklearn and Seaborn libraries

In this post we show how to generate model data and load standard datasets using the sklearn.datasets module in Python 3.
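As a rough illustration of both tasks, here is a minimal sketch using two functions from sklearn.datasets: make_regression for synthetic data and load_iris for a bundled dataset. The sample counts, noise level and random seed are our own assumptions, chosen only for the example.

```python
from sklearn.datasets import load_iris, make_regression

# Generate synthetic ("model") regression data.
# The sizes, noise level and seed below are illustrative assumptions.
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=0)
print(X.shape, y.shape)

# Load one of the standard datasets bundled with scikit-learn.
iris = load_iris()
print(iris.data.shape)
print(list(iris.target_names))
```

Both calls return NumPy arrays, so the generated and the bundled data can be fed to the same estimators.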

The code of an iPython notebook

Categories
Data Mining

Linear regression and Stochastic Gradient Descent

In this post we’ll show how to build a linear regression model for a data set and use stochastic gradient descent to optimize the model parameters. As in a previous post, we’ll calculate the MSE (mean squared error) and minimize it.
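The idea can be sketched in plain Python as follows. This is a minimal sketch, not the code from the post: the synthetic data (y = 1 + 2x plus noise), the learning rate and the number of steps are all assumptions made for the example.

```python
import random


def mse(w0, w1, xs, ys):
    """Mean squared error of the line y = w0 + w1 * x on the data."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd_linear_regression(xs, ys, lr=0.01, n_steps=20000, seed=0):
    """Fit y = w0 + w1 * x by stochastic gradient descent on the MSE.

    Each step picks one random sample and moves the weights against
    the gradient of that sample's squared error.
    """
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    for _ in range(n_steps):
        i = rng.randrange(len(xs))
        err = w0 + w1 * xs[i] - ys[i]
        w0 -= lr * 2 * err           # d(err^2)/d(w0) = 2 * err
        w1 -= lr * 2 * err * xs[i]   # d(err^2)/d(w1) = 2 * err * x
    return w0, w1


# Synthetic data (an assumption for this sketch): y = 1 + 2x + noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 0.1) for x in xs]

w0, w1 = sgd_linear_regression(xs, ys)
print("weights:", w0, w1)
print("MSE before:", mse(0.0, 0.0, xs, ys), "after:", mse(w0, w1, xs, ys))
```

With a constant learning rate the weights keep jittering slightly around the optimum; decreasing the rate over time is a common way to reduce that jitter.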

Categories
Data Mining

Linear Regression application for data analysis and scientific computing

In this post we’ll share a vivid yet simple application of linear regression: predicting a person’s height from their weight. You’ll see what kind of math is behind it. We will also introduce the basic Python libraries needed for data analysis.

The iPython notebook code

Categories
Data Mining

Classification vs Clustering in Machine Learning

In this post we share some basics of classification and clustering in machine learning. We also review some cluster analysis methods and algorithms.

Categories
Data Mining

Weibull distribution & sample averages approximation using Python and scipy

In this post we show how to plot a histogram for the Weibull distribution and for the distribution of sample averages, as approximated by the Normal (Gaussian) distribution. We’ll show how the accuracy of the approximation changes as the sample size increases.
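The post itself uses scipy; as a dependency-free sketch of the same idea, the snippet below draws Weibull samples with the standard library and compares the spread of the sample averages with the value predicted by the Normal approximation, sigma / sqrt(n). The scale and shape parameters, the sample sizes and the number of trials are assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)
scale, shape = 1.0, 1.5  # assumed Weibull parameters for this sketch

# Theoretical mean and variance of the Weibull distribution.
mu = scale * math.gamma(1 + 1 / shape)
var = scale ** 2 * (math.gamma(1 + 2 / shape) - math.gamma(1 + 1 / shape) ** 2)


def sample_means(n, trials=1000):
    """Return `trials` averages, each over n Weibull-distributed values."""
    return [
        statistics.fmean(rng.weibullvariate(scale, shape) for _ in range(n))
        for _ in range(trials)
    ]


for n in (5, 50, 500):
    means = sample_means(n)
    predicted_std = math.sqrt(var / n)  # Normal approximation: sigma / sqrt(n)
    # Share of averages within one predicted sigma of the true mean;
    # for an exactly Normal distribution this would be about 0.683.
    within = sum(abs(m - mu) <= predicted_std for m in means) / len(means)
    print(n, statistics.stdev(means), predicted_std, within)
```

Plotting a histogram of `means` for each n (for example with matplotlib) shows the same effect visually.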

One may get the full .ipynb file here.

Categories
Development

Invalid data: what is it?

Often we see the terms “invalid data”, “clean data”, “normalize data”. What do they mean for practical data extraction, and how does one deal with them? A picture is worth a thousand words, though:

The value k. A. in the highlighted cells might jeopardize data processing if not properly dealt with.
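One possible way to deal with such cells is to map known placeholder strings to a missing value before any numeric processing. The sketch below is only an illustration: the set of markers, the decimal-comma handling and the sample row are hypothetical, and we assume that "k. A." stands for "no value given".

```python
# Assumed placeholder strings that should be treated as "no value".
MISSING_MARKERS = {"", "k. A.", "n/a", "N/A"}


def clean_cell(value):
    """Normalize one raw cell: None for missing, float for numbers, else text."""
    text = value.strip()
    if text in MISSING_MARKERS:
        return None
    try:
        # Accept a decimal comma as well as a decimal point.
        return float(text.replace(",", "."))
    except ValueError:
        return text


# Hypothetical raw row, for illustration only.
row = ["Berlin", "12,5", "k. A.", " 7 "]
print([clean_cell(cell) for cell in row])  # ['Berlin', 12.5, None, 7.0]
```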
Categories
Data Mining

Simple text analysis with Python

Finding the most similar sentence(s) to a given sentence in a text in less than 40 lines of code 🙂
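One common way to do this, which may differ from the approach taken in the post, is to represent each sentence as a bag of words and compare the word-count vectors by cosine similarity. The sketch below assumes that technique; the tokenizer and the example sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into words."""
    return re.findall(r"[a-z']+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, sentences):
    """Return the sentence whose word counts are closest to the query's."""
    q = Counter(tokenize(query))
    return max(sentences, key=lambda s: cosine_similarity(q, Counter(tokenize(s))))


# Made-up example sentences.
sentences = [
    "Cats sleep most of the day.",
    "Stock prices fell sharply on Monday.",
    "Dogs and cats can live together peacefully.",
]
print(most_similar("My cats sleep all day", sentences))
```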

Categories
Data Mining

Big Data, Data Analytics, Data Analysis, Data Mining, Data Science & Machine Learning

In this post, we’d like to share some of the most interesting terms that are used in today’s science and IT world. We think you will benefit from getting familiar with these modern tech-age expressions.

Categories
Data Mining

Distributed File System Implementations and MapReduce strategy

In the previous post we mentioned MapReduce, a style of distributed computation used for data analysis on computing clusters. Here we look more closely at how this strategy is implemented on distributed hardware.

Categories
Data Mining

Implementing frequent itemsets algorithm thru MapReduce

The problem of finding frequent itemsets in data analysis is described in this post; here I lay out the practical steps for finding frequent itemsets through MapReduce.
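To give a feel for the map, shuffle and reduce stages, here is a single-machine sketch that counts itemsets of a fixed size and keeps those meeting a support threshold. It is a simplified illustration, not necessarily the algorithm the post describes: the function names, the baskets and the threshold are all hypothetical, and a real cluster would run the map and reduce tasks in parallel on separate chunks of the data.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(basket, k):
    """Map: emit (itemset, 1) for every k-item subset of one basket."""
    for itemset in combinations(sorted(set(basket)), k):
        yield itemset, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, min_support):
    """Reduce: sum the counts per itemset and keep the frequent ones."""
    totals = {key: sum(values) for key, values in groups.items()}
    return {key: count for key, count in totals.items() if count >= min_support}


def frequent_itemsets(baskets, k, min_support):
    emitted = (pair for basket in baskets for pair in map_phase(basket, k))
    return reduce_phase(shuffle(emitted), min_support)


# Hypothetical baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]
print(frequent_itemsets(baskets, k=2, min_support=3))  # {('bread', 'milk'): 3}
```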