2. Data Preprocessing Using Sklearn library

Mansi Khatri
4 min read · Aug 29, 2021


What is Data Preprocessing?

Data preprocessing is the process of converting raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not suitable for analysis, so preprocessing transforms it into a consistent, understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain errors.

Data Description

Here I have used Iris, one of the most famous datasets. It contains measurements of iris flowers (sepal length and width, petal length and width), and from these the species is identified as Iris-setosa, Iris-versicolor, or Iris-virginica.

Data information

Data Preprocessing Steps

These are the basic steps we need to follow:

  1. Data Encoding
  2. Normalization
  3. Standardization
  4. Discretization
  5. Imputation of missing values

1. Data Encoding

We use data encoding to convert categorical variables into numeric or binary values. Common encoding techniques include:

  • Label encoding
  • One-hot encoding
  • Dummy encoding
  • Hash encoding
  • Target encoding

Label Encoding

Label encoding converts labels into numeric form so that they are machine-readable. Machine learning algorithms, which expect numeric input, can then work with those labels directly. It is an important preprocessing step for structured datasets in supervised learning.

Here the species column, which was previously stored as objects (strings), has been converted to the numeric values 0, 1, and 2.
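The article's own code is not reproduced here, so the following is only a minimal sketch of label encoding with sklearn's LabelEncoder. The short `species` list is a made-up sample in the style of the Iris labels, not the actual dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels (not the real Iris column)
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
# fit_transform learns the distinct classes (sorted alphabetically)
# and maps each label to its integer index
encoded = encoder.fit_transform(species)

print(encoder.classes_.tolist())
print(encoded.tolist())  # [0, 1, 2, 0]

# The mapping can be reversed to recover the original labels
print(encoder.inverse_transform([2]).tolist())
```

Because the integers are assigned in sorted order, they carry no real ranking. For input features, one-hot encoding is often preferred when a model might read an order into the numbers.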

One Hot Encoding

One-hot encoding is another way of preparing categorical data for an algorithm. Each category value becomes its own new column, and every row gets a 1 in the column for its category and a 0 in all the others. Each category is therefore represented as a binary vector, which avoids implying an order between categories.
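As a rough illustration, one-hot encoding could be done with sklearn's OneHotEncoder as sketched below. The `species` array is a hypothetical sample, and the encoder's default sparse output is converted to a dense array only to make it easy to print.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample: one categorical column, shape (n_samples, 1)
species = np.array([
    ["Iris-setosa"],
    ["Iris-versicolor"],
    ["Iris-virginica"],
    ["Iris-setosa"],
])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense
one_hot = encoder.fit_transform(species).toarray()

print(encoder.categories_[0].tolist())  # column order of the output
print(one_hot)  # each row has a single 1 in the column of its species
```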


2. Normalization

In normalization, we rescale features that have different scales to a common scale, which makes the data easier to process for modeling. As a result, all features (variables) tend to have a similar influence on the model.

Min-max normalization rescales each feature by subtracting the feature's minimum value from each data value and then dividing by the feature's range: x_norm = (x - x_min) / (x_max - x_min).
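The min-max rule described above can be sketched with sklearn's MinMaxScaler. The small array `X` is invented for illustration, and the manual calculation is included only to show that the scaler applies the same subtract-the-minimum, divide-by-the-range rule to each column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns on different scales
X = np.array([
    [5.1, 1.4],
    [4.9, 1.4],
    [7.0, 4.7],
    [6.3, 6.0],
])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

# Same rule by hand: (x - min) / (max - min), applied column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

After scaling, each column has a minimum of 0 and a maximum of 1.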

3. Standardization

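Standardization rescales each feature to have zero mean and unit standard deviation. The formula is z = (x - μ) / σ, where μ is the feature's mean and σ is its standard deviation.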

The preprocessing.scale(data) function standardizes the data values so that each feature has a mean of zero and a standard deviation of 1.

The sklearn library also offers the StandardScaler() class, which performs the same standardization and can reuse the statistics learned from one dataset (for example, the training set) to transform another.
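A minimal sketch of both approaches follows, using a made-up array `X` in place of the Iris features. It assumes the usual convention that each column is standardized independently.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Hypothetical feature matrix (not the real Iris data)
X = np.array([
    [5.1, 1.4],
    [4.9, 1.4],
    [7.0, 4.7],
    [6.3, 6.0],
])

# One-shot function: standardizes each column of X
X_func = scale(X)

# Estimator: learns the mean and standard deviation in fit, applies them in transform
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
print(np.allclose(X_func, X_std))
```

StandardScaler is usually the more practical choice in a modeling workflow, since the fitted scaler can later be applied to new data with `scaler.transform`.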

4. Discretization

Discretization converts a continuous variable into a fixed number of discrete bins. Three common strategies are:

Uniform Discretization Transform: Each bin has the same width in the span of possible values for the variable.

Quantile Discretization Transform: Each bin has the same number of values, split based on percentiles.

Clustered Discretization Transform: Clusters are identified and examples are assigned to each group.
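One way to try the three strategies is sklearn's KBinsDiscretizer, whose `strategy` parameter accepts "uniform", "quantile", and "kmeans". This is a sketch on a small invented column of values. The choice of three bins and ordinal encoding is an assumption made for readability, and recent sklearn versions may print warnings about changing defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single feature with unevenly spread values
values = np.array([1, 2, 3, 11, 12, 13, 14, 15, 30], dtype=float).reshape(-1, 1)

def bin_labels(strategy):
    """Return the bin index assigned to each value under the given strategy."""
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned = discretizer.fit_transform(values)
    return binned.ravel().astype(int).tolist()

uniform_bins = bin_labels("uniform")    # equal-width bins
quantile_bins = bin_labels("quantile")  # roughly equal counts per bin
kmeans_bins = bin_labels("kmeans")      # bins from 1D k-means clusters

print("uniform :", uniform_bins)  # [0, 0, 0, 1, 1, 1, 1, 1, 2]
print("quantile:", quantile_bins)
print("kmeans  :", kmeans_bins)
```

With equal-width bins the value 30 sits alone in the last bin, while the quantile strategy spreads the values more evenly across bins.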

5. Imputation of missing values

Missing data are values that are not recorded in a dataset, that is, no value is stored for a variable in an observation. A single cell may be missing, or an entire observation (row). Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population). Sometimes missing values are caused by the researcher, for example when data collection is done improperly or mistakes are made in data entry.

Here, SepalLengthCm has some missing values. We fill each missing cell with the mean of that column, and then check again whether any missing values remain.
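The original code is not shown, so this is only one possible way to do the mean imputation, using sklearn's SimpleImputer. The tiny DataFrame is invented. Only the column name SepalLengthCm comes from the article, and SepalWidthCm is an assumed second column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows; one SepalLengthCm value is missing
df = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 6.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 2.9],
})

print(df.isnull().sum())  # count missing values per column before imputing

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df[["SepalLengthCm"]] = imputer.fit_transform(df[["SepalLengthCm"]])

print(df.isnull().sum())  # check again: no missing values should remain
print(df)
```

The missing cell ends up holding the mean of the three observed values, (5.1 + 4.7 + 6.0) / 3.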

Learning Outcomes

  • Filling in Missing values
  • Dealing with Categorical Data
  • Normalization of Dataset for improved results

I hope you like this stuff.

Github link: code
