3. Data Pre-processing with Data Reduction Techniques in Python

Mansi Khatri
5 min read · Oct 28, 2021

AIM: Perform the following data pre-processing tasks in Python: data reduction using variance threshold, univariate feature selection, recursive feature elimination, PCA, and correlation.

Datasets nowadays are very detailed; including more features makes the model more complex, and it may overfit the data. Some features can be noise and potentially damage the model. By removing those unimportant features, the model may generalize better.

The scikit-learn documentation lists different feature selection methods. Here, we will apply different feature selection methods to the same dataset to compare their performance.

About Dataset

The dataset used for carrying out data reduction is the ‘Iris’ dataset available in the sklearn.datasets library.

Now, let's see the information about the dataset
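
A minimal sketch of loading and inspecting the data, assuming pandas and scikit-learn are available:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data and wrap the features in a DataFrame for inspection.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

X.info()             # column types and non-null counts
print(X.describe())  # basic summary statistics
```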

The data have four features. To test the effectiveness of different feature selection methods, we add some noise features to the data set.

Adding noise
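
One way the noise could be added; the number of noise columns, their uniform distribution, and the random seed are assumptions for illustration:

```python
import numpy as np

# Append 10 random noise columns so the data has 14 features in total.
# Non-negative uniform noise is assumed (it also keeps chi2 applicable later).
rng = np.random.RandomState(42)
noise = pd.DataFrame(rng.uniform(0, 1, size=(X.shape[0], 10)),
                     columns=[f"noise_{i}" for i in range(10)])
X = pd.concat([X, noise], axis=1)
print(X.shape)  # (150, 14)
```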

The dataset now has 14 features. Before applying the feature selection methods, we need to split the data first. The reason is that we select features based only on the information from the training set, not on the whole dataset. We should hold out part of the whole dataset as a test set to evaluate the performance of the feature selection and the model. Thus, the information from the test set is never seen while we conduct feature selection and train the model.

Splitting The Dataset
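
A possible split, assuming a 70/30 ratio and a fixed random_state:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set; feature selection will be fitted on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```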

We will apply the feature selection based on X_train and y_train.

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the data isn’t affected here.
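
A minimal sketch with the default threshold of 0.0:

```python
from sklearn.feature_selection import VarianceThreshold

# The default threshold=0.0 removes only constant (zero-variance) features;
# none of our 14 features are constant, so all of them are kept.
vt = VarianceThreshold()
X_train_vt = vt.fit_transform(X_train)
print(X_train_vt.shape[1])  # still 14 features
```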

Univariate Feature Selection

  • Univariate feature selection works by selecting the best features based on univariate statistical tests.
  • We compare each feature to the target variable, to see whether there is a statistically significant relationship between them.
  • When we analyze the relationship between one feature and the target variable we ignore the other features. That is why it is called ‘univariate’.
  • Each feature has its own test score.
  • Finally, all the test scores are compared, and the features with top scores will be selected.
  • These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
  • For regression: f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif
  1. f_classif

Also known as the ANOVA F-test.

ANOVA Test
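
A sketch using SelectKBest with f_classif; keeping k=4 features is an assumption:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the k best features according to the ANOVA F-test between feature and target.
skb_f = SelectKBest(score_func=f_classif, k=4)
X_train_f = skb_f.fit_transform(X_train, y_train)
print(X_train.columns[skb_f.get_support()])  # names of the selected features
```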

2. chi2

chi2 Test
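
A similar sketch with chi2, which requires non-negative feature values (true for this data); k=4 is again an assumption:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 scores each non-negative feature against the class labels.
skb_chi2 = SelectKBest(score_func=chi2, k=4)
X_train_chi2 = skb_chi2.fit_transform(X_train, y_train)
print(X_train.columns[skb_chi2.get_support()])
```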

3. mutual_info_classif

Mutual information comes in two variants:

  1. mutual_info_classif for classification
  2. mutual_info_regression for regression
mutual_info_classif Test
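
The same pattern with mutual_info_classif (k=4 assumed):

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information can capture non-linear dependencies between feature and target.
skb_mi = SelectKBest(score_func=mutual_info_classif, k=4)
X_train_mi = skb_mi.fit_transform(X_train, y_train)
print(X_train.columns[skb_mi.get_support()])
```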

Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

RFE using Random Forest Classifier
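
A sketch of RFE wrapped around a Random Forest; the number of trees, the number of selected features, and the random seed are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# RFE repeatedly fits the forest and drops the least important features
# until only n_features_to_select remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42),
          n_features_to_select=4)
rfe.fit(X_train, y_train)
print(X_train.columns[rfe.support_])
```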

Differences Between Before and After Using Feature Selection

a. Before using Feature Selection

b. After using Feature Selection (using f_classif)
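
One way such a comparison could be produced, assuming a Random Forest classifier and the f_classif-selected columns from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Before feature selection: train on all 14 features.
clf_all = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf_all.predict(X_test)))

# After feature selection: train on the f_classif-selected features only.
cols = X_train.columns[skb_f.get_support()]
clf_sel = RandomForestClassifier(random_state=42).fit(X_train[cols], y_train)
print(classification_report(y_test, clf_sel.predict(X_test[cols])))
```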

There are clear differences in precision, recall, f1-score, and accuracy between the two outputs. This shows the importance of using feature selection to increase the performance of the model.

Principal Component Analysis (PCA)

We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm. A more common way of speeding up a machine learning algorithm is by using Principal Component Analysis (PCA).

If your learning algorithm is too slow because the input dimension is too high, then using PCA to speed it up can be a reasonable choice. This is probably the most common application of PCA. Another common application of PCA is for data visualization.

For a lot of machine learning applications, it helps to be able to visualize your data. Visualizing 2 or 3-dimensional data is not that challenging. The Iris dataset used is 4 dimensional. We will use PCA to reduce that 4-dimensional data into 2 or 3 dimensions so that you can plot and hopefully understand the data better.

So, now let's execute PCA for visualization on the Iris dataset.

PCA Projection to 2D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects this original 4-dimensional data into 2 dimensions. The new components are just the two main dimensions of variation.
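
A minimal sketch of the 2D projection on the four original Iris features:

```python
from sklearn.decomposition import PCA

# Project the four original Iris features onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(iris.data)

pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])
pca_df["target"] = iris.target
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```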

Now, let's visualize the data frame, execute the following code:
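
A possible 2D scatter plot of the two components, assuming Matplotlib:

```python
import matplotlib.pyplot as plt

# Scatter of the two principal components, coloured by class.
for label, name in enumerate(iris.target_names):
    subset = pca_df[pca_df["target"] == label]
    plt.scatter(subset["PC1"], subset["PC2"], label=name)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.show()
```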

Now let's visualize a 3D graph,
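
A sketch of the 3D version, projecting onto three principal components:

```python
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed only on older Matplotlib)

# Project to three components and plot them in 3D, coloured by class.
pca3 = PCA(n_components=3)
components3 = pca3.fit_transform(iris.data)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(components3[:, 0], components3[:, 1], components3[:, 2], c=iris.target)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```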

Summary

In this blog, I have tried to use different feature selection methods on the same data and evaluated their performances.

Compared with using all the features to train the model, the model performs better if we use only the features retained after feature selection.

After using feature selection, PCA has been used to visualize the data frame with reduced components in 2D as well as 3D.
