-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating

Machine Learning Algorithms

An unnormalized dataset with many features contains information proportional to the independence of all features and their variance. Let's consider a small dataset with three features, generated with random Gaussian distributions:
Sample dataset containing three Gaussian features with different standard deviations
Even without further analysis, it's obvious that the central line (with the lowest variance) is almost constant and doesn't provide any useful information. Recall from Chapter 2, Important Elements in Machine Learning, that the entropy H(X) is quite small, while the other two variables carry more information. A variance threshold is, therefore, a useful approach to remove all those elements whose contribution (in terms of variability and so, information) is under a predefined level. The scikit-learn library provides the VarianceThreshold
class which can easily solve this problem. By applying it to the previous dataset, we get the following result:
from...