-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating

Machine Learning Algorithms

Sometimes, a dataset can contain missing features, so there are a few options that can be taken into account:
The first option is the most drastic one and should only be considered when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it's necessary to determine a supervised strategy to train a model for each feature and, finally, to predict their value. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the Imputer
class, which is responsible for filling the holes using a strategy based on the mean (default choice), median, or frequency (the most frequent entry will be used for all the missing ones).
The following snippet shows an example that's using the three approaches...