-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating

Machine Learning Algorithms

Imagine that you need to design a spam-filtering algorithm, starting from this initial (over-simplistic) classification based on two parameters:
Parameter | Spam emails (X1) | Regular emails (X2) |
p1 - Contains > 5 blacklisted words | 80 | 20 |
p2 - Message length < 20 characters | 75 | 25 |
We have collected 200 email messages (X) (for simplicity, we consider p1 and p2 as mutually exclusive) and we need to find a couple of probabilistic hypotheses (expressed in terms of p1 and p2), to determine the following:
We also assume the conditional independence of both terms (it means that hp1 and hp2 contribute in conjunction to spam in the same way as they would alone).
For example, we could think about rules (hypotheses) like so: "If there are more than five blacklisted words" or "If the message is less than 20 characters in length" then "the probability of spam is high" (for example, greater than 50%). However, without assigning probabilities, it's difficult to generalize...