Bayesian Filtering
The Bayesian filter is a recently developed anti-spam technique and one of the most important ones. It is built into many email applications these days (such as Outlook 2003, Mozilla Thunderbird, Apple Mail, and G-Lock SpamCombat).
Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the occurrences of this event in the past. This approach is used to identify spam. If some piece of text has occurred mostly in spam emails and rarely in legitimate mail, then it is reasonable to suppose that a new email containing that text is probably spam.
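As a rough illustration of this principle, the sketch below applies Bayes' rule to a single word. The function name, the counts, and the equal prior of 0.5 are assumptions made for the example; they are not taken from any particular product.

```python
def spam_probability_given_word(spam_with_word, spam_total,
                                ham_with_word, ham_total,
                                prior_spam=0.5):
    """Estimate P(spam | word) from past observations using Bayes' rule.

    All arguments are hypothetical counts gathered from previously seen mail;
    prior_spam is an assumed prior probability that any message is spam.
    """
    p_word_given_spam = spam_with_word / spam_total
    p_word_given_ham = ham_with_word / ham_total
    numerator = p_word_given_spam * prior_spam
    denominator = numerator + p_word_given_ham * (1.0 - prior_spam)
    # If the word was never seen at all, fall back to the prior.
    return numerator / denominator if denominator else prior_spam


# Illustrative numbers only: the word appeared in 80 of 100 spam messages
# and in 5 of 100 legitimate messages.
p = spam_probability_given_word(80, 100, 5, 100)
print(round(p, 3))
```

With these made-up counts the word is strong evidence of spam, because it was seen far more often in spam than in legitimate mail.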
To filter mail using the Bayesian technique, you first need to build a database of words collected from spam and legitimate mail. A probability value is then assigned to each word; it is calculated from how often that word occurs in spam as opposed to legitimate mail. Once the legitimate and spam databases have been created during an initial training period, the word probabilities can be calculated and the Bayesian filter is ready for use. When a new message arrives, it is broken into words and the most significant words are singled out. From these words, the Bayesian filter calculates the probability that the new message is spam. If the probability is greater than a spam threshold, say 0.9, the message is classified as spam.
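The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any specific mail client: the class name `BayesianFilter`, the tokenizer, the add-one smoothing, the equal weighting of spam and legitimate mail, and the choice of "most significant" as "farthest from 0.5" are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase words, each counted once per message.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class BayesianFilter:
    def __init__(self, threshold=0.9, max_significant=15):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.ham_counts = Counter()   # word -> number of legitimate messages containing it
        self.spam_total = 0
        self.ham_total = 0
        self.threshold = threshold
        self.max_significant = max_significant

    def train(self, text, is_spam):
        # Training period: add the message's words to the matching database.
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.spam_total += 1
        else:
            self.ham_counts.update(tokenize(text))
            self.ham_total += 1

    def word_probability(self, word):
        # Add-one smoothing keeps the value strictly between 0 and 1.
        s = (self.spam_counts[word] + 1) / (self.spam_total + 2)
        h = (self.ham_counts[word] + 1) / (self.ham_total + 2)
        return s / (s + h)

    def spam_probability(self, text):
        probs = [self.word_probability(w) for w in tokenize(text)]
        # Single out the most significant words: those farthest from 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        top = probs[: self.max_significant]
        # Combine the word probabilities in log space to avoid underflow.
        log_spam = sum(math.log(p) for p in top)
        log_ham = sum(math.log(1.0 - p) for p in top)
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    def is_spam(self, text):
        return self.spam_probability(text) > self.threshold


# Tiny made-up training set for a mailbox that legitimately discusses mortgages.
bayes = BayesianFilter(threshold=0.9)
bayes.train("cheap mortgage rates click now", is_spam=True)
bayes.train("win money now click here", is_spam=True)
bayes.train("mortgage application for client attached", is_spam=False)
bayes.train("meeting notes on mortgage portfolio", is_spam=False)
bayes.train("quarterly mortgage report attached", is_spam=False)

print(bayes.is_spam("click now to win cheap money"))
print(bayes.is_spam("mortgage report for client meeting"))
```

A real filter would be trained on far more mail, but even this toy example shows the mechanism: the first test message is made up of words seen mostly in spam, while the second contains "mortgage" yet is dominated by words this mailbox has seen in legitimate mail.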
It is important to note that the analysis of spam and legitimate mail is performed on the mail that the particular user (person, company, or organization) receives, and therefore the Bayesian filter is tuned to that particular user. For example, a financial institution may receive a lot of legitimate emails containing the word "mortgage" and would get a lot of false positives from an older anti-spam filter that relies on keywords. The Bayesian filter analyzes the entire message containing the word "mortgage" and decides whether the email is spam or legitimate based on more than the single keyword "mortgage". The Bayesian approach to filtering spam is highly effective: spam detection rates of over 99.7% can be achieved with a very low number of false positives.