Bayesian Filtering with ze-filter

What is Bayesian filtering, and how ze-filter implements it

Statistical (Bayesian) filtering is used by many mail filters. Although the first research results on using Bayesian filters against spam were presented in 1998 by Sahami, Heckerman, Dumais and Horvitz, they were put into practice only four years later, after Paul Graham posted an article about it: A Plan for Spam.

There are many implementations of Bayesian filtering. Most of them are quite different, and each one naturally tries to be the best. Although the basic idea is the same, they differ in the way tokens are extracted from messages, the way a score is assigned to a message, and the way the filter handles this score. Strictly speaking, though, none of them is really a “Bayesian classifier”. Some of those that come close to Bayesian filters use a kind of “brand name” to set their filter apart from the others.

Currently, ze-filter uses the score from the Bayesian filter to confirm or invalidate the score assigned by other filtering criteria. This is what is called “boosting”.

ze-filter is designed with speed in mind, so message handling time remains almost the same when Bayesian filtering is enabled.

To let ze-filter do Bayesian filtering, the steps are:

Create a corpus of messages (hams and spams). Bayesian filtering will be based on the content of these messages.
Create the tokens database from the corpus of messages and update it regularly.
Modify the ze-filter configuration files to enable Bayesian filtering.

If you live in France and you filter messages for a French community, you can download the learning database (via rsync) once a day instead of training the statistical filter yourself. To preserve privacy, tokens are hashed with MD5.

To set this up, take a look at the get-bayes script in the /var/ze-filter/cdb directory.
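As an illustration of what hashing tokens with MD5 means, here is a minimal Python sketch. It is not taken from ze-filter: the function name `hash_token`, the UTF-8 encoding and the absence of any token normalisation are assumptions made for the example, and the real database may prepare tokens differently before hashing.

```python
import hashlib


def hash_token(token: str) -> str:
    """Return the hexadecimal MD5 digest of a token.

    Assumption: the token is encoded as UTF-8 and hashed as-is.
    """
    return hashlib.md5(token.encode("utf-8")).hexdigest()


# The database would then store digests instead of readable words.
for word in ["meeting", "invoice"]:
    print(word, "->", hash_token(word))
```

A digest cannot be turned back into the word directly, but identical words always give identical digests, which is what lets the filter look tokens up in a shared database.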

Configuration

Usually, the only thing to do in the configuration files is to enable Bayesian filtering. The other parameters have default values that usually don't need any tuning.

  BAYESIAN_FILTER            YES
  BAYES_MAX_MESSAGE_SIZE     200K
  BAYES_MAX_PART_SIZE        30K
  DB_BAYES                   ze-bayes.db
  BAYES_HAM_SPAM_RATIO       1000 
  BAYES_NB_TOKENS            64
  BAYES_UNKNOWN_TOKEN_PROB   500

The only options you should modify are BAYESIAN_FILTER, BAYES_MAX_MESSAGE_SIZE and BAYES_MAX_PART_SIZE.

You'll need to modify the /var/ze-filter/cdb/Makefile file to add the tokens database (ze-bayes.db) to the objects to maintain.

OBJ     = ze-urlbl.db ze-policy.db ze-rcpt.db ze-bayes.db

The next section explains how to create and maintain the tokens database.

How to create a corpus of messages

In theory, a statistical filter classifies incoming messages based on its knowledge of the usual messages (spams and hams) received by the final user. The filter should know what the recipient's mailbox looks like. Acquiring this knowledge is what people call “learning” or “training the filter”.

So, it must have access to a set of messages that is representative, both qualitatively and quantitatively, of the final user's mailbox.

In practice, for many reasons outside the scope of this document, it's impossible to build a perfect set of messages, especially if the filter is to be applied to many recipients.

Fortunately, despite this apparent difficulty, it's possible to build a corpus of messages good enough to reduce the quantity of spam to an acceptable level.

There are many ways to create a spam corpus and to train a filter (Train-on-everything, Train-on-errors, Train-until-mature, Train-until-no-errors, …). But if you examine them closely, none of them really matches any theoretical statistical model of Bayesian filtering. Each one has its pros and cons.

This is how I manage the corpus of messages on our production server. It is only one approach and may not be the best choice for your environment, but you can surely begin this way. The underlying ideas are:

Different users receive almost the same spams
Spam evolves faster than ham.
The set of legitimate messages of a given user can be, roughly speaking, considered as a linear combination of some categories of messages.
The corpus of messages should include all kinds of messages that qualitatively represent the recipients' mailboxes. Their presence in the corpus is more important than the frequency with which they really appear (but don't cheat too much).
A good corpus includes at least 10000 hams and 10000 spams.

Spam corpus

To create my spam corpus, I use:

spams I receive in my own mailbox
spams I receive in some spamtraps I've set up on some web servers
spams passed on by friends when they find one that the filter doesn't detect

All these spams are sorted by source and by month into different files (e.g. spamtrap-2006-09.sbox …).

The corpus of spam messages is updated daily to add new fresh messages and to remove messages older than 6 months.
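The six-month expiry can be automated from the file names alone. The following Python sketch assumes mailboxes are named `source-YYYY-MM.sbox`, as in the spamtrap-2006-09.sbox example; the function names and the decision to compare calendar months rather than exact dates are choices made for this illustration, not part of ze-filter.

```python
import re
from datetime import date
from typing import List, Optional

# Matches names such as "spamtrap-2006-09.sbox" (source-YYYY-MM.sbox).
NAME_RE = re.compile(r"^(?P<source>.+)-(?P<year>\d{4})-(?P<month>\d{2})\.sbox$")


def age_in_months(filename: str, today: date) -> Optional[int]:
    """Months elapsed between the mailbox's month and today's month.

    Returns None when the file name does not follow the assumed pattern.
    """
    m = NAME_RE.match(filename)
    if m is None:
        return None
    year, month = int(m.group("year")), int(m.group("month"))
    return (today.year - year) * 12 + (today.month - month)


def expired(filenames: List[str], today: date, max_age_months: int = 6) -> List[str]:
    """Return the spam mailboxes that are more than max_age_months old."""
    old = []
    for name in filenames:
        age = age_in_months(name, today)
        if age is not None and age > max_age_months:
            old.append(name)
    return old
```

The returned list can be reviewed before anything is deleted; files whose names do not match the pattern are never reported.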

Ham corpus

To create my ham corpus, I use:

messages from my own mailbox (a computer scientist).
messages from some colleagues' mailboxes (an engineering manager, a geologist, an administrator).
messages from some French discussion lists (lawyers, college education, computer science…). To preserve confidentiality, I asked these friends to give me access to a set of their messages in a way that let me run a program without reading the messages. The criterion they should use to select messages is simply: messages they want to be classified as normal messages.
messages from some English discussion lists (computer science)
some usual or typical newsletters

The corpus of ham messages is updated every 2-3 months to add wrongly classified messages and to remove messages older than 3 years.

Some notes...

The ideas above may seem too empirical, but they really aren't. Filter results are more sensitive to the way the filter's tokenizer works than to the quality of the corpus of messages. But this doesn't mean the corpus isn't important: it MUST roughly match the current flow of messages. It's up to you to roughly identify the kinds of messages appearing in real traffic and to roughly choose their proportions in the corpus.

In some way, creating a good corpus of messages is an iterative process:

1. Select a set of test messages (not used in training): spams and hams.
2. Add spams and hams to the corpus of messages.
3. Create the training database from the corpus of messages.
4. Check the training database against the set of test messages.
5. Is the filter efficiency good enough?
   No → go to step 2
   Yes → go to step 6
6. Replace the online training database with the new database.
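The "good enough" check in this loop has to be turned into numbers at some point. The Python sketch below shows one possible way, assuming you have collected the Bayesian scores of the test hams and test spams as two lists of values in [0, 1]. The 0.5 threshold and the acceptable error rates are placeholders chosen for the example, not values recommended by ze-filter.

```python
from typing import Sequence, Tuple


def error_rates(ham_scores: Sequence[float],
                spam_scores: Sequence[float],
                threshold: float = 0.5) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    A ham scoring above the threshold counts as a false positive,
    a spam scoring at or below it counts as a false negative.
    """
    if not ham_scores or not spam_scores:
        raise ValueError("both test sets must be non-empty")
    false_pos = sum(1 for s in ham_scores if s > threshold)
    false_neg = sum(1 for s in spam_scores if s <= threshold)
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


def good_enough(ham_scores: Sequence[float],
                spam_scores: Sequence[float],
                max_fp: float = 0.001,
                max_fn: float = 0.05,
                threshold: float = 0.5) -> bool:
    """Decide whether to stop iterating (placeholder limits)."""
    fp, fn = error_rates(ham_scores, spam_scores, threshold)
    return fp <= max_fp and fn <= max_fn
```

The limit on false positives is deliberately much stricter than the one on false negatives, since losing a legitimate message usually costs more than letting a spam through.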

You don't need to repeat this whole iterative process each time you update the training database, but you should certainly check it from time to time.

Updating Training Database

The simplest way to maintain the training database is to use the contents of the bayes-toolbox directory you'll find inside the ze-filter distribution tree. This directory contains a Makefile with rules to create the tokens database from mailboxes, and two sample mailboxes (ham and spam).

You can install this directory just after installing ze-filter. Do it with the following commands in the ze-filter distribution root directory:

make install
make install-learn

Copy your spam mailboxes into this directory. Spam mailboxes have a .sbox extension.
Copy your ham mailboxes into this directory. Ham mailboxes have a .hbox extension.
To ease organisation of this directory, you can split the mailboxes into as many files as you want.
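Before running make, it can be useful to check how many messages each side of the corpus contains, for instance against the 10000 hams and 10000 spams suggested earlier. The Python sketch below counts messages per extension using the standard `mailbox` module. It assumes the .hbox and .sbox files are in mbox format, which this document does not state explicitly; the function names are made up for the example.

```python
import mailbox
from pathlib import Path
from typing import Dict


def count_messages(directory: str) -> Dict[str, int]:
    """Count messages in *.hbox (ham) and *.sbox (spam) files.

    Assumption: every mailbox file is in mbox format.
    """
    counts = {"ham": 0, "spam": 0}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".hbox":
            kind = "ham"
        elif path.suffix == ".sbox":
            kind = "spam"
        else:
            continue  # ignore Makefile, .tok files, etc.
        box = mailbox.mbox(str(path), create=False)
        try:
            counts[kind] += len(box)
        finally:
            box.close()
    return counts


def corpus_large_enough(counts: Dict[str, int], minimum: int = 10000) -> bool:
    """True when both ham and spam counts reach the minimum."""
    return counts["ham"] >= minimum and counts["spam"] >= minimum
```

Calling `count_messages` on the bayes-toolbox directory returns a dictionary with a "ham" and a "spam" entry that can be passed to `corpus_large_enough`.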

When you've put all the mailboxes together, you can simply type make, and everything will be done.

Relevant features of each message/mailbox will be extracted to generate a .tok file. E.g. features from spamtrap-0609.sbox will be extracted into spamtrap-0609.tok. Features from the .tok files will be aggregated into the training database, named ze-bayes.txt.

If you add or update a mailbox, typing make will recreate the training database and update only what's needed.

This Makefile needs GNU make.
The disk space needed to create the training database (with temporary files) may be substantial, around 1 GByte. To save some space, you can remove the “.sbox” and “.hbox” files and keep only the “.tok” files.

After the training database is created or updated, install it in the ze-filter configuration directory.

The complete training sequence of commands is similar to:

cd /var/ze-filter/bayes-toolbox
make
make install
cd /var/ze-filter/cdb
make

learn.sh, a script found in the bayes-toolbox directory, eases the addition of messages to the ze-filter learning chain.

Testing (command line tool)

ze-bayes-tbx is a command line tool that performs most tasks related to the Bayesian filter, other than online filtering. The functions related to training the filter are called from the Makefile in the bayes-toolbox directory, so you probably won't need to call them yourself.

Most of the time you'll use this tool to evaluate the quality of the learning database and the efficiency of the filter. You'll type something like:

$ ze-bayes-tbx -c -x -p mailbox
# Checking mailbox Ham.2006
  0 :  0.000  1426 ********************************************************************************
  1 :  0.050     5 *****
  2 :  0.100    20 ********************
  3 :  0.150    23 ***********************
  4 :  0.200    12 ************
  5 :  0.250     6 ******
  6 :  0.300     7 *******
  7 :  0.350     5 *****
  8 :  0.400     6 ******
  9 :  0.450     9 *********
 10 :  0.500     4 ****
 11 :  0.550     1 *
 12 :  0.600     2 **
 13 :  0.650     0
 14 :  0.700     3 ***
 15 :  0.750     0
 16 :  0.800     1 *
 17 :  0.850     0
 18 :  0.900     0
 19 :  0.950     1 *
 20 :  1.000     0
    :         1531 Messages
$

The previous command assumes you've already installed the training database at the standard location (/var/ze-filter/cdb/ze-bayes.db). If this isn't the case, or if you want to use a database installed elsewhere, use the -b option to specify a different location.
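To follow the quality of the database over time, the histogram can be reduced to a single figure. The Python sketch below parses output shaped like the listing above and computes the fraction of messages falling in bins labelled above a given score. The row format is inferred from that listing only, and whether a label is the lower bound or the centre of its bin is not stated here, so treat the result as approximate.

```python
import re
from typing import List, Tuple

# A histogram row looks like: " 11 :  0.550     1 *"
ROW_RE = re.compile(r"^\s*(\d+)\s*:\s*([01]\.\d+)\s+(\d+)")


def parse_histogram(text: str) -> List[Tuple[float, int]]:
    """Return (score_label, count) pairs; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((float(m.group(2)), int(m.group(3))))
    return rows


def fraction_above(rows: List[Tuple[float, int]], limit: float) -> float:
    """Fraction of messages in bins whose label is strictly above limit."""
    total = sum(count for _, count in rows)
    above = sum(count for score, count in rows if score > limit)
    return above / total if total else 0.0
```

Applied to the Ham.2006 listing above, the parsed counts add up to the 1531 messages reported on the last line, and 8 of them fall in bins labelled above 0.500.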

Understanding bayesian filter results

When applied to a message, the Bayesian filter assigns it a score. This score is a number in the interval [0,1]. Spam scores are near 1 and ham scores are near 0.

Currently, the Bayesian score is used to confirm or invalidate the score assigned by the other content filtering methods: pattern matching, heuristic filter and URL filtering. The rule is simple:

Let b be the score given by the bayesian filter
Let c be the score given by other ze-filter content filters

If b > 0.50 → c = MAX(c, 1)
If b > 0.75 → the final score is equal to c multiplied by some coefficient (greater than 1).
If b < 0.25 → the final score is equal to c divided by some coefficient (greater than 1).
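The rule above can be written as a short function. This Python sketch is only an illustration: the value of the coefficient is not given in this document, so `coef=2.0` is a placeholder, and applying the three conditions in the order listed (so that a score above 0.75 is first raised to at least 1 and then multiplied) is an assumption about how they combine.

```python
def combine_scores(b: float, c: float, coef: float = 2.0) -> float:
    """Combine the Bayesian score b with the content score c.

    coef is a placeholder; the document only says it is greater than 1.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("Bayesian score must be in [0, 1]")
    if coef <= 1.0:
        raise ValueError("coefficient must be greater than 1")
    if b > 0.50:
        c = max(c, 1.0)      # the Bayesian filter leans towards spam
    if b > 0.75:
        return c * coef      # strong confirmation: boost the score
    if b < 0.25:
        return c / coef      # strong contradiction: reduce the score
    return c                 # in between: leave the score as it is
```

With the placeholder coefficient, a message with b = 0.9 that no other filter flagged (c = 0) ends up with a final score of 2.0, while a message with b = 0.1 and c = 4 is brought down to 2.0.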
