Heuristic filter ORACLE
Introduction
“The Merriam-Webster Dictionary”
oracle : one held to give divinely inspired answers or revelations
ze-filter's oracle is a set of tests for weak spam indicators. Some examples are :
- the message contains a text/html part, but not a text/plain part.
- the message contains text in many colours
- the subject is entirely in capital letters (HI CAPS)
- the mailer is one usually found in spam
- envelope From is NULL sender (<>), but the header sender isn't postmaster, MAILER-DAEMON, …
- there are two subject headers
- there are some HTML tags mostly found in spam.
- …
Heuristics may include, but are not limited to, looking for regular expressions inside some parts of messages.
The main goal of these heuristics isn't to detect spam on their own, since they are only weak spam indicators. The heuristic filter isn't a main filtering method, but it can help confirm the results of the two main filtering methods : the Bayes filter and URL filtering.
The number of tests is small : fewer than 40 at present. Only really relevant checks are integrated into the oracle.
There are 4 check categories in ORACLE :
- CONN - Checks in this category are related to the SMTP connection/session
- MSGS - Checks done on the message as a whole : headers, …
- HTML - Checks on the text/html MIME part
- PLAIN - Checks on the text/plain MIME part
Configuration - Beginner users
Just enable it !
# SPAM_ORACLE
#   Do heuristic filtering
#   Syntax : -----
#   VALUES : NO YES
SPAM_ORACLE YES
If you want to use RBLs with the Oracle, take a look at the “Expert users” section.
Configuration - Expert users
ze-filter's oracle uses two configuration files :
/etc/ze-filter/ze-tables
- this file is used to enable/disable each Oracle test and assign odds to them.
/etc/ze-filter/ze-oradata
- this file is used to define unwanted things and to assign odds to them. Unwanted things may be one of :
  - HTML-TAGS
  - BAD-EXPR
  - CHARSET
  - BOUNDARY
  - MAILER
To change the names of these files, you can edit the ze-filter.cf file :
# ORACLE_DATA_FILE
#   Some oracle definitions
#   Syntax : -----
ORACLE_DATA_FILE ze-oradata
# ORACLE_SCORES_FILE
#   Oracle scores
#   Syntax : -----
ORACLE_SCORES_FILE ze-tables
How to change original Oracle checks
If you want to enable/disable tests or change their odds, edit the ze-tables configuration file :
C05 DISABLE odds=1.000 SMTP client sending mail to spamtrap
C06 DISABLE odds=1.000 Bad EHLO parameter
C07 DISABLE odds=1.000 Myself EHLO parameter - forged
M01 ENABLE odds=1.000 No HTML nor TEXT parts
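To audit which checks are switched on, entries of this shape can be read with a few lines of Python. This is only a sketch: the line layout (check ID, ENABLE/DISABLE, `odds=` value, free-text description) is inferred from the sample entries above, and the names `OracleCheck` and `parse_check_line` are hypothetical, not part of ze-filter.

```python
import re
from typing import NamedTuple, Optional


class OracleCheck(NamedTuple):
    check_id: str      # e.g. "C05", "M01"
    enabled: bool      # True for ENABLE, False for DISABLE
    odds: float
    description: str


# Layout assumed from the sample entries: ID, state, odds=<number>, description.
_CHECK_RE = re.compile(
    r"^(?P<id>[A-Z][0-9]{2})\s+"
    r"(?P<state>ENABLE|DISABLE)\s+"
    r"odds=(?P<odds>[0-9]+(?:\.[0-9]+)?)\s+"
    r"(?P<desc>\S.*)$"
)


def parse_check_line(line: str) -> Optional[OracleCheck]:
    """Parse one check entry; return None for lines that do not match."""
    m = _CHECK_RE.match(line.strip())
    if m is None:
        return None
    return OracleCheck(m["id"], m["state"] == "ENABLE", float(m["odds"]), m["desc"])


sample = [
    "C05 DISABLE odds=1.000 SMTP client sending mail to spamtrap",
    "M01 ENABLE odds=1.000 No HTML nor TEXT parts",
]
checks = [c for c in map(parse_check_line, sample) if c is not None]
enabled_ids = [c.check_id for c in checks if c.enabled]
```

Lines that do not fit the assumed layout are skipped rather than guessed at, so section markers or comments in the real file would not break the listing.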
If you want to modify the list of unwanted things used by some Oracle checks ( CHARSET | BAD-EXPR | BOUNDARY | MAILER | HTML-TAGS ), you may edit the ze-oradata file :
HTML-TAGS odds=1.66 <script[^<>]*>
HTML-TAGS odds=1.40 <script[^<>]+src=[^<>]+>
HTML-TAGS odds=1.45 <span[^<>]*>
BAD-EXPR odds=20.88 http[s]?://[^ /#]*#[0-9a-f]
BAD-EXPR odds=1.00 http[s]?://[^ /&]*&#[0-9]{1,3}
BAD-EXPR odds=1.03 http[s]?://[^ /@>\\n]*@
BAD-EXPR odds=6.92 http[s]?://[^ /]*[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}
BAD-EXPR odds=3.91 http[s]?://[^>\n\r *]+\\*http[s]?://
CHARSET odds=13.00 ^big5$
CHARSET odds=9.00 ^euc-kr$
CHARSET odds=4519.00 ^gb2312$
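Before editing these entries by hand, it can help to read them into a structure and try a pattern against a sample string. The sketch below assumes each entry is `CATEGORY odds=<number> <regular expression>`, as in the lines above, and that lines starting with `#` are comments. `OraDataEntry` and `parse_oradata_line` are hypothetical helper names. Python's `re` module is used only for illustration; ze-filter's own regex engine may interpret some patterns differently.

```python
import re
from typing import NamedTuple, Optional


class OraDataEntry(NamedTuple):
    category: str   # e.g. "HTML-TAGS", "BAD-EXPR", "CHARSET"
    odds: float
    pattern: str    # the regular expression, kept verbatim


# Layout assumed from the sample entries: CATEGORY odds=<number> <pattern>.
_ENTRY_RE = re.compile(
    r"^(?P<category>[A-Z][A-Z-]*)\s+"
    r"odds=(?P<odds>[0-9]+(?:\.[0-9]+)?)\s+"
    r"(?P<pattern>\S.*)$"
)


def parse_oradata_line(line: str) -> Optional[OraDataEntry]:
    """Parse one entry; return None for blank, comment, or unrecognised lines."""
    line = line.strip()
    if not line or line.startswith("#"):  # '#' as comment marker is an assumption
        return None
    m = _ENTRY_RE.match(line)
    if m is None:
        return None
    return OraDataEntry(m["category"], float(m["odds"]), m["pattern"])


entry = parse_oradata_line("HTML-TAGS odds=1.66 <script[^<>]*>")
if entry is not None:
    # Illustration only: try the pattern against a sample HTML fragment.
    hit = re.search(entry.pattern, '<script type="text/javascript">') is not None
```

How the odds of several matching entries are combined into a final decision is not described here, so the sketch stops at parsing and matching.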
Odds ??? What are odds ???
In probability theory and statistics, the odds in favour of an event or a proposition are the quantity p / (1 − p), where p is the probability of the event or proposition. In other words, an event with m to n odds in favour has probability m / (m + n). For example, if you choose a random day of the week, the odds that you choose a Sunday are 1/6, not 1/7. These 'odds' are actually relative probabilities.
- Example 1 : if you have 100 messages and the word viagra appears in 75 of them, the odds for viagra are 75/25, that is, 3.
- Example 2 : Odds, as used in ze-filter configuration files, are a ratio of conditional probabilities. Suppose you have 200 hams and 100 spams, and the word viagra appears in 90 spams and in 4 hams. The conditional odds are then : (90/100) / (4/200) → 45.
OBS :
- If the odds value is 1, that means that the event is neutral !!! I'm sure you've noticed this very interesting and important property of odds.
- If the odds value is < 1, that means that the event is more frequent in hams than in spams
- If the odds value is > 1, that means that the event is more frequent in spams than in hams
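The two examples above can be checked with a short calculation. This is a minimal sketch of the arithmetic only; the function names `simple_odds` and `conditional_odds` are hypothetical and are not part of ze-filter.

```python
def simple_odds(hits: int, total: int) -> float:
    """Odds in favour of an event seen `hits` times out of `total`: p / (1 - p)."""
    if not 0 <= hits < total:
        raise ValueError("need 0 <= hits < total")
    # p / (1 - p) with p = hits / total simplifies to hits / (total - hits).
    return hits / (total - hits)


def conditional_odds(spam_hits: int, spam_total: int,
                     ham_hits: int, ham_total: int) -> float:
    """Ratio of conditional frequencies: (spam_hits/spam_total) / (ham_hits/ham_total)."""
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("corpus sizes must be positive")
    if ham_hits <= 0:
        raise ValueError("event never seen in ham: ratio is undefined")
    # Same ratio, rearranged to a single division.
    return (spam_hits * ham_total) / (spam_total * ham_hits)


example_1 = simple_odds(75, 100)              # Example 1: 75 / 25
example_2 = conditional_odds(90, 100, 4, 200) # Example 2: (90/100) / (4/200)
```

A value computed this way reads exactly as in the list above: 1 is neutral, below 1 leans towards ham, above 1 leans towards spam. An event never seen in ham has no finite ratio, which is why the sketch raises an error rather than guessing a value.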
Debugging
What's triggering the Oracle
/var/log/ze-filter shows the tests that were run when checking a mail, which is useful if something gets rejected. You will find the reason here :
Mar 4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M02 text/html without text/plain ( 0.2)
Mar 4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M13 RFC2822 headers compliance ( 1.0)
Mar 4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - H06 HTML tag/text ratio ( 0.5)
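When many messages are involved, the ORACLE lines can be pulled out of the log with a small script. The sketch below assumes the layout of the three sample lines above: an identifier, the word ORACLE, a check ID, a description, and a number in parentheses. `OracleHit`, `parse_oracle_log_line` and the field names are hypothetical, and the meaning of the number in parentheses is not specified here, so it is kept as a plain `value`.

```python
import re
from typing import NamedTuple, Optional


class OracleHit(NamedTuple):
    msg_id: str        # token just before "ORACLE", e.g. "47CD740E.001" (name is an assumption)
    check_id: str      # e.g. "M02", "H06"
    description: str
    value: float       # number shown in parentheses at the end of the line


_HIT_RE = re.compile(
    r"(?P<msg_id>\S+)\s+ORACLE\s+-\s+"
    r"(?P<check>[A-Z][0-9]{2})\s+"
    r"(?P<desc>.*?)\s*"
    r"\(\s*(?P<value>[0-9]+(?:\.[0-9]+)?)\s*\)\s*$"
)


def parse_oracle_log_line(line: str) -> Optional[OracleHit]:
    """Extract the ORACLE fields from one log line, or None if it is not one."""
    m = _HIT_RE.search(line)
    if m is None:
        return None
    return OracleHit(m["msg_id"], m["check"], m["desc"], float(m["value"]))


line = ("Mar 4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] "
        "47CD740E.001 ORACLE - M02 text/html without text/plain ( 0.2)")
hit = parse_oracle_log_line(line)
```

Grouping the resulting records by `msg_id` gives the list of checks reported for one message.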
How to see how ze-filter is interpreting the Oracle configuration tables
$ ze-filter -t oradata
$ ze-filter -t oracle-checks