Heuristic filter ORACLE

Introduction

From “The Merriam-Webster Dictionary”

oracle : one held to give divinely inspired answers or revelations

ze-filter's oracle is a set of tests for weak spam indicators. Some examples are :

  • the message contains a text/html part, but no text/plain part
  • the message contains text in many colours
  • the subject is entirely in capital letters (HI CAPS)
  • the mailer is one usually found in spam
  • the envelope From is the null sender (<>), but the header sender isn't postmaster, MAILER-DAEMON, …
  • there are two subject headers
  • there are some HTML tags mostly found in spam

Heuristics may include, but are not limited to, searching for regular expressions in some parts of messages.
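To make a few of the indicators listed above concrete, here is a minimal Python sketch built on the standard email module. It is only an illustration and not ze-filter's own implementation; the function names and the sample message are hypothetical.

```python
from email import message_from_string


def html_without_plain(msg):
    """True if the message has a text/html part but no text/plain part."""
    types = {part.get_content_type() for part in msg.walk()}
    return "text/html" in types and "text/plain" not in types


def subject_all_caps(msg):
    """True if the first Subject header has letters and all are upper case."""
    subject = msg.get("Subject", "")
    letters = [c for c in subject if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def duplicate_subject(msg):
    """True if the message carries more than one Subject header."""
    return len(msg.get_all("Subject", [])) > 1


# Hypothetical sample message, made up for this illustration.
RAW = (
    "From: someone@example.com\n"
    "Subject: BUY NOW\n"
    "Subject: second subject\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>hello</body></html>\n"
)

msg = message_from_string(RAW)
checks = [html_without_plain, subject_all_caps, duplicate_subject]
hits = [check.__name__ for check in checks if check(msg)]
print(hits)
```

Each function answers a single yes/no question, which matches the idea of weak indicators: none of them is meant to decide alone whether a message is spam.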

The main goal of these heuristics isn't to detect spam on their own, since they are only weak spam indicators. The heuristic filter isn't a main filtering method, but it can help to confirm the results of the two main filtering methods : the Bayes filter and URL filtering.

The number of tests is small : fewer than 40 at present. Only really relevant checks are integrated into the oracle.

There are 4 check categories in ORACLE.

  1. CONN - Checks in this category are related to the SMTP connection/session
  2. MSGS - Checks done on the message as a whole : headers, …
  3. HTML - Checks on the text/html MIME part
  4. PLAIN - Checks on the text/plain MIME part

Configuration - Beginner users

Just enable it !

# SPAM_ORACLE
#     Do heuristic filtering
#  Syntax : -----
#     VALUES :  NO  YES 
SPAM_ORACLE                        YES

If you want to use RBLs with the Oracle, take a look at the “Expert users” section.

Configuration - Expert users

ze-filter's oracle uses two configuration files :

  • /etc/ze-filter/ze-tables - this file is used to enable/disable each Oracle test and assign odds to them.
  • /etc/ze-filter/ze-oradata - this file is used to define unwanted things and to assign odds to them. Unwanted things may be one of :
    • HTML-TAGS
    • BAD-EXPR
    • CHARSET
    • BOUNDARY
    • MAILER

The names of these files will probably change in the future, and both files will be merged into a single XML-like file.

To change the names of these files, edit the ze-filter.cf file :

# ORACLE_DATA_FILE
#     Some oracle definitions
#  Syntax : -----
ORACLE_DATA_FILE                   ze-oradata
 
# ORACLE_SCORES_FILE
#     Oracle scores
#  Syntax : -----
ORACLE_SCORES_FILE                 ze-tables

How to change original Oracle checks

If you want to enable/disable tests or change their odds, edit the ze-tables configuration file :

C05   DISABLE      odds=1.000      SMTP client sending mail to spamtrap
C06   DISABLE      odds=1.000      Bad EHLO parameter
C07   DISABLE      odds=1.000      Myself EHLO parameter - forged
M01   ENABLE       odds=1.000      No HTML nor TEXT parts

If you want to modify the list of unwanted things used by some Oracle checks ( HTML-TAGS | BAD-EXPR | CHARSET | BOUNDARY | MAILER ), edit the ze-oradata file :

HTML-TAGS  odds=1.66    <script[^<>]*>
HTML-TAGS  odds=1.40   <script[^<>]+src=[^<>]+>
HTML-TAGS  odds=1.45   <span[^<>]*>
 
BAD-EXPR   odds=20.88  http[s]?://[^ /#]*#[0-9a-f]
BAD-EXPR   odds=1.00   http[s]?://[^ /&]*&#[0-9]{1,3}
BAD-EXPR   odds=1.03   http[s]?://[^ /@>\\n]*@
BAD-EXPR   odds=6.92   http[s]?://[^ /]*[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}
BAD-EXPR   odds=3.91   http[s]?://[^>\n\r *]+\\*http[s]?://
 
CHARSET    odds=13.00    ^big5$
CHARSET    odds=9.00     ^euc-kr$
CHARSET    odds=4519.00  ^gb2312$
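Before adding a new expression, it can be useful to try it against sample text. The Python sketch below parses a few of the entries above and reports which ones match. It assumes that each line is "KIND odds=<value> <regex>" and that matching is case-insensitive; both points are assumptions, and Python's re module may not behave exactly like the regex engine used by ze-filter. The sample inputs are made up.

```python
import re


def parse_oradata_line(line):
    """Split a line into (kind, odds, pattern); layout assumed from the sample."""
    kind, odds_field, pattern = line.split(None, 2)
    return kind, float(odds_field.partition("=")[2]), pattern.strip()


SAMPLE = r"""
HTML-TAGS  odds=1.66    <script[^<>]*>
HTML-TAGS  odds=1.45   <span[^<>]*>
BAD-EXPR   odds=6.92   http[s]?://[^ /]*[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}
CHARSET    odds=13.00    ^big5$
"""

entries = [parse_oradata_line(line) for line in SAMPLE.splitlines() if line.strip()]


def matching_entries(kind, text):
    """Return (odds, pattern) for every entry of this kind that matches text."""
    return [
        (odds, pattern)
        for entry_kind, odds, pattern in entries
        # Case-insensitive matching is an assumption of this sketch.
        if entry_kind == kind and re.search(pattern, text, re.IGNORECASE)
    ]


print(matching_entries("HTML-TAGS", '<p><SPAN style="color:red">hi</span></p>'))
print(matching_entries("BAD-EXPR", "see http://192.0.2.10/offer"))
print(matching_entries("CHARSET", "big5"))
```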

Odds ??? What are odds ???

From Wikipedia :

In probability theory and statistics the odds in favour of an event or a proposition are the quantity p / (1 − p) , where p is the probability of the event or proposition. In other words, an event with m to n odds against would have probability n/(m + n). For example, if you chose a random day of the week, then the odds that you would choose a Sunday would be 1/6, not 1/7. These 'odds' are actually relative probabilities.

  • Example 1 : if you have 100 messages and the word viagra appears in 75 of them, the odds of viagra are 75/25, that is, 3.
  • Example 2 : Odds, as used in ze-filter configuration files, are a ratio of conditional probabilities. Suppose you have 200 hams and 100 spams, and the word viagra appears in 90 spams and in 4 hams. The conditional odds are then : (90/100) / (4/200) = 45.

OBS :

  • If the odds value is 1, that means that the event is neutral !!! I'm sure you've noticed this very interesting and important property of odds.
  • If the odds value is < 1, that means that the event is more frequent in hams than in spams
  • If the odds value is > 1, that means that the event is more frequent in spams than in hams
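The arithmetic of both examples, and the interpretation rules above, can be checked with a short Python sketch. The function names are hypothetical, and the computation follows only the definitions given in this section.

```python
def odds_from_counts(hits, total):
    """Plain odds: p / (1 - p), written with counts as hits / (total - hits)."""
    if not 0 <= hits < total:
        raise ValueError("need 0 <= hits < total")
    return hits / (total - hits)


def conditional_odds(spam_hits, spam_total, ham_hits, ham_total):
    """Ratio (spam_hits/spam_total) / (ham_hits/ham_total).

    Cross-multiplied so that integer counts give an exact result.
    """
    if ham_hits == 0:
        raise ValueError("event never seen in ham: ratio is undefined")
    return (spam_hits * ham_total) / (ham_hits * spam_total)


def interpret(odds):
    """Apply the three rules: 1 is neutral, < 1 leans ham, > 1 leans spam."""
    if odds == 1:
        return "neutral"
    return "more frequent in spams" if odds > 1 else "more frequent in hams"


print(odds_from_counts(75, 100))           # Example 1 -> 3.0
print(conditional_odds(90, 100, 4, 200))   # Example 2 -> 45.0
print(interpret(45.0))
print(interpret(0.2))
```

The Sunday example from the quotation works the same way: odds_from_counts(1, 7) gives 1/6.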

Debugging

What's triggering the Oracle

/var/log/ze-filter shows the tests that were triggered when checking a mail. This is useful if something gets rejected : you will find the reason there.

Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M02 text/html without text/plain (   0.2)
Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M13 RFC2822 headers compliance (   1.0)
Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - H06 HTML tag/text ratio (   0.5)
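To collect these entries from a larger log, a small Python sketch such as the following may help. The regular expression is derived only from the three sample lines above, so other layouts may not match. The field names are hypothetical, "value" is simply the number shown in parentheses, and the P prefix for PLAIN checks is an assumption, since no such check appears in the sample.

```python
import re

# Pattern assumed from the sample lines: <id> ORACLE - <check> <text> (<number>)
LINE_RE = re.compile(
    r"(?P<ident>\S+) ORACLE - (?P<check>[A-Z]\d+) (?P<desc>.*?) "
    r"\(\s*(?P<value>[0-9.]+)\)\s*$"
)

# C, M and H prefixes appear in this page; P for PLAIN is an assumption.
CATEGORIES = {"C": "CONN", "M": "MSGS", "H": "HTML", "P": "PLAIN"}


def parse_oracle_line(line):
    """Return a dict for an ORACLE log line, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    check = m.group("check")
    return {
        "ident": m.group("ident"),
        "check": check,
        "category": CATEGORIES.get(check[0], "?"),
        "description": m.group("desc"),
        "value": float(m.group("value")),
    }


LOG = """\
Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M02 text/html without text/plain (   0.2)
Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M13 RFC2822 headers compliance (   1.0)
Mar  4 17:08:46 mx0 ze-filter[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - H06 HTML tag/text ratio (   0.5)
"""

records = [r for r in map(parse_oracle_line, LOG.splitlines()) if r]
for r in records:
    print(r["ident"], r["check"], r["category"], r["value"], r["description"])
```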

How to see how ze-filter is interpreting the Oracle configuration tables

$ ze-filter -t oradata
$ ze-filter -t oracle-checks
doc/spam/heuristic_filter.txt · Last modified: 2018/02/09 16:59 by 127.0.0.1
CC Attribution-Noncommercial-Share Alike 4.0 International