Wednesday, May 26, 2010

Bogofilter Anti-Spam "how to" (Bayesian spam filter)

Bogofilter is a Bayesian spam filter that indexes two mailboxes, one of known good mail and one of known spam, and uses them to classify incoming mail.
Bogofilter is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. The statistical technique is known as the Bayesian technique, and its use for spam was described by Paul Graham in his article A Plan For Spam in August 2002. Gary Robinson, in his web log Rants (September 2002), suggested some refinements for improved discrimination between spam and ham. Bogofilter's primary algorithm uses the f(w) parameter and the Fisher inverse chi-square technique that he describes. Paul Graham's later article Better Bayesian Filtering (January 2003) suggests some useful parsing improvements.
Bogofilter is run by an MDA script to classify an incoming message as spam or ham (using word lists stored by BerkeleyDB). Bogofilter provides processing for plain text and HTML. It supports multipart MIME messages, decodes base64, quoted-printable, and uuencoded text, and ignores attachments such as images. (Source: Bogofilter.Sourceforge)


Getting Started

The first thing to check is whether bogofilter is installed and in your path. Run "which bogofilter"; if it prints nothing, install a package or build bogofilter from source. Once it is installed, make sure it is in the path of every user on the system who will use it.
Secondly, take some time to put all of the mail that you know is spam into a separate mailbox. This mailbox will be named "SPAM" for our example. Then put all of the mail you know is good, the mail you want to receive in the future, into another mailbox. We will use "archive" for good mail. If you have other good mail stored in another place, we will use the mailbox "saved" for it in our example.
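The installation check from the first step can be sketched as a small shell snippet. The helper name have_cmd is made up for this example and is not part of bogofilter; it only wraps the standard "command -v" lookup.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper name, not part of bogofilter.
# It succeeds if the named command can be found in the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd bogofilter; then
    echo "bogofilter found at: $(command -v bogofilter)"
else
    echo "bogofilter not found in PATH; install a package or build it from source"
fi
```

Run it as each user who will filter mail, since their path may differ from yours.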


Running the job from cron

The easiest way to use bogofilter is to run a cron job every 15 minutes or so with the following commands. You can put them on one line or, as below, split them across lines with backslashes for easier reading. Nothing may follow a backslash on its line, not even a space.
rm -f /home/username/.bogofilter/wordlist.db; \
  bogofilter -s < ~username/Mail/SPAM;     \
  bogofilter -n < ~username/Mail/archive;  \
  bogofilter -n < ~username/Mail/saved
These lines remove the wordlist.db database that bogofilter makes and rebuild it from scratch. The "-s" argument is for mailboxes that contain known samples of spam you have received. The "-n" argument is for non-spam, the good mail you have. In our example, all mail in the SPAM mailbox is registered as spam, and all mail in "archive" and "saved" as non-spam, the mail we want to receive.


How does it work?

The idea is that you can edit your mail folders, moving mail from one box to another, without worrying about the database containing false positives or false negatives, because the database is always remade when your cron job runs. You may find this method a lot easier to explain to any users you have to support than trying to get them to update the bogofilter database to remove misclassified mail.


Setting up the mda

The last task is to set up your mail delivery agent (mda) to filter out the mail bogofilter marked as spam. All mail filtered through bogofilter will have an X-header called "X-Bogosity" with a spam rating. Procmail is our mda of choice and you can find detailed examples of procmail (.procmailrc) on the main page /
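To see what an mda rule has to match on, here is a small shell sketch that pulls the rating out of the "X-Bogosity" header of a message. The function name bogosity_of is made up for this example. It also assumes the header value starts with a word such as "Spam" or "Ham" followed by a comma, which may differ between bogofilter versions and configurations, so check a real filtered message first.

```shell
#!/bin/sh
# bogosity_of is a hypothetical helper name, not part of bogofilter.
# It reads one message on stdin and prints the first word of the
# X-Bogosity header. Only the header block is searched: the scan stops
# at the first blank line, so text in the body is never matched.
# Assumption: the header looks like "X-Bogosity: Spam, tests=bogofilter, ...".
bogosity_of() {
    awk '
        /^$/ { exit }
        /^X-Bogosity:/ {
            sub(/^X-Bogosity:[ \t]*/, "")   # drop the header name
            sub(/,.*/, "")                  # drop everything after the rating
            print
            exit
        }
    '
}

# Sample message written inline for the demonstration.
printf 'Subject: hi\nX-Bogosity: Ham, tests=bogofilter\n\nbody\n' | bogosity_of   # prints "Ham"
```

A procmail recipe would match the same header line with a regular expression and deliver the message to the SPAM mailbox.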
