Bayes filtering with SpamAssassin and amavis

Alastair Grant | Saturday 14 December 2019

I've received a few pieces of spam recently, which has mildly irritated me because I used to have pretty accurate spam filtering in place. I make use of SpamAssassin and amavisd-new (catchy name), which for years have been faultless, but I recently rebuilt my mail server and the same configuration doesn't seem to be working.

SpamAssassin

SpamAssassin is a fairly well-established spam filtering utility that can be run on your MTA (mail transfer agent) to deal with spam as it arrives at your server or mailbox.  It makes use of various techniques to identify spam, but frankly I find there are two highly effective features: external DNS blacklists and Bayes filtering.

DNS blacklists are simply lists of mail servers known to be sending spam. They can be queried through the very lightweight and fast DNS protocol to establish whether a mail server is listed or not.  Using them with SpamAssassin means you can give them a weighting instead of simply blocking listed servers, which is all you can usually do at the SMTP level. This way you're less likely to block legitimate mail that is leaving through a mail server that's having a bad day.
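As a rough illustration of how such a lookup works: the query is an ordinary DNS name made of the sending server's IPv4 octets in reverse order, followed by the list's zone. The sketch below only builds that name. The function name dnsbl_query_name and the zone dnsbl.example.org are placeholders of mine, not anything SpamAssassin provides.

```shell
# Build the DNS name that a blacklist lookup queries for an IPv4 address:
# the octets are reversed and the list's zone is appended.
dnsbl_query_name() {
    ip=$1
    zone=$2
    # Split the address on dots and print the octets in reverse order.
    reversed=$(printf '%s\n' "$ip" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }')
    printf '%s.%s\n' "$reversed" "$zone"
}

# 192.0.2.1 is a documentation address; dnsbl.example.org is a placeholder zone.
dnsbl_query_name 192.0.2.1 dnsbl.example.org
```

Looking up an A record for the resulting name (with host or dig, say) against a real list is the actual check: by convention an answer in the 127.0.0.0/8 range means the server is listed, and a "no such name" response means it isn't.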

The Bayes functionality, by contrast, is a Bayesian estimation algorithm that is trained on your own mail data. If you have a junk mail folder, it can learn the difference between spam and ham (SpamAssassin's term for valid mail).

Amavis

amavisd-new is a fairly comprehensive mail filter that can run all sorts of checks on email: spam, anti-virus and so on.  It is largely a wrapper around other plugins and tools. For spam filtering, for instance, you need SpamAssassin installed, as that is what it calls.

Frankly, I'm not entirely sure what purpose it serves, as all these things can be run directly by your mail server through "milters" (just as amavisd itself is called).  I use it because it's a fairly common way of handling these functions, and now that it's set up and working I haven't had any need to question it.

It seems to serve as a wrapper around other utilities, and can push its own config onto them.  Whilst this might seem like a convenient way of managing many utilities, I find it just obfuscates things.  The configuration you set up for each utility might not be how it actually runs under amavis, or it might be; it depends on something.  And the config file itself is probably one of the better examples of Linux config mess: a large unstructured text file that is a nightmare to parse, scattered with less-than-helpful comments.

Bayes filtering

And it was this messy config file I was staring at blankly while trying to figure out why my spam was no longer getting Bayes hits in SpamAssassin.  Provided you have figured out how to get Amavis to log the rules that are hit, you should see something like this in your mail headers:

X-Spam-Status: No, score=-4.199 tagged_above=-999 required=5.2
	tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3,
	SPF_HELO_NONE=0.001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no

This is working: the BAYES_ rules have run and deemed this example to be ham (along with other rules).  But mine wasn't working.
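If you want to check a saved message quickly, something along these lines pulls out the Bayes rule that fired. It is only a sketch: bayes_rule is a name I've made up, and it naively takes the first BAYES_nn string it finds, so feed it headers rather than a whole message body.

```shell
# Print the first BAYES_nn rule name found on stdin, or nothing if there is none.
bayes_rule() {
    grep -o 'BAYES_[0-9][0-9]*' | head -n 1
}

# The first two lines of the header shown above, fed in as a test.
printf '%s\n' \
    'X-Spam-Status: No, score=-4.199 tagged_above=-999 required=5.2' \
    '	tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3,' \
    | bayes_rule
```

For the header above this prints BAYES_00. The number in the rule name reflects the estimated spam probability, so BAYES_00 is the most ham-like result and BAYES_99 the most spam-like. No output at all is the symptom I was seeing: Bayes never scored the message.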

It took much fiddling around to finally figure out that the Bayes database was not being populated.  For Bayes to work, you need to train it on your mailbox data, which is done with the sa-learn command (with either the --ham or --spam argument).  On my previous build this worked fine, but on the current build it wasn't updating the database.

I'm not sure what has changed, but it was now necessary for me to specify the location of the Bayes database used by amavis, instead of relying on the default location (in your home directory).  The fix is as simple as adding --dbpath /var/spool/amavis/.spamassassin to the sa-learn command.

It does mean, though, that you'll probably have to run this as root to access that directory.
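Put together, the training commands look roughly like the output of the sketch below. The helper learn_cmd and the Maildir paths are my own placeholders; only sa-learn, --spam, --ham and the --dbpath value come from the text above. It prints the commands rather than running them, so you can check them before pasting them into a root shell.

```shell
# Bayes database used by amavis (path as given in the text above).
DBPATH=/var/spool/amavis/.spamassassin

# Print the sa-learn command for one class of mail.
# $1 is "spam" or "ham"; $2 is a directory of messages.
learn_cmd() {
    case $1 in
        spam|ham) ;;
        *) echo "usage: learn_cmd spam|ham DIR" >&2; return 1 ;;
    esac
    printf 'sa-learn --dbpath %s --%s %s\n' "$DBPATH" "$1" "$2"
}

# Hypothetical Maildir locations; substitute your own.
learn_cmd spam /home/example/Maildir/.Junk/cur
learn_cmd ham  /home/example/Maildir/cur
```

If you do run the commands as root, it is probably worth checking afterwards that the files under that directory are still writable by the user amavisd runs as.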

You can test whether amavis has the required 200 pieces of spam logged in the Bayes database by running sa-learn --dump magic (by default SpamAssassin wants at least 200 ham messages too).  But be warned: that has to run as the user that amavisd runs under, otherwise it'll use the wrong database.
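To save squinting at the dump, a small filter can pick out the two counters that matter. This is a sketch under assumptions: bayes_ready is my own name, and it relies on the dump printing lines ending in "non-token data: nspam" and "non-token data: nham" with the count in the third column, which is the layout I'd expect but worth confirming against your own output. The sample numbers are made up.

```shell
# Read `sa-learn --dump magic` output on stdin, print the nspam and nham
# counters, and succeed only if both reach the minimum (200 unless $1 says otherwise).
bayes_ready() {
    awk -v min="${1:-200}" '
        /non-token data: nspam/ { nspam = $3 + 0 }
        /non-token data: nham/  { nham  = $3 + 0 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}

# Illustrative input only; the counts are invented.
bayes_ready <<'EOF' || echo "not enough mail learned yet"
0.000          0          3          0  non-token data: bayes db version
0.000          0         57          0  non-token data: nspam
0.000          0       1342          0  non-token data: nham
EOF
```

In real use you would pipe the dump into it from a shell running as the amavisd user: sa-learn --dump magic | bayes_ready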

Troubleshooting amavisd

To get to the above solution I shut down amavisd-new, set $log_level = 5 in the /etc/amavisd.conf file, switched into a shell for the amavisd user (su vscan -s /bin/bash) and started it up manually with /usr/sbin/amavisd debug-sa

This spews masses of data about the startup of SpamAssassin in amavisd-new, and takes a long time to read through, but there will be a section on the Bayes plugin loading and hopefully some hints there.  In mine I spotted that it didn't have enough spam profiled, along with the directory of the database, which led me to the above solution.
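One way to cut the reading down is to keep a full copy of the output and only look at the lines that mention Bayes. A minimal sketch, assuming the interesting lines actually contain the word "bayes" in some capitalisation; bayes_lines and the log path are placeholders of mine.

```shell
# Keep only lines mentioning bayes (case-insensitive), prefixed with their
# line numbers so they can be found again in the full log.
bayes_lines() {
    grep -n -i 'bayes'
}

# Intended use, from the amavisd user's shell (log path is hypothetical):
#   /usr/sbin/amavisd debug-sa 2>&1 | tee /tmp/amavisd-debug.log | bayes_lines
```

The tee keeps the complete output on disk, so once a line number catches your eye you can open the log and read the surrounding context.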
