| Saturday, November 15, 2003 |
|---|
Spam Fighting
Update 2004-04-25: Enterprise Spam Filtering
I started filtering spam about half a year ago. Since then I've been tuning my setup. I think I've reached the point where I can say that spam isn't a problem for me anymore. Here's what my setup looks like:
SpamAssassin version 2.60, invoked via this simple procmail rule:

    :0fw
    | spamassassin

On my Linux box all I had to do was put the above lines in my $HOME/.procmailrc. Depending on your system you may need to invoke procmail using $HOME/.forward.

The next rule puts all emails tagged by SpamAssassin into the spam folder:

    :0:
    * ^X-Spam-Flag: YES
    $HOME/Mail/spam
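To make the second rule's behavior concrete, here is a minimal Python sketch of the same decision: look at the X-Spam-Flag header and pick a folder. It is only an illustration of what the procmail condition tests, not a replacement for procmail; the name `choose_folder` and the `Mail/inbox` default are assumptions made up for this example.

```python
from email import message_from_string

SPAM_FOLDER = "Mail/spam"   # mirrors $HOME/Mail/spam from the procmail rule
INBOX = "Mail/inbox"        # hypothetical default mailbox, not from the post


def choose_folder(raw_message: str) -> str:
    """Return the folder a message would be filed into.

    Like the procmail condition `* ^X-Spam-Flag: YES`, the check is
    case-insensitive and only looks at the start of the header value.
    """
    msg = message_from_string(raw_message)
    flag = msg.get("X-Spam-Flag", "")
    if flag.strip().upper().startswith("YES"):
        return SPAM_FOLDER
    return INBOX
```

A message that SpamAssassin did not tag has no `X-Spam-Flag: YES` header, so it falls through to the default mailbox, just as it would fall through the procmail rule.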
I'm using pretty much the default SpamAssassin configuration. The only two changes I made are described at the end of this post. The default configuration means, among other things, that required_hits is set to 5. That means that all emails that score five or more SpamAssassin hits are considered spam. Typically, very innocent messages get large negative scores.
DNS-based IP address spam list sbl.spamhaus.org. I had some bad experiences with other blocklists that seem to shoot first and ask questions later. Can you believe it: there are blocklists that allow anyone to blocklist anyone else in the world without any check whatsoever. Of course I had problems with false positives. Innocent people were getting nasty bounces accusing them of spamming. Spamhaus is manually maintained and I haven't had any problems with it so far. The following lines in /etc/sendmail.cf did the trick:

    R$* $: $&{client_addr}
    R$-.$-.$-.$- $: <?> $(ednsbl $4.$3.$2.$1.sbl.spamhaus.org. $: OK $)
    R<?>OK $: OKSOFAR
    R<?>$+<TMP> $: TMPOK
    R<?>$+ $#error $@ 5.7.1 $: "550 Your mailserver spammed me, see http://www.abuse.net/sbl.phtml?IP="$&{client_addr}

Thanks Carsten for setting it up!
Now we can enjoy looking at lines like this in /var/log/mail:
    Nov 15 11:39:45 p15135922 sendmail[21367]: ruleset=check_relay, arg1=oon10.onetime-offers.net, arg2=127.0.0.2, relay=oon10.onetime-offers.net [206.162.135.59] (may be forged), reject=550 5.7.1 Your mailserver spammed me, see http://www.abuse.net/sbl.phtml?IP=206.162.135.59
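The sendmail rule builds the lookup name from `$4.$3.$2.$1`, that is, the client's IP address with its octets reversed and the blocklist zone appended. A small Python sketch of that step follows, using the address from the log line above. The function names are made up for this illustration, and `is_listed` assumes a working resolver that is allowed to query the blocklist.

```python
import socket


def dnsbl_query_name(ip: str, zone: str = "sbl.spamhaus.org") -> str:
    """Build the DNS name a blocklist lookup asks for: reversed octets + zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str = "sbl.spamhaus.org") -> bool:
    """Return True if the blocklist answers for this address.

    Assumption for this sketch: any successful answer counts as "listed",
    and a failed lookup counts as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For the relay in the log, `dnsbl_query_name("206.162.135.59")` gives `59.135.162.206.sbl.spamhaus.org`; the `arg2=127.0.0.2` in the log is the answer the blocklist returned for that name.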
Bayesian filtering (see Paul Graham's A Plan for Spam for an intro to Bayesian filtering). I'm using the filter built into SpamAssassin. A nice thing about it is autolearning: messages beyond certain thresholds are automatically fed into the Bayesian learning engine. I'm using the default setup that came with SpamAssassin 2.6. That is, I didn't need to do anything. It has the following in /usr/share/spamassassin/10_misc.cf:

    bayes_auto_learn_threshold_nonspam 0.1
    bayes_auto_learn_threshold_spam 12.0

This means that messages with more than 12 SpamAssassin hits will automatically be learned by the Bayesian engine as spam, whereas messages with less than 0.1 hits will be learned as nonspam (also called ham in spam fighting lingo).
The best thing about Bayesian filtering is that really bad spams actually help you fight spam! Every spam with a score above 12 teaches my Bayesian dog new tricks, which can then be used to detect spams that would otherwise not exceed the threshold.
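The two autolearn thresholds split the score range into three bands. The sketch below just restates that rule in Python with the values quoted in this post. It is a simplification: the real SpamAssassin autolearner applies further conditions, and the handling of scores exactly on a threshold here is an assumption.

```python
AUTOLEARN_HAM = 0.1    # bayes_auto_learn_threshold_nonspam
AUTOLEARN_SPAM = 12.0  # bayes_auto_learn_threshold_spam


def autolearn_action(score: float) -> str:
    """Say what happens to a message with this SpamAssassin score."""
    if score > AUTOLEARN_SPAM:
        return "autolearn as spam"
    if score < AUTOLEARN_HAM:
        return "autolearn as ham"
    # Everything in between is not learned automatically.
    return "not autolearned"
```

With the default required_hits of 5, a message scoring 7.71 lands in the spam folder but is not autolearned, while one scoring 17.74 is both filed as spam and fed to the Bayesian engine.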
This still leaves all the spams with scores between 5 and 12 out of the Bayesian engine, as well as all hams with scores between 0.1 and 5. Currently I am feeding these into the Bayesian engine manually:
After enough spams have accumulated in the spam mailbox I run this script.
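The script itself is not reproduced in this post. As a rough idea of what manual feeding can look like, here is a hypothetical Python sketch that builds an `sa-learn` command for a whole mbox file. This is not the author's script. The `--spam`, `--ham` and `--mbox` options are taken from the sa-learn documentation and should be checked against the installed version; the function names and the example path are invented.

```python
import subprocess
from typing import List


def sa_learn_command(mbox_path: str, kind: str) -> List[str]:
    """Build the sa-learn command line for one mbox file.

    kind is "spam" or "ham"; anything else is rejected so a typo cannot
    train the Bayesian database the wrong way round.
    """
    if kind not in ("spam", "ham"):
        raise ValueError(f"kind must be 'spam' or 'ham', got {kind!r}")
    return ["sa-learn", f"--{kind}", "--mbox", mbox_path]


def learn(mbox_path: str, kind: str) -> None:
    """Run sa-learn; raises CalledProcessError if it exits with an error."""
    subprocess.run(sa_learn_command(mbox_path, kind), check=True)
```

It would be called as `learn("/home/me/Mail/spam", "spam")`, but only after the spam folder has been checked by hand, since anything in it gets learned as spam.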
OK, so all emails that SpamAssassin thinks are spam end up in $HOME/Mail/spam. What about false positives? How can I be sure that an important email didn't get misclassified as spam? Well, I check the spam folder once a week. For most messages it is enough to quickly glance at the Subject line and sender. Very few actually need to be opened to be sure they are spam. But that doesn't make any sense, does it? If I have to check all the emails manually anyway, what's the use of an automatic spam filter? Well, the spam filter still has the huge benefit that it prevents distraction. Reading and answering email is a very different activity from filtering spam, and "You have new mail" is finally a happy message again.
Still, as I get hundreds of spam messages every week, looking at all of these emails is really boring. Worse yet, after inspecting the subject lines of the first 50 or so messages I get impatient and speed up. That leaves an uneasy feeling that some legit emails might have slipped through.
That's why the following final piece of my setup is absolutely critical:
Sorting messages in the spam folder by SpamAssassin score. That way I can look at the messages with lower scores first. Those are much more likely to be legit mail falsely classified as spam. I achieved this by adding the following two lines to my $HOME/.spamassassin/user_prefs:

    rewrite_subject 1
    subject_tag _HITS_ |

This adds the score assigned by SpamAssassin to the beginning of the subject of every message that gets written to $HOME/Mail/spam. This way my spam folder looks like this when sorted by Subject in mutt:

     1 + Nov 15 Herma Amstutz   ( 110) 07.71 | Some facts you may find use
     2 F Nov 14 To bdolicki@bra (  72) 10.57 |
     3 + Nov 13 Kenny Villanuev ( 205) 12.66 | Delete your Internet tracks
     4 + Nov 15 Admin           (  68) 12.66 | Re: did you do it yet?
     5 + Nov 14 Front office    (  70) 13.40 | bishop
     6 + Nov 14 AmandaBear43@ne ( 108) 13.57 | Hiya, My name is Jennifer...
     7 + Jun 11 Buford Riley    ( 105) 14.19 | Add length to your Member
     8 + May 05 Postmaster      (  71) 14.19 | Re: login info
     9 + Nov 15 Michael Callao  (  76) 14.46 | ..,HGH Seal, Weight Loss, Fi
    10 + Oct 06 Jaime Larkin    ( 106) 14.69 | Hey My girl Bought me the pa
    11 + Nov 14 Bradly Rhoades  ( 240) 14.83 | Next day shipping on your me
    12 + Nov 15 Chester Trevino (  80) 14.94 | Did you ever know?
    13 + Nov 14 Matthew Brewer  (  80) 14.94 | I have the cure.
    14 + Nov 14 Willie Horn     ( 242) 15.25 | Why waist your time at docto
    15 + Nov 14 Antoine Dubois  ( 130) 15.54 | Is V i a g r a Right For Y
    16 + Nov 14 Carey Rose      (  86) 16.15 | You will add inches with thi
    17 + Nov 13 Cliff C. Lay    ( 114) 16.80 | Hi!
    18 + Nov 14 Kelsey G. Camer (  86) 17.10 | Hey I bought my Man the Patc
    19 + Nov 14 Crystal Helms   (  92) 17.37 | Hi
    20 + Nov 13 Dillon Osborn   (  91) 17.74 | The Penis Enlargement Patch

So, I start looking at messages slowly and carefully, but by the time I reach those with a score over 15 or 20 I know I can safely speed up. It is really impossible for a legit email to get a SpamAssassin report like this:

    1.0 FROM_ENDS_IN_NUMS      From: ends in numbers
    4.1 SUBJ_VIAGRA            Subject includes "viagra"
    4.2 DATE_SPAMWARE_Y2K      Date header uses unusual Y2K formatting
    0.3 ORDER_NOW              BODY: Encourages you to waste no time in ordering
    4.1 VIAGRA                 BODY: Plugs Viagra
    0.1 HTML_60_70             BODY: Message is 60% to 70% HTML
    0.1 HTML_FONTCOLOR_BLUE    BODY: HTML font color is blue
    0.1 HTML_MESSAGE           BODY: HTML included in message
    0.3 HTML_FONT_BIG          BODY: HTML has a big font
    5.4 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                               [score: 1.0000]
    0.3 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
    0.1 HTML_FONTCOLOR_RED     BODY: HTML font color is red
    0.6 MIME_HTML_NO_CHARSET   RAW: Message text in HTML without charset
    3.8 USERPASS               URI: URL contains username and (optional) password
    1.9 DATE_IN_FUTURE_03_06   Date: is 3 to 6 hours after Received: date
    1.1 RCVD_IN_SORBS_HTTP     RBL: SORBS: sender is open HTTP proxy server
                               [24.128.81.88 listed in dnsbl.sorbs.net]
    1.2 RCVD_IN_SORBS_MISC     RBL: SORBS: sender is open proxy server
                               [24.128.81.88 listed in dnsbl.sorbs.net]
    0.1 RCVD_IN_SORBS          RBL: SORBS: sender is listed in SORBS
                               [24.128.81.88 listed in dnsbl.sorbs.net]
    0.1 RCVD_IN_NJABL          RBL: Received via a relay in dnsbl.njabl.org
                               [24.128.81.88 listed in dnsbl.njabl.org]
    1.2 RCVD_IN_NJABL_SPAM     RBL: NJABL: sender is confirmed spam source
                               [24.128.81.88 listed in dnsbl.njabl.org]
    0.7 RCVD_IN_DSBL           RBL: Received via a relay in list.dsbl.org
                               [<http://dsbl.org/listing?ip=24.128.81.88>]
    1.5 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                               [Blocked - see <http://www.spamcop.net/bl.shtml?24.128.81.88>]
    2.6 RCVD_IN_DYNABLOCK      RBL: Sent directly from dynamic IP address
                               [Dynamic/Residential IP range listed by]
                               [easynet.nl DynaBlock - <http://dynablock.easynet.nl/errors.html>]
    1.6 MISSING_MIMEOLE        Message has X-MSMail-Priority, but no X-MimeOLE
    1.0 FORGED_OUTLOOK_HTML    Outlook can't send HTML message only
    1.2 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
    1.0 FORGED_OUTLOOK_TAGS    Outlook can't send HTML in this format
    0.1 CLICK_BELOW            Asks you to click below
    1.1 MIME_HTML_ONLY_MULTI   Multipart message only has text/html MIME parts
    2.6 FORGED_MUA_OUTLOOK     Forged mail pretending to be from MS Outlook
    4.2 OBFUSCATING_COMMENT    HTML comments which obfuscate text

This sorting by SpamAssassin score has even led me to consider lowering the threshold from 5 to 4 or even 3. Spam filtering is not a binary choice anymore. I've noticed, for example, that legit emails with scores between 4 and 5 are all messages from some lame mailing lists or (theoretically solicited) bulk email. In fact, I've noticed a strong correlation between the uselessness of a message and its SpamAssassin score :-). I'm wondering whether, instead of spam filtering, I should simply sort all my mail by SpamAssassin score and look at those with the lowest score first! After all, when was the last time you received a useful email that had only text/html MIME parts? :-)
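Sorting by Subject works in mutt because the score sits at the very start of the subject, in the "NN.NN | subject" shape seen in the folder listing. For anyone who wants the same ordering outside a mail client, here is a small Python sketch that pulls the score back out of such subjects and sorts on it. The function names are invented for this example, and the pattern assumes exactly that subject shape.

```python
import re
from typing import List, Optional, Tuple

# Matches subjects such as "07.71 | Some facts you may find use".
SCORE_PREFIX = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\|\s*(.*)$")


def split_score(subject: str) -> Optional[Tuple[float, str]]:
    """Return (score, original subject), or None if there is no score tag."""
    match = SCORE_PREFIX.match(subject)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def sort_by_score(subjects: List[str]) -> List[str]:
    """Sort tagged subjects by score, lowest (most likely legit) first.

    Subjects without a score tag are dropped from the result.
    """
    tagged = [s for s in subjects if split_score(s) is not None]
    return sorted(tagged, key=lambda s: split_score(s)[0])
```

Sorting on the parsed number rather than on the raw subject text also keeps working for scores that are not zero-padded or that are negative, which matters if all mail, and not just the spam folder, is ordered this way.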
Posted by Branimir Dolicki at 12:02 |