Email spam is universally loathed. It’s difficult to prevent entirely, not only because spammers have a wealth of techniques at their disposal, but because so many legitimate mailers are misconfigured or routinely behave like spammers. The best approaches combine multiple defenses, each aimed at a different spam technique. I’ll outline what works and what doesn’t, and hopefully provide some insight into how spammers work, including some of the sleazier techniques I’ve encountered.
There’s a spectrum of spam, from the terrifically illegitimate to the “legitimate,” where a semi-reputable company adds you to their mailing list because of something you ordered (perhaps you left a default box checked that said “I want to receive marketing material via email”). On the illegitimate side are usually commercial operations dedicated to spam, often using zombie farms of compromised machines to send out vast volumes. They often use sophisticated techniques to avoid content filters (like sending vast amounts of legitimate-sounding gibberish). Eliminating the maximum amount of spam requires a multi-layered approach, which I’ve outlined here for mail administrators:
Layer 1: Blacklists and Server Verification
Spam blacklists are simply wonderful at eliminating most spam from bot-farms and sleazy operators. A blacklist is queried with a DNS lookup, so you can verify that an IP address is not listed before you accept email from it. False positives are nearly zero for the good lists; every now and then somebody like AOL makes it onto a blacklist, but this doesn’t happen with the best ones. A review of our mail servers’ statistics shows that sbl-xbl.spamhaus.org is solely responsible for rejecting over 95% of attempts to spam our server. This capability is provided by milter-dnsbl.
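To make the DNS lookup concrete, here is a minimal sketch of a DNSBL check for IPv4. The function names and the injectable `resolve` parameter are my own for illustration, not anything milter-dnsbl exposes. The convention is to reverse the octets of the IP, append the list's zone, and look the name up: an answer means "listed," a failed lookup means "not listed."

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the zone answers for this IP, False if the lookup fails.

    `resolve` is injectable (hypothetical parameter) so the logic can be
    exercised without touching the network.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


# Example: is_listed("192.0.2.99", "sbl-xbl.spamhaus.org") would query
# 99.2.0.192.sbl-xbl.spamhaus.org
```

A real deployment would probably also inspect the returned 127.0.0.x address, since lists use it to encode why an IP is listed, and would treat DNS timeouts separately from "not listed." This sketch does neither.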
Server verification covers the other 5%. In a nutshell, this verifies that the IP address that a system provides during the MTA phase of negotiations is legitimate. Over time, we’ve encountered a few mailers that, for whatever reason, have run afoul of this filter, either from misconfiguration, or from perversely sending email from an unresolvable address. I can’t think of a legitimate reason why anybody would feel the need to use unresolvable addresses to send mail; in cases where I’ve pursued this, it’s generally been the fault of a bumbling administrator or IT department. Every time I’m tempted to relax this requirement, I look at the volumes of spam eliminated and think, hey, if you can’t configure your own mailer properly, maybe nobody should accept mail from you. This capability is provided by spamilter.
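One common way to implement this kind of check is forward-confirmed reverse DNS: look up the hostname for the connecting IP, then confirm that the hostname resolves back to the same IP. Whether spamilter does exactly this is an assumption on my part; the sketch below only illustrates the idea, and the function name and injectable `reverse`/`forward` parameters are hypothetical.

```python
import socket


def forward_confirmed_rdns(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the confirmed hostname for `ip`, or None if verification fails.

    `reverse` and `forward` default to the standard library resolvers and
    are injectable (hypothetical parameters) for testing.
    """
    try:
        # Step 1: reverse lookup, IP -> hostname
        hostname, _aliases, _addrs = reverse(ip)
        # Step 2: forward lookup, hostname -> list of IPs
        _name, _aliases, addresses = forward(hostname)
    except (socket.herror, socket.gaierror):
        return None  # unresolvable in one direction or the other
    # Step 3: the original IP must be among the forward results
    return hostname if ip in addresses else None
```

A mailer that fails this check is either sending from an address with no reverse DNS at all or from one whose forward and reverse records disagree, which is the misconfiguration described above.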
Layer 2: Greylists
After making it past the blacklist, the next thing encountered by a would-be mailer is the greylist. To put it succinctly, a greylist is a way of telling certain mailers, “try again later.” Legitimate mailers will do exactly that, while a lot of spam farms give up confusedly. For others, it gives them enough time to be placed on a blacklist next time they make the attempt. A greylist works by tracking the IP (and often, origin email) of the mailer that is contacting you. Next time that same mailer contacts you, if enough time has expired, it’s allowed through.
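The tracking described above can be sketched in a few lines. This is an in-memory illustration only, not how milter-greylist is implemented; the class name, the 300-second default delay, and the injectable `clock` are all assumptions chosen for the example.

```python
import time


class Greylist:
    """Remember when each (ip, sender) pair was first seen."""

    def __init__(self, min_delay=300.0, clock=time.time):
        self.min_delay = min_delay  # seconds a mailer must wait (assumed value)
        self.clock = clock          # injectable so tests can control time
        self.first_seen = {}

    def check(self, ip, sender):
        """Return 'accept' or 'tempfail' for a delivery attempt."""
        key = (ip, sender.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"  # i.e. answer with a 4xx "try again later"
```

Because the decision depends on time elapsed since the first attempt, a bot that retries immediately is still turned away; only a mailer that comes back after the waiting period gets through. A real greylist would also expire old entries and persist its table across restarts.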
The tricky part about greylists is coping with the behavior of some mailers, particularly big ones. Those that adhere to SPF are easy: most greylists will happily let SPF-compliant mailers right through. For the rest, most greylist implementations have a “whitelist” of mailers that respond poorly to the technique, either because they send from a different IP address every time (and therefore never satisfy the waiting period) or because of known issues where mailers get confused or do not retry for a very long time.
Another side effect is that legitimate mail can, and will, be delayed. A particularly effective way to limit this is to greylist only email from origins outside your country. In my case, skipping the greylist for US-origin addresses interferes with as little mail as possible, and most of the spam comes from non-US computers. This capability is provided by milter-greylist.
Layer 3: Content Filtering
Hopefully, most spam is eliminated before we get this far, because no matter how sophisticated content filtering gets, it can be problematic to consistently separate spam from (for example) messages from a family member who spells poorly and has questions for you about Viagara.
So the first thing to do is make another run through the blacklists. While this may seem redundant, it picks up blacklisted IP addresses that are relaying through somewhere else. A common spam technique is to create an email forwarding address for you on a service like bigfoot (I see a lot of these) and then spam that address, which merrily forwards all the spam to you, thus effectively skipping the blacklist unless you scan through all the headers, too. This capability is provided by spamilter.
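Scanning the headers means pulling the relay IPs out of the Received lines so each can be checked against a blacklist. Here is a rough sketch using Python's standard email parser; the function name and the regular expression are my own, it handles IPv4 only, and it is not a description of what spamilter does internally.

```python
import re
from email import message_from_string

# Matches a bracketed IPv4 literal such as [192.0.2.1], the form that
# commonly appears in Received headers.
_BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def relay_ips(raw_message):
    """Return the unique bracketed IPv4 addresses found in Received
    headers, in the order they appear (most recent hop first)."""
    msg = message_from_string(raw_message)
    ips = []
    for received in msg.get_all("Received", []):
        for ip in _BRACKETED_IPV4.findall(received):
            if ip not in ips:
                ips.append(ip)
    return ips
```

Each address returned can then be fed to whatever blacklist lookup you already use. Keep in mind that Received lines added before the mail reached a server you trust can be forged, so a hit further down the chain is a signal rather than proof.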
The next thing to do is eliminate the obvious: mangled email. While spammers make an effort to make their mail look legitimate, invalid or duplicated headers can result from spam being relayed through security holes in web sites. Spammers generally can’t see the results, nor do they care. Relatedly, it’s a common technique for spammers to add multiple headers of the same type on purpose, violating most specifications but often bypassing content filters that expect mail to be well-formed, or to pump through headers designed to exploit loopholes in clients or to overload mail servers. Since legitimate mail clients don’t produce broken mail, getting rid of broken mail causes no harm. This capability is provided by mimedefang.
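As an illustration of what "mangled" can mean in practice, the sketch below flags messages that repeat a header RFC 5322 allows only once, or that lack the two headers it requires (From and Date). The function name and the exact wording of the reasons are hypothetical, and this is not a claim about which checks mimedefang itself performs.

```python
from email import message_from_string

# Header fields that RFC 5322 permits at most once per message.
SINGLETON_HEADERS = (
    "From", "Sender", "Reply-To", "To", "Cc", "Bcc",
    "Date", "Subject", "Message-ID", "In-Reply-To", "References",
)


def malformed_reasons(raw_message):
    """Return a list of reasons the message looks broken (empty if none)."""
    msg = message_from_string(raw_message)
    reasons = []
    for name in SINGLETON_HEADERS:
        count = len(msg.get_all(name, []))
        if count > 1:
            reasons.append(f"duplicate {name} header ({count} copies)")
    # RFC 5322 requires exactly one From and one Date header.
    for required in ("From", "Date"):
        if msg[required] is None:
            reasons.append(f"missing {required} header")
    return reasons
```

A filter could reject outright when the list is non-empty, or simply add to the message's spam score and let a later layer decide.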
The next capability is filtering the content itself using a number of heuristic techniques that have been tuned over time, using capabilities provided by Spamassassin. Spamassassin does quite a good job, although sophisticated spammers will regularly test their spam content against its rules. Therefore, a good practice is to update its rules regularly using sa-update.
It’s also worth eliminating virus spam at this point. clamav provides this capability handily. As with spamassassin, it’s most effective when updated regularly.
Layer 4: Sieve Rules
At this point, there is still a potential for false positives, and some things are going to slip through. Therefore, content filters normally just flag email. Sieve rules are a hierarchy of rules that determine how to treat email. So legitimate mail can be saved from the junk filter, and persistent spammers can be shuttled over to the Junk folder. These are normally in the hands of end users, but general rules can be effective site-wide.
Layer 5: Don’t REPLY
This is true on a number of levels, the first being that a mailer should summarily reject all mail that’s not addressed to legitimate users, rather than accept it and then attempt to bounce it back. There’s a whole class of spam known as “bounce spam,” where the forged return address belongs to the actual victim, and the spammer sends email through a legitimate mailer to an invalid address. The mailer happily sends it “back,” which actually delivers the spam to the victim. There’s no benefit to ever automatically emailing the return address at the mailer level, either to inform an end user that they’ve typoed an email address (rejection serves that purpose adequately) or to inform an end user that they’ve sent a virus; the return address is almost never the originator.
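The difference between rejecting and bouncing comes down to when the decision is made. A minimal sketch of the reject-at-SMTP-time approach follows, assuming the server has a set of valid, lower-cased recipient addresses to consult; the function name and reply texts are illustrative only.

```python
def rcpt_response(recipient, valid_recipients):
    """Return the (code, text) SMTP reply for a RCPT TO command.

    `valid_recipients` is a hypothetical set of lower-cased addresses
    this server accepts mail for.
    """
    if recipient.lower() in valid_recipients:
        return 250, "OK"
    # Refuse during the SMTP dialogue. The message is never accepted,
    # so this server never generates a bounce to a possibly forged address.
    return 550, "5.1.1 No such user"
```

With a 550 at RCPT time, a real sender's own mail server reports the failure to them, while a spammer with a forged return address simply gets refused and nothing is sent to the victim.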
This also extends to the end user. For legitimate businesses, replying is usually an effective way to be removed from their mailing list — if you recognize the domain and have done business with them, there’s little risk. More sleazy operators, however, take the opportunity to add your legitimate email address to hundreds of other lists, even while nominally removing you from the list you’re presumably unsubscribing from. Your legitimate email address can now be sold to other spammers.
In a similar vein, it’s often a bad idea to click where it says “click here to be removed” for the same reasons. A particularly sleazy form of this actually takes you to a page covered with ads, and the unsubscribe box (filled in) in the middle. The spammer has now made money, because you’re a unique visitor to whom those ads have been displayed — even more if one catches your eye and you click on it.
Layer 6: Report Spam
Reporting spam has a number of benefits, the biggest being the overall reduction in spam. Spamcop is probably the best way to report spam: it sends email directly to the administrators of the systems involved, which are either misconfigured (open proxies or relays) or have the spammer as a customer. Spamcop does an excellent job of analyzing email headers and finding out who’s really responsible. Note that spammers will often include legitimate URLs in their spam, so it’s best to pay close attention to who the reports are being sent to and why.
Here are some additional thoughts:
Greylisting is still very good, but it is a dying technology because bots are starting to retry a second time! However, some argue that the workaround is to ignore retries until X seconds or minutes have gone by since the original attempt.
Reporting to SpamCop is great… and SpamCop is one of the two best DNSBLs (alongside Spamhaus’s Zen). But some spam “flies under the radar” of SpamCop and Zen. Much of this is the same spam you alluded to earlier when talking about “receive offers from others” checkboxes… THAT stuff is starting to get out of control… particularly when the data is sold to spammers… or when the original web site was run by an egregious spammer. (Some spammers set up legit-looking sites just for the purpose of growing lists… a single form post and your inbox is screwed for years to come!)
“sbl-xbl.spamhaus.org” is now a subset of the superior “zen.spamhaus.org”, but just make sure that either list is NOT used against all the IPs in the headers. They should *only* be run against the sender’s IP.
Overall… great blog post!