Filtering spam with procmail

This is a collection of procmail recipes which I use to pre-filter the incoming mail before letting SpamBayes take a crack at it. This combination seems to provide a fairly decent level of protection. Naturally, your mileage may vary.

procmail is a mail processor installed on most Linux systems and used by the mail server to deliver mail to your mailbox. If your mail is hosted on a Linux server, you can use procmail to remove spam and sort messages before they are placed into your mailbox. If, like me, you prefer to use pine in a shell window to check your e-mail, this type of filtering may be your only defense against spam.

In order to understand these recipes, at least some knowledge of procmail and regular expression syntax is required. The basics can be learned from the procmail man pages and from the links on the procmail home page. There is also a great procmail documentation project and a library of recipes.

Setting Up

procmail usually looks for its configuration in the file ~/.procmailrc. I keep all additional recipe resources in the directory ~/.procmail/ and a log in ~/logs/pmlog.

I do not delete most of the messages, because of potential false positives, and instead send them to different mailboxes. These mailboxes are in ~/mail/ and ~/mail/spam/.

So, some initial code: set up the log file, write a blank line to the log to separate this message from the previous one, and set up the default junk mailboxes. LOGABSTRACT set to "all" means that for every message procmail will log the subject, sender, time and target mailbox.


LOGFILE=$HOME/logs/pmlog
LOG="
"

JUNK      = $HOME/mail/spam/bulk
SBSPAM    = $HOME/mail/spam/sb_spam
SBSUSPECT = $HOME/mail/sb_suspect

LOGABSTRACT=all

Whitelist and Blacklist

Before proceeding with any analysis of the message, it may be a good idea to see if it is coming from one of the trusted people or domains and send it straight to the default mailbox, or, if it is coming from one of the known spam addresses, send it directly to the junk folder. So here we have the white list and the black list.

The following recipe runs formail to extract the From, Sender, Reply-To, Return-Path and Received headers and pipes them to egrep, which compares them to the contents of ~/.procmail/white.lst. If the value returned by egrep indicates that it found matches, the recipe sends the message to the default mailbox and exits.

When setting up white.lst, be sure it does not have any blank lines. If necessary, you can use regular expressions in the entries of the white list.
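For illustration, a white.lst might look something like this (the addresses below are made up) - one egrep pattern per line, no blank lines:

friend@example\.com
.*@example\.org
lists\.example\.net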

The backslash at the end of a line of procmail code means that the line is continued on the next line. It is not necessary to limit the length of recipe lines, but it makes long recipe conditions easier to read.

:0
* ? formail -c -xFrom -xSender -xReply-To -xReturn-Path -xReceived |    \
    sed "s/[[:space:]]for .*$//g" | egrep -is -f ~/.procmail/white.lst
{
    LOG="WHITELISTED
"
    :0:
    $DEFAULT
}

The black list works much the same as the white list. The recipe is a bit more complicated because I like to know which part of the message matched the blacklist. So instead of testing the return value of egrep, I capture the entire match in the $BLACKLISTED variable, check that it is not empty and then write it to the log file. Backticks (`) execute the command between them and return its output, same as in bash. $JUNK, as set up earlier, is my bulk mailbox. Variables have a dollar sign in front of them when their value is accessed and do not have it when they are assigned.

BLACKLISTED = `formail -xFrom -xSender -xReply-To -xReturn-Path -xReceived |    \
	egrep -i -f ~/.procmail/black.lst`

:0
* ! BLACKLISTED ?? ^^^^
{
	LOG = "BLACKLISTED: "
	LOG = $BLACKLISTED
	LOG = "
"
	:0:
	$JUNK
}

My blacklist is here. Most of the entries in it are outdated, but it costs egrep very little CPU time to run through the entire list, so I let it grow.

Trace Headers

Trace headers are part of every e-mail message. They include the Received: headers, which show the path of the message from the sender to the recipient, and the Return-Path: header, which specifies the return address of the sender. Analysis of these headers can offer some very useful information about the origins of the message and perhaps its content.

DNSBL

One of the best weapons against spam is a DNS Block List (DNSBL) such as Spamcop or Spamhaus. These lists match the IP address of the host relaying the message to your server against a database of known spam sources, open relays, hijacked machines and other known offenders. Querying these blocklists should be done at the mail server level, but if your administrator does not want to set it up, you can do it with procmail.

Here's the logic for such a recipe: extract the IP address of the relaying host from the Received: headers, reverse its octets, look the result up in the block list via DNS, and junk the message if the lookup returns an address in the 127.0.0.x range.

Here we go. This part extracts the IP of the sender. formail is used to extract the headers. grep extracts the line written by my mail server as it receives the message. Another call to grep excludes lines sent by my own server - I am not interested in checking my own server against a DNSBL. Finally the result is piped to sed, which extracts the IP address located between brackets.

After SENDERIP is assigned, check that it in fact contains an IP address. If not, unassign it. Note the ^^ around the regular expression: it is a procmail extension to regex syntax which anchors the match to the start and the end of the entire content. procmail regular expressions are multi-line - you can have $ or ^ in the middle of an expression.


SENDERIP = `formail -c -XReceived | grep "by benya.com" | grep -v "from benya.com" | \
     sed "s/^Received: from .*\[\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\].*by benya.com.*$/\1/"`

:0
* ! SENDERIP ?? ^^[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*^^
{
        SENDERIP =
}

Now, if the sender IP was extracted, we'll check it against sbl-xbl.spamhaus.org. The first step is to reverse the numbers in the IP address using sed. Then run host to query the DNS and extract the IP address from the result. If the resulting IP is in the desired range, write an entry to the log and trash the message.
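To see what the recipe is looking for, here is what a manual check would look like for a hypothetical sender address of 192.0.2.4. A listed host resolves to an address in the 127.0.0.x range (which the sed below extracts); an unlisted one returns no address:

host 4.2.0.192.sbl-xbl.spamhaus.org
4.2.0.192.sbl-xbl.spamhaus.org has address 127.0.0.2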


:0
* ! SENDERIP ?? ^^^^
{
        SENDER_REVERSED = `expr "$SENDERIP" | \
	sed "s/\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)/\4.\3.\2.\1/"`

        KNOWNOFFENDER = `host "$SENDER_REVERSED".sbl-xbl.spamhaus.org | \
                sed "s/^.*\(127\.0\.0\.[0-9]*\)$/\1/"`

        :0
        * KNOWNOFFENDER ?? ^^127.0.0.[0-9]*^^
        {
                LOG = "sender host "
                LOG = $SENDERIP
                LOG = " is a known source of spam
"
        
                :0:
                $JUNK
        }
}

Messages Received From Dynamic IPs

A large percentage of spam comes from the dynamic IPs of dial-up, cable or DSL users who were unlucky enough to have had their machines hijacked. Dial-up host lists, accessed the same way as a DNSBL, make it possible to find out whether an IP is dynamic. Not all dynamic IPs are reported by these lists, but enough are to warrant querying them. One such list is dul.dnsbl.sorbs.net.

The recipe is very similar to the one used to query spamhaus. Since we already know SENDERIP and SENDER_REVERSED, we can skip some of it.


:0
* ! SENDERIP ?? ^^^^
{
	SENDER_DYNAMIC = `host "$SENDER_REVERSED".dul.dnsbl.sorbs.net | \
	sed "s/^.*\(127\.0\.0\.[0-9]*\)$/\1/"`

	:0
	* ! SENDER_DYNAMIC ?? ^^127.0.0.[0-9]*^^
	{
		SENDER_DYNAMIC =
	}
}

Now SENDERIP contains the IP of the server relaying the message, and SENDER_DYNAMIC is non-empty if SENDERIP is dynamic. What can we do with this information?

A compromised host can sometimes send legitimate mail, and dynamic hosts are not all hijacked. I am willing to block all mail coming from known sources of spam, but since I run my own server on a dynamic IP, I don't want to block all messages coming from a similarly configured host.

The following recipe will match the IP address of a dynamic host with the return address of the sender. If the sender's address host resolves to the same IP, I will not trash the message. Otherwise, too bad. The logic here is that if you're running a mail server and not sending your mail through your ISP's static relay, the least you could do is sign your messages with your real address. My e-mails would pass this test.

The first step is to extract the host name from the return address.


:0
* ^Return-path.*@\/[-a-zA-Z0-9_.]*
{
        RETURN_PATH_HOST = $MATCH
}

Now, query the DNS for the IP of RETURN_PATH_HOST and see if it matches SENDERIP. If it does, RESOLVED will not be empty.


:0
* ! SENDER_DYNAMIC ?? ^^^^
{
        RESOLVED = `host -a "$RETURN_PATH_HOST" | grep "$SENDERIP"`
        
        :0:
        * RESOLVED ?? ^^^^
        $JUNK
}

Fake trace headers

I've been getting some mail with Received headers of the form Received: from FAKED_PART ([ddd.ddd.ddd.ddd]), where FAKED_PART is either the IP address of my server or simply the word "benya", and ddd.ddd.ddd.ddd is the IP address of some other server. Obviously a fake trace header, so here I get rid of the message.

getip.sh is a little script that gets the IP address of whatever interface I pass in as a parameter. You may already have a getip command on your system. If not, here is what's in the script:

/sbin/ifconfig $1 | grep "inet addr" | sed "s/^.*addr:\([0-9]*[.][0-9]*[.][0-9]*[.][0-9]*\).*$/\\1/g"

eth0 in my case is the card connected to the outside network. This recipe could probably be done without calling formail and egrep, but I tested the concept in the shell and didn't want to bother rewriting it.

:0
* ? formail -xReceived | egrep -is \
    'from (benya[^.]|'`/home/benya/bin/getip.sh eth0`').*\[[0-9][0-9]*\.[0-9][0-9]*'
{
	LOG="Fake trace header
"
	:0:
	$JUNK
}

Hooking up SpamBayes

If the spam message got this far, it'll have to be filtered out based on the contents of the subject or message body. SpamBayes is perfect for this job. The following is a variation on the recipe suggested by the developers.


:0 fw:hamlock
| /usr/bin/sb_filter.py


:0
* ^X-SpamBayes-Classification: \/.*
{
        SBCLASS = $MATCH

        :0
        * SBCLASS ?? ^spam
        {
                LOG="Trashing : X-SpamBayes-Classification = $MATCH
"

               :0:
                $SBSPAM
        }

        :0
        * SBCLASS ?? ^unsure
        {
                LOG="Suspect : X-SpamBayes-Classification = $MATCH
"

                :0:
                $SBSUSPECT
        }
}

Now just make sure to sort out the contents of the sb_suspect mailbox before training the filter.
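If you keep your sorted ham and spam in mboxes, SpamBayes' sb_mboxtrain.py can train from them directly. A rough sketch from memory - check sb_mboxtrain.py --help for the exact switches on your version, point -d at whatever database your sb_filter.py actually uses, and note that $HOME/mail/received is just a hypothetical mailbox of sorted ham:

sb_mboxtrain.py -d $HOME/.hammiedb -g $HOME/mail/received -s $HOME/mail/spam/sb_spam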

Unused recipes

My original filtering strategy suffered from the NIH syndrome. Eventually I was able to overcome it and hooked up SpamBayes, but not before writing a number of recipes that attempted to detect spam by analyzing the subjects and contents of the messages. It worked quite well, until the spammers learned not to include meaningful subjects and switched from various HTML-based techniques designed to fool the filters to pasting random passages of text into the bodies of the messages.

The following recipes are no longer in use.

Mismatched return address

Here's another test that can be performed on trace headers. If the return address of the sender is at one of the several major ISPs such as AOL or MSN, it may be safe to assume that the mail should go through the mail servers of those ISPs. I know, I know, you can send a message from somewhere else but have people reply to your @yahoo.com address. Well, for that there's the Reply-To header. Still, it is probably a good idea to send these messages to a "suspicious" folder rather than trash them outright. I did not get a single false positive with this filter in over a year, but I do not get much mail from these major ISPs, so who knows...

The list of included web mail servers and ISPs could be expanded, but here I am only interested in the addresses most often used as fake return addresses on spam messages.

RETURN_PATH_HOST was assigned earlier, in the dynamic IP recipe. msn.com mail sometimes comes from hotmail servers, so in case it's msn I allow either host to match. The recipe searches the Received: trace headers for a match of the regular expression contained in RETURN_PATH_HOST. I also make sure that it is not preceded by an equals sign, to avoid tricks such as helo=yahoo.com.


:0
* ^Return-path.*([^.]yahoo\.|@aol\.co|compuserve|@mail\.com|lycos|excite\.com|@usa\.net|hotmail|msn\.com)

{
	:0
	* RETURN_PATH_HOST ?? ^^msn.com^^
	{
		RETURN_PATH_HOST="(msn.com|hotmail.com)"
	}


	:0
	* $ ! Received.*[^=][ ]*${RETURN_PATH_HOST}
	{
		LOG="Sender and trace host mismatch : "
		LOG=$RETURN_PATH_HOST
		LOG="

"
		:0:
		$HOME/mail/spam/suspicious
	}
}

Miscellaneous Header Filters

Here is a simple but effective filter that can significantly reduce the amount of junk in your inbox: filter out any mail written in unwanted character sets, such as GB2312.


:0
* GB2312
{
	LOG="Found GB2312 in the header of the message. Trashing...
"
	:0:
	$JUNK
}

This type of filter can also be applied to the body of the message. The following recipe tries to find the charset GB2312 in a multipart message.


:0 B
* ^content-type.*(^.*)?charset=.*GB2312
{
	LOG="Found GB2312 in the body of the message. Trashing...
"
	:0:
	$JUNK
}

Sometimes it may be helpful to filter out mail that is sent to too many people at once, like chain letters. The following recipe counts the number of addressees @samplehost.com and trashes any message that is sent to 5 or more of them at once.

The recipe uses procmail's scoring mechanism. The match is achieved when the total is above 0, so I set the initial score to -4 and add 1 for each match.


:0
* -4 ^0
* ^to:\/.*
* 1 ^1 MATCH ?? @samplehost\.com
* ^cc:\/.*
* 1 ^1 MATCH ?? @samplehost\.com
{
	LOG="Message sent to 5 or more addresses @samplehost.com. Trashing.
"
	:0:
	$JUNK
}

Subject

Filtering messages based on the contents of the Subject header is a fairly straightforward business. There are only a couple of complications. One is the use of quoted-printable or base64 encoding, which makes the subject header virtually unreadable by procmail. This problem is fixed below. The other is the actual content: the list of prohibited words and combinations of words can be pretty long, and as it gets longer the probability of a false positive grows with it. Also, many spammers use variations of the words, substituting various characters or numbers for letters - for example, \/|@gr@ or c1al1s. The maintenance of this list takes too much time, so I prefer to leave it up to SpamBayes.

Decode Subject

The format of header encoding is defined in RFC 1522. It looks like this: =?charset?format?encoded text?=, where format is either q for quoted-printable or b for base64. Any part of the header can be encoded, and there may be more than one encoded part. The following recipes set two variables: $SUBJECT for the decoded subject and $STRIPPED_SUBJECT for the decoded subject stripped of anything but alphanumerics and blanks.
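For example, subject headers encoded like these (the first quoted-printable, the second base64):

Subject: =?iso-8859-1?q?this=20is=20some=20text?=
Subject: =?utf-8?b?aGVsbG8gd29ybGQ=?=

decode to "this is some text" and "hello world" respectively.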

I've devised two ways to decode a header. The first uses sed with shell command substitution and string expansion, the second uses perl.

The following is the shell way of decoding the subject. It cannot properly handle quotes or parentheses, so it gets rid of them; these characters are not very important for filtering spam. First, extract the header of the message as one line using the -c switch of formail. Then, if the subject contains encoded parts, run a shell one-liner to decode them.

SUBJECT = `formail -c -xSubject`

:0
* SUBJECT ?? =\?[^?]+\?[qb]\?[^?]+\?=
{
  SUBJECT = `eval expr \"$(expr "$SUBJECT" | \
    sed "s/=[?]\([^?]*\)[?]\([bq]\)[?]\([^?]*\)[?]=/\\\`echo \3 | \
    mimencode -u -\\\$(echo \2 | tr [:upper:] [:lower:])\\\`/Ig" | \
    sed "s/[\"'()]//g")\"`

}

Here's a brief approximation of how this script works. The expression between the outer \"$( and )\" executes first. It pipes the contents of $SUBJECT to sed, which finds every occurrence of "=?CHARSET?ENCODING?TEXT?=" and replaces it with a command to decode it: `echo TEXT | mimencode -u -$(echo ENCODING | tr [:upper:] [:lower:])`. mimencode requires a lower-case switch, so ENCODING is converted using tr. When the string is expanded, the generated commands are executed and their output is inserted in their place in the original string. The result is then assigned back to $SUBJECT.
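A worked example, with a made-up subject, may make this clearer. If the header arrives as

Subject: =?iso-8859-1?q?this=20is=20some=20text?=

then the sed pass rewrites $SUBJECT into roughly

`echo this=20is=20some=20text | mimencode -u -$(echo q | tr [:upper:] [:lower:])`

and the outer eval/expr expansion runs that command, leaving SUBJECT set to "this is some text".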

Another way to do this is with perl. Here's a more or less general perl script which can decode MIME headers:

#!/usr/bin/perl

while (<>) {
  chomp;
  s/^[ \t]//;
  foreach (split(/(=[?].*?[?][BQ][?].*?[?]=)/i)) {
    if (($charset, $encoding, $txt) = /=[?](.*?)[?]([bq])[?](.*?)[?]=/i) {
      $encoding =~ tr/[BQ]/[bq]/;
      open PIPE, "echo '$txt' | mimencode -u -$encoding |";
      $_ = <PIPE>;
      close PIPE;
      chomp;
      }
    print $_;
  }
}

I am sure that perl experts could make it shorter; I don't claim to be any good at it :). Save it as, for example, $HOME/bin/decode_header. And here is the recipe to use this script in our context. Note the absence of the -c switch - the perl script takes care of joining the lines.

SUBJECT= `formail -xSubject: | $HOME/bin/decode_header`

Finally, the last part of the decoding process. This recipe applies to both approaches.

STRIPPED_SUBJECT = `expr "$SUBJECT" | sed "s/[^ 0-9a-zA-Z]//g"`

This recipe strips the subject of any characters other than spaces and alphanumerics to make it easier to filter out things such as "V.I.A.G.R.A".
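A quick test in the shell, with a made-up subject, shows the effect:

expr "V.I.A.G.R.A for you" | sed "s/[^ 0-9a-zA-Z]//g"

prints "VIAGRA for you".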

Filter Subject

These filters are only examples. I have given up trying to use this as an effective way to block spam. But these recipes can also be used to move messages into different folders based on subject, and they illustrate some procmail techniques.

Define some character classes. These are useful for dealing with the numerous ways to spell "a" or "i" in "viagra".

A="a@àáâãäåÀÁÂÃÄÅ"
I="il1ìíîïÌÍÎÏ\|"

Scan $SUBJECT for viagra, cialis and vicodin:

:0
* $ SUBJECT ?? ()\/(([\\][/]|v)[$I][$A]gr[$A]|c[$I][$A]l[$I]s|v[$I]cod[$I]n)
{
        LOG="subject contains $MATCH - must be spam
"
        :0:
        $JUNK
}

Now a similar recipe for $STRIPPED_SUBJECT. Again, only a sample set of words.

:0
* $ STRIPPED_SUBJECT ?? ()\/(v[$I][$A]gr[$A]|c[$I][$A]l[$I]s|p0rn|g1rls|gangbang)
{
        LOG="subject contains $MATCH - must be spam
"
        :0:
        $JUNK
}

Blocking single words is easy, but it does not go very far. How about blocking "free meds" or "cheap prescriptions"? Here's a scoring recipe that filters out content based on the number of occurrences of suspicious words. First it initializes the count to -100. Then it adds 61 for each word from a list that allows no more than one of the listed words, and 41 for each word from a list that allows no more than two. So you can have virgin oil but not virgin teen, penis but not penis enlargement, free delivery but not free prescription online, and so on. This quickly gets very tricky and produces too many false positives.

:0
* -100 ^0
* 61 ^1 STRIPPED_SUBJECT ?? (penis|enlarge|enhance|virgin|teen)
* 41 ^1 STRIPPED_SUBJECT ?? (prescription|online|price|free|delivery|discount|drugs)
{
        LOG = "Subject scored $= on suspected words count. Trashing...
"
        :0:
        $JUNK
}

Filtering the contents of the message

The following set of recipes attempts to extract text/plain and text/html parts of the body, decode them and analyze their contents. Most of the filters are designed to recognize and block attempts at hiding spam, such as using html comments to break up suspicious words.

Extracting the text

Before loading parts of the body into variables, it is necessary to make sure procmail can handle big buffers. Hopefully 64k will be enough. So:

LINEBUF=65535

rc.decode

procmail syntax does not allow procedures, but it does allow including a recipe script any number of times, in effect making it a subroutine with global variables. I used this functionality here.

The following chunk of code takes input text in $DECODE_INPUT and the encoding ("base64" or "quoted") in $DECODE_ENCODING, and produces $DECODE_OUTPUT. It is stored in rc.decode and called as necessary. The E flag in the recipes makes them else clauses to the previous match. So this code reads like this: if base64, decode base64; else if quoted-printable, decode quoted-printable; else just copy input to output.

:0
* DECODE_ENCODING ?? base64
{
	DECODE_OUTPUT = `expr "$DECODE_INPUT" | mimencode -u`
}
:0 E
* DECODE_ENCODING ?? quoted
{
	DECODE_OUTPUT = `expr "$DECODE_INPUT" | mimencode -q -u`
}
:0 E
{
	DECODE_OUTPUT = $DECODE_INPUT
}

rc.extractpart

Of all the recipes involved in extracting the message body, rc.decode was probably the most straightforward. The whole thing should probably have been done in perl, but I don't know perl... So here we go. The next piece is the actual extraction procedure, stored in rc.extractpart. Given a chunk of text containing a multipart mail message in $EXTRACT_INPUT, the boundary marker of the desired part in $EXTRACT_BOUNDARY, and the content type of the desired part in $EXTRACT_CONTENTTYPE, it extracts the part from the input, decodes it if necessary using rc.decode, and returns it in $EXTRACT_OUTPUT.

First, prepare regular expressions for the start and end markers:

BOUND_START = `echo "$EXTRACT_BOUNDARY" | $HOME/bin/regexgen`
BOUND_END   = `echo "--$EXTRACT_BOUNDARY" | $HOME/bin/regexgen -n`

regexgen is a little utility which turns an input string into a procmail regular expression with escaped special characters. With the switch -n it returns a negative expression which, for a string like "abc", would be something like [^a]|a[^b]|ab[^c]. Since we need a multi-line match, regexgen actually returns ([^a]|$)|a([^b]|$)|ab([^c]|$). regexgen is written in Pascal. Here is the source and an i386 Linux binary; the source can be compiled with FreePascal, Borland Delphi or Kylix.
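Since the Pascal source is not reproduced here, a rough perl sketch of the same behavior (my approximation of what was just described, not the original utility) might look like this:

#!/usr/bin/perl
# sketch of a regexgen equivalent: reads one string on stdin and prints a
# procmail/egrep regex with special characters escaped; with -n it prints
# the "negative" expression ([^a]|$)|a([^b]|$)|ab([^c]|$) described above
my $negate = @ARGV && $ARGV[0] eq "-n";
my $s = <STDIN>;
chomp $s;

sub esc {
  my $c = shift;
  return ($c =~ /[][(){}.*+?^\$\\|]/) ? "\\$c" : $c;
}

my @chars = split //, $s;
if (!$negate) {
  print join("", map { esc($_) } @chars), "\n";
} else {
  my @alts;
  for my $i (0 .. $#chars) {
    my $prefix = join("", map { esc($_) } @chars[0 .. $i - 1]);
    my $c = $chars[$i];
    my $cc = ($c =~ /[]^\\]/) ? "\\$c" : $c;   # escaping inside [^...]
    push @alts, $prefix . "([^" . $cc . "]|\$)";
  }
  print join("|", @alts), "\n";
}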

Next, extract the part, starting from $BOUND_START, up to and including $BOUND_END. The parts of a multipart message start with the boundary marker alone on a line, followed by the Content-Type: header on the next line, so that's what we look for. The expression in $BOUND_END makes sure the match will not read past the next occurrence of $EXTRACT_BOUNDARY. Strictly speaking, a zero-width lookahead such as perl's (?!$BOUND_START) would have been more appropriate, but such syntax is not available in procmail. The match is saved in $EXTRACT_FULL. If there's no match, $EXTRACT_FULL is initialized to blank.

:0
* $ EXTRACT_INPUT ?? $BOUND_START^Content-Type: *$EXTRACT_CONTENTTYPE.*$\/($BOUND_END)*
{
	EXTRACT_FULL = $MATCH
}
:0 E
{
	EXTRACT_FULL
}

Next, if $EXTRACT_FULL contains some data, strip out the rest of the headers. The match starts on the first blank line and continues to the end. If there's no match, copy $EXTRACT_FULL into $EXTRACT_OUTPUT.

:0
* ! EXTRACT_FULL ?? ^^^^
* EXTRACT_FULL ?? (.+$)*^$\/(.*$)*
{
	EXTRACT_OUTPUT = $MATCH
}
:0 E
{
	EXTRACT_OUTPUT = $EXTRACT_FULL
}

Finally, if $EXTRACT_OUTPUT contains text encoded with base64 or quoted-printable encoding, decode it. The encoding is specified in the Content-Transfer-Encoding header. It was stripped out in the previous recipe, so we'll look for it in $EXTRACT_FULL. If the encoding is found, set up the variables and call rc.decode.

:0
* ! EXTRACT_FULL ?? ^^^^
* EXTRACT_CONTENTTYPE ?? text/
* EXTRACT_FULL ?? Content-Transfer-Encoding: *\/.*(base64|quoted-printable)
{
	#call decode
	DECODE_INPUT = $EXTRACT_OUTPUT
	DECODE_ENCODING = $MATCH
	INCLUDERC="$HOME/.procmail/rc.decode"
	EXTRACT_OUTPUT = $DECODE_OUTPUT
}

rc.bodytext

Now, put it all together. The recipes in rc.bodytext are a series of if-then-else clauses covering some of the more common content types. The solution is far from perfect as some of the types are not recognized. The most common combinations are covered and that's good enough for these filters.

The extracted contents are decoded if necessary. The output is returned in $BODY_PLAIN and $BODY_HTML.

First, see if there is any Content-Type declaration in the header. If not, the entire content is assigned to BODY_PLAIN.

:0
* ! ^Content-Type
{
	BODY_PLAIN = `formail -I ""`

}

Now, the same plain text but declared as such. Since it's declared, it can be encoded. Check and decode using rc.decode if necessary.

:0 E
* ^Content-Type: text/plain
{
	BODY_PLAIN = `formail -I ""`
	
	:0
	* ^Content-Transfer-Encoding: \/.*(base64|quoted-printable)
	{
		DECODE_ENCODING=$MATCH
		DECODE_INPUT=$BODY_PLAIN
		INCLUDERC="$HOME/.procmail/rc.decode"
		BODY_PLAIN=$DECODE_OUTPUT
	}
}

Same as above, only for text/html.

:0 E
* ^Content-Type: text/html 
{
	BODY_HTML = `formail -I ""`

	:0
	* ^Content-Transfer-Encoding: \/.*(base64|quoted-printable)
	{
		DECODE_ENCODING=$MATCH
		DECODE_INPUT=$BODY_HTML
		INCLUDERC="$HOME/.procmail/rc.decode"
		BODY_HTML=$DECODE_OUTPUT
	}
}

Now, the multipart types. The types of most interest are multipart/alternative, and multipart/related with a multipart/alternative subtype. These provide both plain text and HTML, and the HTML part often contains useful clues. Lumped into the following clause are multipart/alternative, multipart/mixed, and multipart/related that does not have a multipart/alternative subtype (marked by a type="..." header). Mixed is here because I don't know how, or don't care, to handle it properly. This is also a catch-all clause for related with subtypes that I didn't want to bother with.

This recipe uses rc.extractpart to extract the desired content from the body of the message. See the comments in the code for explanations.

:0 E
* ^Content-Type: multipart/(alternative|mixed|related)
* ! type="multipart/alternative"
{
	#get boundary

	#first try the boundary in quotes
	:0
	* boundary="\/[^"]+
	{
		EXTRACT_BOUNDARY=$MATCH
	}
	
	#if that failed, try without the quotes
	:0 E
	* boundary=\/.+
	{
		EXTRACT_BOUNDARY=$MATCH
	}

	#if boundary is found, try to extract parts

	:0 
	* ! EXTRACT_BOUNDARY ?? ^^^^
	{
		EXTRACT_INPUT = `formail -I ""`

		EXTRACT_CONTENTTYPE= "text/plain"
		INCLUDERC = "$HOME/.procmail/rc.extractpart"
		BODY_PLAIN = $EXTRACT_OUTPUT

		EXTRACT_CONTENTTYPE = "text/html"
		INCLUDERC = "$HOME/.procmail/rc.extractpart"
		BODY_HTML = $EXTRACT_OUTPUT


	}
}

The last recipe of the file handles multipart/related with a multipart/alternative subtype. It differs from the one above in that the extractor has to be called twice - first to get the multipart/alternative piece, then to get text/plain and text/html out of it. The logic for getting the boundary for the second extraction is also a little different - instead of getting it from a header, the recipe gets it from the content.


:0 E
* ^Content-Type: multipart/related
* type="multipart/alternative"
{
	# get the boundary, extract multipart/alternative subpart, then process it
	:0 
	* boundary="\/[^"]+
	{
		EXTRACT_BOUNDARY = $MATCH
	}
	:0 E
	* boundary=\/.+
	{
		EXTRACT_BOUNDARY = $MATCH
	}

	:0
	* ! EXTRACT_BOUNDARY ?? ^^^^
	{
		EXTRACT_INPUT = `formail -I ""`
		EXTRACT_CONTENTTYPE="multipart/alternative"
		INCLUDERC="$HOME/.procmail/rc.extractpart"
		
		EXTRACT_INPUT = $EXTRACT_OUTPUT
	}

	# if multi/alternative was successfully extracted, process it
	:0 
	* ! EXTRACT_BOUNDARY ?? ^^^^
	* ! EXTRACT_INPUT ?? ^^^^
	{

		#getting boundary is different from above
		#extractor stripped the boundary="..." line, so use other logic
		#boundary will be on the first non-blank line following --

		#clear old boundary
		EXTRACT_BOUNDARY

		:0
		* EXTRACT_INPUT ?? ^^(^)*--\/.*
		{
			EXTRACT_BOUNDARY = $MATCH
		}

		:0
		* ! EXTRACT_BOUNDARY ?? ^^^^
		{
			EXTRACT_CONTENTTYPE = "text/plain"
			INCLUDERC="$HOME/.procmail/rc.extractpart"
			BODY_PLAIN = $EXTRACT_OUTPUT

			EXTRACT_CONTENTTYPE = "text/html"
			INCLUDERC="$HOME/.procmail/rc.extractpart"
			BODY_HTML = $EXTRACT_OUTPUT
		}
	}
}

Finally, back in the main .procmailrc, here is the call to rc.bodytext:

INCLUDERC="$HOME/.procmail/rc.bodytext"

Filtering HTML

Now the payload. These filters are designed to catch the most obvious attempts to hide spam using HTML tags, not to look for keywords or statistics. The following recipes execute only if there is something in $BODY_HTML. Here's the start of the long recipe encapsulating the filters:

:0
* ! BODY_HTML ?? ^^^^ 
{

Too many comments

This recipe counts the number of HTML comments (text enclosed between <!-- and -->) and similar invalid markup starting with <!. This kind of markup is often used to break apart keywords that a spam filter would otherwise notice. The recipe allows no more than 25 valid or 10 invalid comments. Logging can be added as necessary - I removed it from the code below to avoid repetition.
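For example, spam of that era would carry markup like this (illustrative, not a real sample) - it renders as ordinary words but defeats naive keyword matching; the first line uses real comments, the second the invalid <!...> variety:

V<!-- xq7 -->IAG<!-- na2 -->RA
F<!st>R<!nd>E<!rd>E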

:0
*  -100 ^0
*  4 ^1 BODY_HTML ??  [<][!]--[^>]*--[>]
*  10 ^1 BODY_HTML ??  [<][!]([^\-]|-[^\-])
$JUNK

Invalid HTML tags

This recipe counts invalid tags, which can also be used to break up words. I've set up a list of valid HTML tags in .procmail/valid-html-tags. The recipe extracts tags from $BODY_HTML, one per line, and compares them with the contents of the tags file. INVALID_TAGS receives the list of all tags that did not match. Then the recipe counts the invalid tags and junks messages containing too many of them (more than eight, with the initial score of -8).

INVALID_TAGS = `expr "$BODY_HTML" | grep "<" | sed "s/</\n</g" | \
	grep -i "<[/]\?[a-z0-9]" | \
	sed "s/<\/\?\([a-z0-9][^ >=/]*\)\([^a-z0-9].*\)\?$/\1/gi" | \
	grep -vix -f $HOME/.procmail/valid-html-tags`

:0
* -8 ^0
*  1 ^1 INVALID_TAGS ?? ^.+$
$JUNK

Numeric character references

Numeric character references, in the form &#dd; for decimal or &#xHH; for hexadecimal, are another way for the spammer to hide from the spam filters. Not from this one! :) The limit is set at 40 references.

:0
* -40 ^0
* 1 ^1 BODY_HTML ?? &#x?[0-9A-F][0-9A-F]*;
$JUNK

Embedded images and hidden text

This filter tries to block a combination of embedded images (sent with the message in multipart/related and included in the HTML via img src="cid:...") and invisible text (hidden from view by setting a white font color) designed to fool spam filters. This recipe can produce plenty of false positives, but at the time the risk seemed acceptable. Stop after 5 occurrences of src="cid or color="#FFFFFF. This should also include the colors "#fff" and "white"...

:0
* -5 ^0
* BODY_HTML ?? color="#FFFFF
* BODY_HTML ?? src="cid:
* 1 ^1 BODY_HTML ?? color="#FFFFF
* 1 ^1 BODY_HTML ?? src="cid:
$JUNK

URL-encoded urls

This recipe filters out messages with URLs that are already URL-encoded. There's no reason for any URL in e-mail to have too many URL-encoded characters in a row, other than to hide a blacklisted site. Block all addresses with 5 or more URL-encoded characters in a row.

:0
* BODY_HTML ?? ()\/<a[^>]*href *= *"http://[^"]*%[0-9a-f][0-9a-f]+%[0-9a-f][0-9a-f]+%[0-9a-f][0-9a-f]+%[0-9a-f][0-9a-f]+%[0-9a-f][0-9a-f]+[^>]*
{
	LOG="Suspicious way of specifying URL: "
	LOG=$MATCH
	LOG=" -- trashing...
"
	:0:
	$JUNK
}

Empty Tags

This recipe counts empty tags (such as <a href=...></a> or <font></font>) and blocks any message with 6 or more of these. It includes a perl one-liner. Practically any empty tag could be used to break up words - <b></b>, <em></em>. For reasons I no longer recall, I limited this recipe to counting links and fonts, but it should really count many more.

EMPTY_TAGS = `expr "$BODY_HTML" | \
perl -e '$F = join ("", <>); $F =~ s/\n/ /g; while ($F =~ /(<([a-z]+) [^>]*><\/\2.*?>)/gi) {print $1 . "\n";}'`

:0
* -5 ^0
* 1 ^1 EMPTY_TAGS ?? ^<(A |FONT)
$JUNK

Finally, the end of the recipe started above:

}

There are many other things that HTML can be tested for. For example, to stop a sizeable chunk of phishing messages, a recipe would have to compare the text and the address of every link. If the text contains a URL and the domains of the real address and the one specified in the text do not match, the message is probably spam.
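I never wrote that recipe, but a sketch of the idea, in the same spirit as the other perl helpers here, might look like this (check_links is a hypothetical script name; it reads HTML on stdin and prints every link whose visible text names one domain while the href points to another):

#!/usr/bin/perl
# print "href-host vs text-host" for each link whose link text contains a
# url on a different domain than the one the href actually points to
my $html = join("", <>);
$html =~ s/\n/ /g;
while ($html =~ /<a\s[^>]*href\s*=\s*"https?:\/\/([^\/">]+)[^"]*"[^>]*>(.*?)<\/a>/gis) {
  my ($href_host, $text) = ($1, $2);
  next unless $text =~ /https?:\/\/([^\/\s"<>]+)/i;   # only links whose text looks like a url
  my $text_host = $1;
  my ($hd) = $href_host =~ /([^.]+\.[^.]+)$/;          # crude domain match: last two labels
  my ($td) = $text_host =~ /([^.]+\.[^.]+)$/;
  print "$href_host vs $text_host\n" if defined $hd && defined $td && lc($hd) ne lc($td);
}

Its output could then be wired into the big $BODY_HTML recipe the same way as the other counters - assign it to a variable and test for non-emptiness:

MISMATCHED = `expr "$BODY_HTML" | $HOME/bin/check_links`

:0
* ! MISMATCHED ?? ^^^^
{
	LOG="link text and link target do not match: $MISMATCHED
"
	:0:
	$HOME/mail/spam/suspicious
}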

Filtering all text

This filter is designed to block messages based on certain keywords, the same as the subject filter. First, the contents of $BODY_PLAIN and $BODY_HTML are lumped together:

BODY_ALL = `expr "$BODY_HTML""$BODY_PLAIN"`

Now, search $BODY_ALL for banned words or patterns and block messages that match:

:0
* BODY_ALL ?? ()\/(pillsbusiness|emailremovals|yesmail|email-publisher|result of your feedback form|WEEKLY STOCK (PICK|report))
{
	LOG="body matched "
	LOG=$MATCH
	LOG="... trashing
"
	:0:
	$JUNK
}

The list above is only a sample. It could be maintained externally, similar to the blacklist recipe.
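As a sketch of that, assuming a hypothetical keyword file ~/.procmail/body-keywords.lst maintained the same way as black.lst (and GNU grep's -o option to pull out the matching text), the recipe could become:

BODY_MATCHED = `expr "$BODY_ALL" | egrep -io -f $HOME/.procmail/body-keywords.lst`

:0
* ! BODY_MATCHED ?? ^^^^
{
	LOG="body matched $BODY_MATCHED ... trashing
"
	:0:
	$JUNK
}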

Useful recipes

The following recipes have nothing to do with spam. They are simply useful things you can do with procmail.

Send SMS

This recipe sends an SMS message to the user's phone or beeper whenever there's a new message from someone on the whitelist.

First, set up rc.sendsms. To decode the Subject: and From: headers, it uses the decode_header perl script from the decode subject recipe. SMS can only accept 160 characters, so the body of the message is cropped right after it is extracted from the message. The SMS system does the rest of the cropping, so this is just a bandwidth-saving measure. Once $SUBJECT, $FROM and $BOD are set up, the message is sent via sendmail to the e-mail address corresponding to the user's SMS. On Sprint PCS, it is the phone number @messaging.sprintpcs.com; on Verizon it is the number @vzwtext.com. I don't know about other providers - you'd have to find out from them.

FROM = `formail -xFrom: | $HOME/bin/decode_header`
SUBJECT= `formail -xSubject: | $HOME/bin/decode_header`
BOD = `formail -I "" | head -c 160`
SENT = `echo mail from $FROM about $SUBJECT: $BOD | \
	/usr/lib/sendmail 0123456789@messaging.sprintpcs.com`

Save this as rc.sendsms. Now, add the following to the whitelist recipe, right before the line containing LOG="WHITELISTED":

 INCLUDERC=".procmail/rc.sendsms"
