stochastic spam detection

5 Jun 2001

      Attached is a ~/.procmailrc fragment that evaluates how the headers
of an e-mail message are constructed to estimate the probability that
the message is spam.

The numbers were derived by analyzing two large e-mail repositories:
one containing only spam, the other containing only non-spam. Let A be
the number of messages in the spam archive that have a certain header
characteristic divided by the total number of messages in the spam
archive, and B be the number of messages in the non-spam archive that
have the same characteristic divided by the total number of messages
in the non-spam archive. (A is the likelihood of a spam message
containing the characteristic, and B is the likelihood of a non-spam
message containing it, i.e., of a false positive.)

Do this for many characteristics, divide A by B for each, and pick the
half dozen or so with the largest A / B values as the characteristics
to use.

For each chosen characteristic, use the negative of the natural
logarithm of B, -ln (B), as its weight in the attached recipe.

Example:

    For a spam archive of 1000 messages, where 600 have a common
    characteristic, A = 600 / 1000 = 0.6.

    For a non-spam archive of 2000 messages, where 2 have the same
    characteristic, B = 2 / 2000 = 0.001.

    A / B = 0.6 / 0.001 = 600. If 600 is larger than the A / B value
    of any of the half dozen characteristics already in use, then the
    weakest of those should be replaced with this one.

    The value used in the recipe would be -ln (B) = -ln (0.001) =
    6.90775527898:

        * 6.90775527898^0 test_for_some_characteristic

    which should be included in the last recipe in the attached.
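
The selection procedure and the example above can be sketched in
Python. This is only an illustration, not part of the attached recipe:
the function name rank_characteristics, the characteristic names, and
every count other than the 600 / 1000 and 2 / 2000 figures from the
example are made up for the sketch.

```python
import math

def rank_characteristics(spam_counts, ham_counts, n_spam, n_ham, keep=6):
    """Return (name, A/B, -ln B) tuples for the `keep` largest A/B ratios."""
    ranked = []
    for name, spam_hits in spam_counts.items():
        ham_hits = ham_counts.get(name, 0)
        if ham_hits == 0:
            # B = 0 makes A/B and -ln(B) undefined; this sketch skips it.
            continue
        a = spam_hits / n_spam  # A: fraction of spam with the characteristic
        b = ham_hits / n_ham    # B: fraction of non-spam with it
        ranked.append((name, a / b, -math.log(b)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:keep]

# The first entry is the example above; the other two are hypothetical.
spam_counts = {"some_characteristic": 600, "hypothetical_x": 300, "hypothetical_y": 900}
ham_counts = {"some_characteristic": 2, "hypothetical_x": 40, "hypothetical_y": 1500}

for name, ratio, weight in rank_characteristics(spam_counts, ham_counts, 1000, 2000):
    print(f"{name}: A/B = {ratio:.1f}, weight = {weight:.11f}")
```

How to handle a characteristic that never appears in the non-spam
archive (B = 0) is not covered in the text above; the sketch simply
leaves it out rather than guess.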

The way it works is that the values (6.90775527898 in the example)
for each characteristic that is true for a specific message are added
together. Since the values are negative logarithms of probabilities,
this sum is the negative logarithm of their product. And, since the
values represent probabilities of false positives, the chance of an
overall false positive is reduced significantly with each true
characteristic (assuming the characteristics are statistically
independent).
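
As a small check of the arithmetic, the following Python sketch adds
the weights for three characteristics and compares the sum with the
logarithm of the product of their false positive rates. The three
rates are hypothetical, chosen only to make the numbers easy to read,
and independence is assumed as stated above.

```python
import math

# Per-characteristic false positive rates B; these values are hypothetical.
false_positive_rates = [0.001, 0.01, 0.1]

# The recipe adds -ln(B) for every characteristic that is true for a message.
score = sum(-math.log(b) for b in false_positive_rates)

# Under the independence assumption, the combined false positive
# probability is the product of the individual rates.
combined = math.prod(false_positive_rates)

# Adding the logarithms is the same as multiplying the probabilities.
assert math.isclose(score, -math.log(combined))
print(f"score = {score:.6f}, combined false positive = 1 in {1 / combined:,.0f}")
```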

Note that no single characteristic (ORBS, etc.) is capable of
generating a false positive, resulting in an inappropriately trashed
message.

It turns out to be about 70-85 percent effective (but one can tinker
with the sigma limits, and probably tweak a little more out of it).
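
The sigma limits listed in the comments of the attached recipe can be
reproduced with the Python sketch below. It assumes that each limit is
the negative natural logarithm of the one-sided tail probability of a
normal distribution beyond k sigma; that reading is inferred from the
numbers in the recipe, which it matches, and is not stated explicitly
anywhere in this message. The function name sigma_threshold is made up
for the sketch.

```python
import math

def sigma_threshold(k):
    """-ln of the one-sided normal tail probability beyond k sigma."""
    tail = 0.5 * math.erfc(k / math.sqrt(2.0))
    return -math.log(tail)

for k in range(1, 6):
    t = sigma_threshold(k)
    # math.exp(t) is 1 / tail, i.e. the "1 in N" false positive odds.
    print(f"{k} sigma: threshold {t:.11f}, 1 in {math.exp(t):,.2f}")
```

A non-integer limit, for example sigma_threshold(4.5), gives a value
between the 4 and 5 sigma entries for anyone who wants to tinker.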

The characteristics used in the attached recipe require:

    1) A sendmail.cf (or equivalent) configuration that puts
       "... (unknown ..." in a "Received: " record if RDNS fails
       for the HELO.

    2) Some method of verifying whether the IP address contained in a
       "Received: " record is a blacklisted host. (I used
       http://www.johncon.com/john/receivedIP/index.html, but
       rblcheck(1), etc., can be used just as well.)

    3) The name of your smtp host, somesmtpserverdomain.com.

    4) Your e-mail address, someone@somedomain.com.

but can be modified to suit. (It was originally written to cut down
spam in a mailing list running smartlist.)

        John
--

John Conover   conover@email.rahul.net   http://www.johncon.com/

######################################################################
#
# Save the trusted return address.
#
:0 whc
SENDER=| formail -rtzx To:
#
# Save the machine generated return address.
#
:0 whc
FROM=| formail -rzx To:
#
# If the IP address in any "Received: " record is black listed, set
# ORBS to 1, else 0.
#
# (Note: this uses the off line black list from:
# http://www.johncon.com/john/receivedIP/index.html. Use any convenient
# method to set ORBS 1 if black listed, 0 if not, like rblcheck(1),
# etc.)
#
ORBS="0"
#
:0
* ? test -f "${HOME}/.procmail.reject"
* ? /usr/local/bin/receivedIPdb "${HOME}/.procmail.reject"
{
    ORBS="1"
}
#
# If the smtp server had to generate the "Message-Id: " record, and the
# message is not from someone in the smtp server's domain, set
# MSGIDGENERATED to 1, else 0.
#
# (The weights act as a logical AND: -3 + 2 + 2 is positive only when
# both conditions match.)
#
MSGIDGENERATED="0"
#
:0
* -3^0
* 2^0 ^message-id:.*somesmtpserverdomain\.com
* 2^0 !^(from|reply-to):.*somesmtpserverdomain\.com
{
    MSGIDGENERATED="1"
}
#
# Evaluate header construction.
#
:0
* 6.35899699817^0 !^to:
* 2.31011481332^0 !^(to|cc):.*someone@somedomain\.com
* 5.05971401614^0 ^received:.*\(unknown +.*by.*somesmtpserverdomain\.com
* 6.07131492861^0 ? test ${ORBS} != "0"
* 7.45760928973^0 ? test "${SENDER}" != "${FROM}"
* 3.87409035224^0 ? test ${MSGIDGENERATED} != "0"
* 16^0 ^x-advertisement:
* 16^0 ^subject:.*adv(ertise(ment)?.*)?(:|$)
{
    HEADERSCORE = $=
    #
    # Sigma values:
    #
    # 1.84102164502 = -ln (1 sigma false positive), 1 in 6.30297437513
    # 3.78318433404 = -ln (2 sigma false positive), 1 in 43.9557890318
    # 6.60772622151 = -ln (3 sigma false positive), 1 in 740.796695584
    # 10.3601014878 = -ln (4 sigma false positive), 1 in 31,574.3873622
    # 15.0649983951 = -ln (5 sigma false positive), 1 in 3,488,555.79111
    #
    # A score above the 5 sigma value (a false positive is less likely
    # than 5 sigma) can be safely trashed. Lowering the threshold from
    # 5 sigma to 4 sigma increases spam rejection, at the risk of
    # false positives.
    #
    :0
    * -15.0649983951^0
    * $$HEADERSCORE^0
    /dev/null
    #
    # A score above the 1 sigma value, but below the 5 sigma value, is
    # filed in the junk folder for evaluation; a score below 1 sigma
    # has a significant probability of being non-spam. Lowering the
    # threshold from 1 sigma to 0 increases spam rejection, at the
    # risk of false positives, which will be filed in the junk folder.
    # Raising the threshold to 2 sigma increases the risk of false
    # negatives, which will be passed.
    #
    :0:
    * -1.84102164502^0
    * $$HEADERSCORE^0
    junk
}
#
######################################################################
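
For tinkering with the weights and limits outside of procmail, the
decision made by the last recipe above can be mimicked with a short
Python sketch. The weights and the two limits are copied from the
recipe; the dictionary keys and the function name classify are
hypothetical labels invented for the sketch, and the caller is assumed
to have already decided which characteristics are true for a message.

```python
# Weights copied from the "Evaluate header construction" recipe; the
# key names are hypothetical labels for the eight conditions.
WEIGHTS = {
    "no_to_header": 6.35899699817,
    "not_addressed_to_me": 2.31011481332,
    "received_unknown": 5.05971401614,
    "blacklisted": 6.07131492861,
    "sender_differs_from_from": 7.45760928973,
    "message_id_generated": 3.87409035224,
    "x_advertisement": 16.0,
    "subject_adv": 16.0,
}

TRASH_LIMIT = 15.0649983951  # 5 sigma value from the recipe
JUNK_LIMIT = 1.84102164502   # 1 sigma value from the recipe

def classify(true_characteristics):
    """Return 'trash', 'junk', or 'deliver' for a set of true characteristics."""
    score = sum(WEIGHTS[name] for name in true_characteristics)
    # A procmail scoring recipe matches only when the total is positive,
    # so both comparisons are strict.
    if score > TRASH_LIMIT:
        return "trash"
    if score > JUNK_LIMIT:
        return "junk"
    return "deliver"

print(classify([]))
print(classify(["blacklisted", "sender_differs_from_from"]))
print(classify(["blacklisted", "sender_differs_from_from", "received_unknown"]))
```

Changing TRASH_LIMIT or JUNK_LIMIT here and re-running against a list
of known messages is one way to try other sigma limits before editing
the recipe itself.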
