stochastic spam detection
Attached is a ~/.procmailrc fragment that evaluates the header construction of an e-mail message to determine the probability that the message is spam.

The numbers were derived by analyzing two large e-mail repositories: one containing only spam, the other containing only non-spam. Let A be the number of messages in the spam archive that have a certain header characteristic divided by the total number of messages in the spam archive, and B be the number of messages in the non-spam archive that have the same characteristic divided by the total number of messages in the non-spam archive. (A is the likelihood of a spam message containing the characteristic, and B is the likelihood of a false positive.)

Do this for many characteristics, dividing A by B, and pick the half dozen or so largest values to determine which characteristics will be used. Use the negative of the natural logarithm of B in the attached recipe.

Example:

For a spam archive of 1000 messages, where 600 have a common characteristic:

    A = 600 / 1000 = 0.6

For a non-spam archive of 2000 messages, where 2 have the same characteristic:

    B = 2 / 2000 = 0.001

    A / B = 0.6 / 0.001 = 600

If 600 is larger than any of the A / B values for the half dozen characteristics already chosen, then one of them should be replaced with this one. The value used in the recipe would be -ln (B) = -ln (0.001) = 6.90775527898:

    * 6.90775527898^0 test_for_some_characteristic

which should be included in the last recipe in the attached.

The way it works is that the values (6.90775527898 in the example) for each characteristic that is true for a specific message are added together. Since the values are logarithms, this sum is the logarithm of a product: the negative logarithm of the product of the individual false positive probabilities. And, since the values represent probabilities of false positives, the chances of a total false positive are reduced significantly with each true characteristic (assuming the characteristics are statistically independent).
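The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not part of the attached recipe: the function name characteristic_stats and its arguments are made up for the sketch, and it assumes the characteristic appears at least once in the non-spam archive (otherwise B is 0 and has no logarithm).

```python
import math


def characteristic_stats(spam_hits, spam_total, ham_hits, ham_total):
    """Return (A, B, A / B, -ln B) for one header characteristic.

    Hypothetical helper for illustration. Assumes ham_hits > 0; a
    characteristic never seen in the non-spam archive would need
    separate handling.
    """
    a = spam_hits / spam_total   # likelihood that a spam message has it
    b = ham_hits / ham_total     # likelihood of a false positive
    return a, b, a / b, -math.log(b)


# The example from the text: 600 of 1000 spam, 2 of 2000 non-spam.
a, b, ratio, weight = characteristic_stats(600, 1000, 2, 2000)
print(f"A = {a}, B = {b}, A / B = {ratio:.1f}, weight = {weight:.11f}")

# Two weights taken from the recipe. Adding them gives the same result
# as multiplying the underlying false positive probabilities.
weights = [6.35899699817, 6.07131492861]
combined = math.exp(-sum(weights))
product = math.prod(math.exp(-w) for w in weights)
print(f"combined false positive probability = {combined:.3e}")
```

The last few lines show why the weights can simply be summed: exponentiating the negated sum gives the same number as multiplying the individual probabilities, which is only meaningful if the characteristics are statistically independent.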
Note that no single characteristic (orbs, etc.) is capable of generating a false positive, resulting in an inappropriately trashed message.

It turns out to be about 70-85 percent effective (but one can tinker with the sigma limits, and probably tweak a little more out of it).

The characteristics used in the attached recipe require:

    1) A sendmail.cf (or equivalent) configuration that puts "... (unknown ... in a "Received: " record if RDNS fails for the HELO.

    2) Some method of verifying whether the IP address contained in a "Received: " record is a black listed host. (I used http://www.johncon.com/john/receivedIP/index.html, but rblcheck(1), etc., can be used just as well.)

    3) The name of your smtp host, somesmtpserverdomain.com.

    4) Your e-mail address, someone@somedomain.com.

but the recipe can be modified to suit. (It was originally written to cut down spam in a mailing list running smartlist.)

        John

--

John Conover
conover@email.rahul.net
http://www.johncon.com/

######################################################################
#
# Save the trusted return address.
#
:0 whc
SENDER=| formail -rtzx To:
#
# Save the machine generated return address.
#
:0 whc
FROM=| formail -rzx To:
#
# If the IP address in any "Received: " record is black listed, set
# ORBS to 1, else 0.
#
# (Note: this uses the off line black list from:
# http://www.johncon.com/john/receivedIP/index.html. Use any convenient
# method to set ORBS 1 if black listed, 0 if not, like rblcheck(1),
# etc.)
#
ORBS="0"
#
:0
* ? test -f "${HOME}/.procmail.reject"
* ? /usr/local/bin/receivedIPdb "${HOME}/.procmail.reject"
{
    ORBS="1"
}
#
# If the smtp server had to generate the "Message-Id: " record, and the
# message is not from someone in the smtp server's domain, set
# MSGIDGENERATED to 1, else 0.
#
# (The score starts at -3 and each condition adds 2, so both
# conditions must match for the total to be positive.)
#
MSGIDGENERATED="0"
#
:0
* -3^0
* 2^0 ^message-id:.*somesmtpserverdomain\.com
* 2^0 !^(from|reply-to):.*somesmtpserverdomain\.com
{
    MSGIDGENERATED="1"
}
#
# Evaluate header construction.
#
:0
* 6.35899699817^0 !^to:
* 2.31011481332^0 !^(to|cc):.*someone@somedomain\.com
* 5.05971401614^0 ^received:.*\(unknown +.*by.*somesmtpserverdomain\.com
* 6.07131492861^0 ? test ${ORBS} != "0"
* 7.45760928973^0 ? test ${SENDER} != ${FROM}
* 3.87409035224^0 ? test ${MSGIDGENERATED} != "0"
* 16^0 ^x-advertisement:
* 16^0 ^subject:.*adv(ertise(ment)?.*)?(:|$)
{
    HEADERSCORE = $=
    #
    # Sigma values:
    #
    # 1.84102164502 = -ln (1 sigma false positive), 1 in 6.30297437513
    # 3.78318433404 = -ln (2 sigma false positive), 1 in 43.9557890318
    # 6.60772622151 = -ln (3 sigma false positive), 1 in 740.796695584
    # 10.3601014878 = -ln (4 sigma false positive), 1 in 31,574.3873622
    # 15.0649983951 = -ln (5 sigma false positive), 1 in 3,488,555.79111
    #
    # A score above the 5 sigma value can be safely trashed.
    # Lowering the threshold from 5 sigma to 4 sigma increases
    # spam rejection, at the risk of false positives.
    #
    :0
    * -15.0649983951^0
    * $$HEADERSCORE^0
    /dev/null
    #
    # A score above the 1 sigma value, but below the 5 sigma value,
    # is filed in the junk folder for evaluation; a score below
    # 1 sigma has a significant probability of being non-spam.
    # Lowering the threshold from 1 sigma to 0 increases spam
    # rejection, at the risk of false positives, which will be filed
    # in the junk folder. Raising the threshold to 2 sigma increases
    # the risk of false negatives, which will be passed.
    #
    :0:
    * -1.84102164502^0
    * $$HEADERSCORE^0
    junk
}
#
######################################################################
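The sigma values listed in the recipe's comments appear to be the negative natural logarithm of the one-sided normal tail probability at 1 through 5 standard deviations. That reading is inferred from the numbers themselves, not stated in the post. Under that assumption, the thresholds and the trash/junk/deliver decision can be sketched in Python; sigma_threshold and classify are hypothetical names used only for this illustration.

```python
import math


def sigma_threshold(n_sigma):
    """-ln of the one-sided normal tail probability beyond n_sigma."""
    tail = 0.5 * math.erfc(n_sigma / math.sqrt(2.0))
    return -math.log(tail)


TRASH = sigma_threshold(5)   # about 15.065, the /dev/null threshold
JUNK = sigma_threshold(1)    # about 1.841, the junk folder threshold


def classify(score):
    """Mimic the two nested recipes: trash, junk folder, or deliver."""
    if score > TRASH:
        return "trash"
    if score > JUNK:
        return "junk"
    return "deliver"


# Print each threshold with its "1 in N" false positive odds.
for n in range(1, 6):
    t = sigma_threshold(n)
    print(f"{n} sigma: {t:.6f}  (1 in {math.exp(t):,.1f})")

# Hypothetical message: no "To:" header and a black listed relay.
score = 6.35899699817 + 6.07131492861
print(classify(score))
```

With these thresholds, the hypothetical message scores between the 1 sigma and 5 sigma values, so it would be filed in the junk folder rather than trashed.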