OliverScott
Verified User
- Joined
- May 4, 2007
- Messages
- 57
As a matter of interest I have tried to analyse my SpamAssassin scores taken from my MAILLOG file for the last few days.
I'm sure this could have been done directly on the server - will look into this at some point.
1 - Remove everything but the SpamAssassin scores (did this using a bit of Windows freeware called BK ReplaceEM)
2 - Import into Excel
3 - Split up the scores into HAM and SPAM based on a threshold of 5 (this will obviously not be 100% accurate but is probably the only way to get an approximation?)
4 - Graph samples using the free ExAnalyze Excel plugin. HAM looks to be definitely Normally Distributed (http://en.wikipedia.org/wiki/Normal_distribution). SPAM is a bit distorted due to the lower cutoff of 5 and the fact that you probably can't get SPAM scoring above a certain threshold due to the way the SpamAssassin rules are written - but it is roughly the right shape so it will have to do.
5 - Calculate the Mean and Standard Deviation for the HAM and SPAM scores:
Mean (Average): SPAM = 27.14, HAM = 0.52
Standard Deviation : SPAM = 10.96, HAM = 1.35
6 - Based on the way a Normal Distribution works (see above wikipedia entry) calculate the percentages of SPAM and HAM within certain ranges:
68% will fall within 1 standard deviation either side of the mean.
95% will fall within 2 standard deviations either side of the mean.
99.7% will fall within 3 standard deviations either side of the mean.
99.99% will fall within 4 standard deviations either side of the mean.
As we consider all email over a threshold to be junk (without an upper limit) and all email under a threshold to be OK (without a lower limit) we can adjust these percentages to take into account the one sided nature of our setup:
i.e. 68% of SPAM will fall within one standard deviation either side of the mean, but everything above the mean score is spam anyway. If we split the range down the middle then all email above the split will be junk (50%), and below the split, but within one standard deviation of the mean, will be a further half of our 68% junk email. This gives a total of 84% of junk email being detected with a threshold set one standard deviation below the mean.
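84% will fall on the correct side of a threshold set 1 standard deviation away from the mean.
97.5% will fall on the correct side of a threshold set 2 standard deviations away from the mean.
99.85% will fall on the correct side of a threshold set 3 standard deviations away from the mean.
99.995% will fall on the correct side of a threshold set 4 standard deviations away from the mean.
(For SPAM the threshold sits below the mean, for HAM it sits above the mean.)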
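The one-sided percentages above can also be checked against the Normal cumulative distribution function rather than by halving the two-sided figures. This is only a minimal sketch using Python's standard library; the function names are made up for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Fraction of a standard Normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided(k: float) -> float:
    """Fraction within k standard deviations either side of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


def one_sided(k: float) -> float:
    """Fraction on the correct side of a threshold k standard deviations from the mean."""
    return normal_cdf(k)


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} SD: two-sided {two_sided(k):.3%}, one-sided {one_sided(k):.4%}")
```

The exact one-sided values come out at roughly 84.1%, 97.7%, 99.87% and 99.997%, so the rounded figures in the list are slightly on the cautious side.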
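7 - Applying these scores (and the stats taken from my maillog) to some possible SpamAssassin thresholds for Tagging (5) and Discarding (15) gives:
The Discarding threshold of 15 is about 1.1 standard deviations below the SPAM mean, and the Tagging threshold of 5 is about 2 standard deviations below it. These thresholds would therefore be discarding over 84% of SPAM, and if you include email which would be tagged but still delivered then I would be catching over 97.5% of SPAM.
Looking at the statistics for HAM, the Tagging threshold of 5 is about 3.3 standard deviations above the HAM mean. This gives at least 99.85% of HAM being less than the Tagging threshold (or at most 15 emails out of every 10,000 being tagged by mistake), and basically none being discarded by mistake (significantly less than 5 in 100,000 emails).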
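For anyone who wants to try other thresholds, the same estimate can be worked out directly from the means and standard deviations quoted in step 5. This is only a sketch and it assumes both score sets really are Normally Distributed, which is an approximation (especially for SPAM). The function and variable names are made up for illustration.

```python
import math

# Figures quoted in step 5 of the post.
SPAM_MEAN, SPAM_SD = 27.14, 10.96
HAM_MEAN, HAM_SD = 0.52, 1.35

# Thresholds considered in step 7.
TAG, DISCARD = 5.0, 15.0


def fraction_above(threshold: float, mean: float, sd: float) -> float:
    """Fraction of a Normal(mean, sd) distribution scoring above threshold.

    Uses erfc so that very small tail values are not lost to rounding.
    """
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


spam_discarded = fraction_above(DISCARD, SPAM_MEAN, SPAM_SD)
spam_caught = fraction_above(TAG, SPAM_MEAN, SPAM_SD)
ham_tagged = fraction_above(TAG, HAM_MEAN, HAM_SD)
ham_discarded = fraction_above(DISCARD, HAM_MEAN, HAM_SD)

if __name__ == "__main__":
    print(f"SPAM discarded:        {spam_discarded:.2%}")
    print(f"SPAM tagged/discarded: {spam_caught:.2%}")
    print(f"HAM tagged by mistake: {ham_tagged:.4%}")
    print(f"HAM discarded:         {ham_discarded:.3e}")
```

Because the thresholds do not sit on exact multiples of a standard deviation, the results are a little better than the rounded-down figures quoted above. Changing TAG and DISCARD shows how the trade-off moves.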
OK so I have had to make a few approximations to do this and need to apply it to a MUCH larger and longer data set, and as with all statistics it should be taken with a pinch of salt. However it seems a useful thing to do in order to be able to come up with some estimates of the success and failure rate of a SpamAssassin setup, and to be able to make an educated decision about what thresholds to use...
Has anyone else tried this in the past or are there any scripts which do this automatically? If not then would people be interested in having one that does this sort of analysis automatically on their maillogs?
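As a rough idea of what such a script might look like, here is a minimal sketch covering steps 1, 3 and 5. It ASSUMES that each relevant log line contains the score in the form "(score/threshold)", e.g. "(12.3/5.0)". That pattern is a guess and not taken from my actual maillog, so the regular expression would need adjusting to whatever your log really contains. All names here are hypothetical.

```python
import re
import statistics

# ASSUMPTION: scores appear in the log as "(score/threshold)".
# Adjust this pattern to match the real log format.
SCORE_RE = re.compile(r"\((-?\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)\)")


def extract_scores(lines):
    """Step 1: keep only the scores, ignoring lines without one."""
    scores = []
    for line in lines:
        match = SCORE_RE.search(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


def summarise(scores, threshold=5.0):
    """Steps 3 and 5: split at the threshold, then compute mean and SD.

    Returns a dict mapping "ham"/"spam" to (count, mean, sample SD).
    A group with fewer than two scores is left out, since a sample
    standard deviation cannot be computed for it.
    """
    groups = {
        "ham": [s for s in scores if s < threshold],
        "spam": [s for s in scores if s >= threshold],
    }
    summary = {}
    for name, group in groups.items():
        if len(group) >= 2:
            summary[name] = (
                len(group),
                statistics.mean(group),
                statistics.stdev(group),
            )
    return summary
```

To use it, pass the open log file (or any iterable of lines) to extract_scores and the result to summarise. The means and standard deviations can then be fed into the Normal Distribution calculation from step 6.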
Any suggestions on how to improve this approach? It has been a long time since I studied statistics!