HOW-TO enable users to easily train Bayesian data

IT_Architect

Verified User
Joined
Feb 27, 2006
Messages
888
HOW-TO enable users to easily train SpamAssassin's Bayesian data

Purpose: POP3 is a great way to go for email, and POPFile is psychic when it comes to keeping spam out of your inbox...on your client. The problem is that today it's not just our PC anymore; we have mobile devices that display the contents of our mailboxes, so now we need to figure out how to effectively filter out spam there. DA has a Help article (http://help.directadmin.com/item.php?id=358) that illustrates how to place a script in each user's .spamassassin directory so they can train their Bayesian data. As I started fiddling with their code to get it to work in my environment, I got distracted into making that functionality easily accessible to users and fully automated. This is what came out of that.

Credits: The Dutch Daemon and SeLLeRoNe for their su guidance, Arieh for the best place and cleanest way to get a list of DA users, and DA for a lot of the client-side code.

Concept: The solution is made up of two scripts, sa-teach_crawler, and sa-teach.
1. sa-teach_crawler - It is cronned to run as root, compiles and filters a list of DA users that are using spamassassin, and executes sa-teach AS THE DA USER to build each DA USER'S Bayesian data.
2. sa-teach - Looks in the VIRTUAL USERS' and DA USERs' mailboxes for the folders INBOX.teach-isspam and/or INBOX.teach-isnotspam. For each of these folders it calls SpamAssassin's sa-learn utility to process messages in those folders, and sync them to the Bayesian data for the DA USER, NOT THE VIRTUAL USER.

Requirements: The only things required are an IMAP client and a few one-time settings: create the IMAP folders INBOX.teach-isspam and INBOX.teach-isnotspam, enable SpamAssassin, set it to deliver spam to the user's spam folder, set the custom threshold to 2.6 for now, and press the button at the bottom that reads "Delete Bayes Data". Virtual users can then drag or copy messages into those folders from any device, and it will teach itself on the next run. No packages need to be installed.
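
The two folders can also be created on disk instead of from an IMAP client. The following is only a sketch, assuming the Maildir layout that sa-teach looks for (a dot-prefixed directory with cur, new and tmp inside it). The function name make_teach_folders and the demo path are made up for illustration, and it runs against a throwaway directory rather than a real mailbox.

```shell
#!/bin/sh
# Sketch: create the two teach folders inside one Maildir.
# Assumption: an IMAP folder is a dot-prefixed directory holding cur, new, tmp.
# make_teach_folders is a hypothetical helper, not part of the scripts below.

make_teach_folders()
{
    maildir="$1"
    for folder in .INBOX.teach-isspam .INBOX.teach-isnotspam; do
        for sub in cur new tmp; do
            mkdir -p "${maildir}/${folder}/${sub}" || return 1
        done
    done
}

# Demonstration against a temporary directory, not a real mailbox
demo_maildir="$(mktemp -d /tmp/sa-teach-demo.XXXXXX)/Maildir"
make_teach_folders "$demo_maildir" && ls -a "$demo_maildir"
```

On a real server this would have to run as the mailbox owner so the ownership comes out right, and depending on the IMAP server the client may still need to subscribe to the folders before they show up.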

Code: The code is well documented
sa-teach_crawler: The sa-teach_crawler.log shows when it last ran, which users sa-teach was called for (meaning they are using SpamAssassin), when each user finished, and when the whole script finished, so you can work out the time per user and the total run time. It only records the latest run, so it does not grow.
Code:
#!/bin/sh
#********************************************************************#
#
#					SpamAssassin - sa-teach_crawler
#					(Calls sa-teach for each user so
#					(user can train their bayesian data)
#
#********************************************************************#
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# ***** Define Variables *****
logpath="/var/log/sa-teach_crawler.log"
userlist_dir="/usr/local/directadmin/data/users"	# Directory of control panel users

# ***** Main *****
# Check & clean delete-teach-messages parameter
# DELETE_TEACH_MSGS=1 deletes teach messages after processing in teach-isspam
# DELETE_TEACH_MSGS=2 deletes teach messages in both teach-isspam and teach-isnotspam
# Anything else (empty, 0, non-numeric) means delete nothing.  A case match is
# used because a numeric test such as [ abc -ne 0 ] errors out instead of
# rejecting the value.
case "$1" in
  1|2) DELETE_TEACH_MSGS="$1" ;;
  *)   DELETE_TEACH_MSGS=0 ;;
esac

echo "Started at $(date)">${logpath}
cd "$userlist_dir" || exit 1	# Stop if the user directory is missing
# Loop through users
for cp_user in * ; do		# Comment for testing to limit to one user
    #cp_user="<user name>"	# Uncomment and add user name for testing
    sa_dir="/home/$cp_user/.spamassassin"
    if [ -d "$sa_dir" ]; then
        # Run sa-teach as the DA user, passing the delete mode through
        su -l "$cp_user" -c "sa-teach $DELETE_TEACH_MSGS" >/dev/null 2>&1
        echo "User $cp_user finished at $(date)">>${logpath}
    fi
done						# Comment for testing to limit to one user
echo "Finished at $(date)">>${logpath}
sa-teach: The sa-teach.log is created in the user's .spamassassin folder each time sa-teach runs for that user. If it finds no spam or ham directories to process, it simply makes an entry with the date and time and a note that the user is not participating in manual Bayesian training. If there are mailboxes with spam or ham directories, it shows the number of messages processed for each mailbox, the Bayesian statistics for that user, and the date and time it was run.
Code:
#!/bin/sh
# ********************************************************************
#
#					SpamAssassin - sa-teach
#					(Train users bayesian data
#					and records results in 
#					/home/user/.spamassassin/sa-teach.log)
#
# ********************************************************************
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# ***** Define Variables *****
TEACH_SPAM_FOLDER=".INBOX.teach-isspam"
TEACH_HAM_FOLDER=".INBOX.teach-isnotspam"

# ***** Initialize Variables *****
isteaching_bayes=0
# Check & clean delete-teach-messages parameter
# DELETE_TEACH_MSGS=1 deletes teach messages after processing in teach-isspam
# DELETE_TEACH_MSGS=2 deletes teach messages in both teach-isspam and teach-isnotspam
# Anything else (empty, 0, non-numeric) means delete nothing.  A case match is
# used because a numeric test such as [ abc -ne 0 ] errors out instead of
# rejecting the value.
case "$1" in
  1|2) DELETE_TEACH_MSGS="$1" ;;
  *)   DELETE_TEACH_MSGS=0 ;;
esac

# Set home path
DA_HOME=/home/${USER}

# Set working directory
cd "$DA_HOME/.spamassassin" || exit 1

# Send output to the sa-teach.log file located in the user's
# /home/user/.spamassassin directory.  It only logs the last run
# of sa-teach so it won't grow.  It can quickly tell you whether
# the user or any virtual user has directories set up to train
# SpamAssassin's Bayesian data.  It provides a quick reference to
# the user's Bayes statistics.  It includes the date and time of
# the last run so you will know if the teaching is actually occurring.
exec > sa-teach.log
exec 2>&1

# ***** Functions *****
learn_Maildir()
{
            FILESPAM="${1}/${TEACH_SPAM_FOLDER}"
            FILEHAM="${1}/${TEACH_HAM_FOLDER}"

            # cur and new are listed explicitly: {cur,new} brace expansion is
            # a bash feature that not every /bin/sh expands (dash and
            # FreeBSD's sh pass it through literally).
            if [ -e "${FILESPAM}/new" ] || [ -e "${FILESPAM}/cur" ]; then
                        isteaching_bayes=1
                        echo "learning spam via ${FILESPAM}..."
                        sa-learn --no-sync --spam "${FILESPAM}/cur" "${FILESPAM}/new"
            fi

            if [ -e "${FILEHAM}/new" ] || [ -e "${FILEHAM}/cur" ]; then
                        isteaching_bayes=1
                        echo ""
                        echo "learning ham via ${FILEHAM}..."
                        sa-learn --no-sync --ham "${FILEHAM}/cur" "${FILEHAM}/new"
            fi

            # Delete teach messages if requested
            if [ "$DELETE_TEACH_MSGS" -eq 1 ] || [ "$DELETE_TEACH_MSGS" -eq 2 ]; then
                        rm -f "${FILESPAM}"/new/* "${FILESPAM}"/cur/*
            fi
            if [ "$DELETE_TEACH_MSGS" -eq 2 ]; then
                        rm -f "${FILEHAM}"/new/* "${FILEHAM}"/cur/*
            fi
}

# ***** Main *****
# Learn from user mailbox
if [ -e $DA_HOME/Maildir ]; then
     learn_Maildir $DA_HOME/Maildir
fi

# Learn from virtual user mailboxes
for DOMAIN_DIR in "$DA_HOME"/imap/*; do
            # Skip symlinked domain directories
            if [ -h "$DOMAIN_DIR" ]; then
                        continue
            fi

            for maildir in "$DOMAIN_DIR"/*/Maildir; do
                        # An unmatched glob stays literal, so test before use
                        if [ -d "$maildir" ]; then
                                    learn_Maildir "$maildir"
                        fi
            done
done

if [ $isteaching_bayes -eq 1 ]; then

	# Commit learning to database
	echo "";
	echo "syncing...";
	sa-learn --sync

	# Show stats after commit
	echo "";
	echo "current status:"
	sa-learn --dump magic

	echo Delete Teach Data is "$DELETE_TEACH_MSGS"
else
	echo "User $USER is not manually teaching Bayesian data."
fi
echo "Generated $(date)"
# ***** End Script *****
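
Since every user's sa-teach.log says whether teach folders were found, a quick way to see who is participating is to scan those logs. This is a rough sketch and not part of the two scripts: list_teaching_users is a made-up name, the users alice and bob are invented, and it runs against a temporary tree so it can be tried safely. On a real server the argument would presumably be /home and it would need to run as root to read each user's log.

```shell
#!/bin/sh
# Sketch: report which users' last sa-teach run found teach folders.
# list_teaching_users is a hypothetical helper; home_root is a parameter
# only so the function can be tried on a test tree.

list_teaching_users()
{
    home_root="$1"
    for userdir in "$home_root"/*; do
        log="$userdir/.spamassassin/sa-teach.log"
        [ -f "$log" ] || continue            # no log: sa-teach never ran here
        user="$(basename "$userdir")"
        # sa-teach writes this phrase when it finds no teach folders
        if grep -q "is not manually teaching" "$log"; then
            echo "$user: not teaching"
        else
            echo "$user: teaching"
        fi
    done
}

# Demonstration with two invented users in a temporary directory
demo_root="$(mktemp -d /tmp/sa-teach-demo.XXXXXX)"
mkdir -p "$demo_root/alice/.spamassassin" "$demo_root/bob/.spamassassin"
echo "User alice is not manually teaching Bayesian data." \
    > "$demo_root/alice/.spamassassin/sa-teach.log"
echo "syncing..." > "$demo_root/bob/.spamassassin/sa-teach.log"
list_teaching_users "$demo_root"
```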
crontab entry: The 1 at the end of the command tells sa-teach to remove messages found in the .INBOX.teach-isspam directories after it finishes processing. A 2 at the end of the command tells sa-teach to remove messages found in both the .INBOX.teach-isspam and .INBOX.teach-isnotspam directories. Passing nothing, or anything other than 1 or 2, will leave the messages untouched. The next time sa-teach is called, sa-learn will not process messages that have already been processed.
Code:
##### SpamAssassin - Teach user Bayesian data from virtual users' spam and ham directories
*/5	*	*	*	*	root	/usr/bin/sa-teach_crawler 1
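
The parameter rule described above (1 removes spam teach messages, 2 removes spam and ham teach messages, anything else removes nothing) can be written as a small case statement. This is just an illustration of the rule; normalize_delete_mode is a hypothetical name and does not appear in the scripts.

```shell
#!/bin/sh
# Sketch: map a command-line argument to the delete mode described above.
# normalize_delete_mode is a hypothetical helper name.

normalize_delete_mode()
{
    case "$1" in
        1|2) echo "$1" ;;   # 1 = delete spam teach msgs, 2 = delete spam and ham
        *)   echo 0 ;;      # empty, non-numeric or out of range: delete nothing
    esac
}

# Show how a few sample arguments are treated
for arg in "" 0 1 2 3 abc; do
    echo "argument '$arg' -> mode $(normalize_delete_mode "$arg")"
done
```

A pattern match is used rather than a numeric comparison so that a stray non-numeric argument falls through to 0 instead of producing a test error.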
Usage Comments:
- If you call sa-teach_crawler with 1 as a parameter, it removes the spam teach messages after processing. If you specify 2, it removes both spam teach messages and ham teach messages, so remember to copy them in there, not drag, if you need to keep them.
- If you read how SpamAssassin's automatic Bayes works, it is a hokey concept, but then what else could it be with no training. That also means that the garbage that has been building up in your Bayes data for eons will work against you. If you don't press the "Delete Bayes Data" button on the bottom of the SpamAssassin screen to flush out the old data, Halley's Comet will be back before you get that thing trained.
- When setting the SpamAssassin options, you want to have the spam go to the user's spam folder in most cases, because from IMAP you will be able to see that folder to check and empty it.
- As I said earlier, POPFile is psychic. It almost never has a false positive or false negative after a little training. I set up an IMAP account in Outlook for the same email address as the POP3 account. I modified the rule that I have in Outlook that sends spam to Junk E-Mail in POP3 so that it also sends a copy of the spam to the IMAP teach-isspam folder. Thus, with sa-teach_crawler running every 5 minutes, it trains itself. They say to throw some ham into the teach ham folder too. I threw a few in, but things are working so well that I'm not going to rock the boat.

Comments and Disclaimers:
- The code is not well tested. I thought of and made changes while I was writing this.
- I tried my best to make this work on any DA OS, but I switched from Linux to FreeBSD quite a few years ago, so I don't know what I'm doing in the Linuxes anymore.
- When it comes to coding scripts, I'm no SMTalk, so if you have some things that should be fixed or cleaned up, let me know.
- If DA wants to hijack this into their Help topic like I hijacked some of their code from their Help topic, and improve it, feel free.
- I'm thinking about adding code that automatically creates the needed directories everywhere so users simply have to turn on SpamAssassin and drag files to the appropriate directories.
- We understand how this all works, but I don't have anything put together yet to hand to a customer on how to configure it. I'll save a spot under this article for that, but I'm out of time for now.
 

IT_Architect

Verified User
Joined
Feb 27, 2006
Messages
888
For future documentation and revision comments. If I make a code change to the above, I will document it here.
 

ricardo777

Verified User
Joined
Mar 29, 2012
Messages
80
Location
Netherlands
I didn't find it while searching here, but the author did PM me after I posted it that he had posted one on a thread. I looked and didn't find it again, so thanks for the link.
Yes, I had the script saved on my cloud but couldn't find the post either; I looked for a very long time and finally found it.

Is the script you provide much different from his script, and which one does a better job? We have been using his script for a very long time now. (I must say that spam still gets through it.)
 

SeLLeRoNe

Super Moderator
Joined
Oct 9, 2004
Messages
6,788
Location
A Coruña, Spain
Not much different. Actually, it looks in the INBOX folder for non-spam and in the Junk folder for spam, so it also whitelists the things you keep in INBOX :)

It also has a setting to remove spam or not once it is learned, but I usually keep it disabled.

My script actually takes a long time to run (for big email accounts), since it checks the INBOX folder :)

But mainly they do the same thing. If you want to take a deeper look into it, just drop me a PM ;)

Regards
 

TomJones

Verified User
Joined
May 9, 2004
Messages
59
Not to hijack this old thread, but I spent all day trying to implement this on my cent7 server without getting it to process the users. The crawler would run successfully, so I finally tried running just sa-teach with:
su -l (DAUserName) -c 'sa-teach 2'
and I got a Permission Denied. I googled the error message, and one of the top links was to Poralix's https://github.com/poralix/directadmin-teach-sa script. It took about 2 minutes to get it installed and running successfully. Maybe this will help someone in the future who happens upon this thread.
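
For anyone hitting the same Permission Denied: the thread does not say what caused it, but a missing execute bit on the script is one common reason for that message, and it is quick to rule out. The following is only a sketch; check_executable is a made-up helper, and the path you pass in is wherever sa-teach was installed on your system. The demonstration uses a temporary file instead of the real script.

```shell
#!/bin/sh
# Sketch: check whether a script exists and is executable by the current user.
# check_executable is a hypothetical helper; pass it the path where sa-teach
# was installed (an assumption, the thread does not state the location).

check_executable()
{
    script="$1"
    if [ ! -e "$script" ]; then
        echo "$script: not found"
        return 1
    elif [ ! -x "$script" ]; then
        echo "$script: exists but is not executable for $(id -un)"
        return 1
    fi
    echo "$script: ok"
}

# Demonstration with a temporary file standing in for the script
demo_script="$(mktemp /tmp/sa-teach-demo.XXXXXX)"
check_executable "$demo_script"    # mktemp creates the file without execute bits
chmod 755 "$demo_script"
check_executable "$demo_script"
rm -f "$demo_script"
```

Because the crawler runs sa-teach through su, the check is most useful when run as the DA user rather than as root.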
 