Bayes?

Strator

I've started to train my bayes database today to improve spam filtering:

sa-learn -D --spam /my-spam/*.*

I've noticed, however, that this command updates the spam database in /root/ which is fairly small - whereas it seems like the bigger/active bayes databases are actually located in /home/.

Does the bayes database in /root/ get used at all? If not, how can I use a global bayes database for all mail on my server? Obviously, it's a nightmare to manually update all bayes databases for each account. Thanks!
 
Hm... thanks for the reply, but I'm not entirely sure if you have understood my question. Maybe let me word it differently: How can I make sure that the one (1) bayesian database I train gets used by all mail accounts on the server?
 
See my examples below:

Running from regular username:

Code:
# sa-learn --ham /home/user/imap/domain/alex/Maildir/cur/ -u user
Learned tokens from 28 message(s) (28 message(s) examined)
[root@vps1 cur]# sa-learn --dump magic -u user
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0         28          0  non-token data: nham
0.000          0       7141          0  non-token data: ntokens
0.000          0 1305804950          0  non-token data: oldest atime
0.000          0 1324310611          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
[root@vps1 cur]# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0         28          0  non-token data: nham
0.000          0       7141          0  non-token data: ntokens
0.000          0 1305804950          0  non-token data: oldest atime
0.000          0 1324310611          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
[root@vps1 cur]#


Running from root:

Code:
[root@vps1 cur]# sa-learn --ham /home/user/imap/domain/alex/Maildir/.INBOX.Misc/cur/
Learned tokens from 183 message(s) (183 message(s) examined)
[root@vps1 cur]# sa-learn --dump magic -u user
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0        211          0  non-token data: nham
0.000          0      20068          0  non-token data: ntokens
0.000          0 1266521347          0  non-token data: oldest atime
0.000          0 1324310611          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
[root@vps1 cur]# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0        211          0  non-token data: nham
0.000          0      20068          0  non-token data: ntokens
0.000          0 1266521347          0  non-token data: oldest atime
0.000          0 1324310611          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
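
To compare two dumps like the ones above without reading every line, the counters can be pulled out with a small filter. This is only a sketch: bayes_counts is a hypothetical helper name (not part of SpamAssassin), and it assumes the dump format shown in the examples above. Identical counters from both invocations are consistent with both commands reading the same database.

```shell
# Print "name value" for the nspam, nham and ntokens counters found in
# `sa-learn --dump magic` output read from stdin.
# bayes_counts is a hypothetical helper, not a SpamAssassin command.
bayes_counts() {
    # Field 3 is the value column; the last field is the counter name.
    awk '/non-token data: (nspam|nham|ntokens)$/ { print $NF, $3 }'
}

# Usage on a real server (requires SpamAssassin to be installed):
#   sa-learn --dump magic -u user | bayes_counts
#   sa-learn --dump magic | bayes_counts
```

If the two outputs differ, the two invocations are most likely reading different Bayes databases.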


See

Code:
man sa-learn

for details (the link was already mentioned above: http://spamassassin.apache.org/full/3.3.x/doc/sa-learn.html).


--dump option

Display the contents of the Bayes database. Without an option or with the all option, all magic tokens and data tokens will be displayed. magic will only display magic tokens, and data will only display the data tokens.
 
Sorry, but I still don't understand what your posts have to do with my original question.
 
If you want to understand, please read the article at the link.
Then post the results of the following here (run as root)

Code:
# sa-learn --dump magic -u any_actual_username_but_not_root_here

and

Code:
# sa-learn --dump magic
 
I appreciate you trying to help, but there is really no benefit in me trying to understand a solution that doesn't apply to my problem.

Please kindly reread my initial posts - I've worded my issue twice, and there's really nothing more I can do on my end for YOU to understand. Thank you.
 
My examples answered your question. They show exactly the things you wanted to know. If you want a detailed description or an answer in another form, you'd better read the documentation, as you already have all the necessary tools. If you do not know how to use the tools or how to read their output, please read the docs.
 
Alex, I didn't ask for tools. First and foremost I wanted to know if the bayes database that's located in the root directory gets used at all. You can't answer that question with a link to the SpamAssassin manual, because it's a matter of how DirectAdmin handles SA. The answer is either a "yes" or a "no".

If the answer is "no" (which I suspect), I wanted to know how to configure DA/SA so a single, global bayes database is used for all mail accounts on the same server.

Your reply was that I "should run it from general username". Hm. To be honest, I don't even have a clue what you are trying to tell me here. Most of bayes training happens automatically (without me running any sa-learn manually). That's how I noticed that the bayes files in the root directory are essentially empty, while the ones in the home directories are much larger, depending on the usage of the various mail accounts. So your examples on how to learn ham from two different directories are going to help me in which way?
 
OK, first of all, the SpamAssassin that comes with Directadmin does not differ in any way from the one that you (or anybody else) would install with yum/rpm/apt-get/aptitude/portinstall/pkg_add on a server without DA. Thus you should deal with SA as if you had not installed Directadmin. And as soon as you wish to get better acquainted with SpamAssassin (at least I presume that you wish to), you really should read the documentation and manual.

How can we check in practice whether the bayes database located in the root directory is used by any other user? How can we check in practice whether the bayes database gets updated? Here the manual comes to help:

--dump option

Display the contents of the Bayes database. Without an option or with the all option, all magic tokens and data tokens will be displayed. magic will only display magic tokens, and data will only display the data tokens.

So with that command we can at least check whether root's Bayes database gets updated when you train SpamAssassin as a regular (general, or non-root) user.

Note that SpamAssassin can be trained in two ways: automatically and manually. Check your /etc/mail/spamassassin/local.cf (if it exists, of course). In mine I've got

Code:
bayes_auto_learn 1
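
A quick way to see which Bayes-related settings a config file actually sets is to filter it for lines starting with bayes_. A minimal sketch follows; bayes_settings is a hypothetical helper name, and the path in the usage comment is the one mentioned above and may differ on your server.

```shell
# List the active (uncommented) bayes_* settings in a SpamAssassin config file.
# bayes_settings is a hypothetical helper, not a SpamAssassin command.
bayes_settings() {
    # $1 = path to the config file; commented-out lines do not match.
    grep -E '^[[:space:]]*bayes_[a-z_]+[[:space:]]' "$1"
}

# Usage:
#   bayes_settings /etc/mail/spamassassin/local.cf
```

If nothing is printed, the file sets no bayes_* options and SpamAssassin's defaults apply.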

OK, there is a brilliant parameter, -D (debug output), which can be used with

sa-learn
sa-update
spamassassin

And if you run sa-learn with -D you might see some useful details:

Code:
sa-learn --ham  /home/username/imap/domain.com/alex/Maildir/cur/ -D 2>&1 | less

e.g.

Code:
Dec 21 00:58:35.778 [5683] dbg: bayes: tie-ing to DB file R/O /etc/mail/spamassassin/bayes-shared/bayes_toks
Dec 21 00:58:35.778 [5683] dbg: bayes: tie-ing to DB file R/O /etc/mail/spamassassin/bayes-shared/bayes_seen

Here you can see which Bayes database is used.
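
Since the debug output is long, it can help to keep only the database paths. The sketch below assumes the "tie-ing to DB file" debug lines look like the ones shown above; bayes_db_paths is a hypothetical helper name, not something SpamAssassin ships.

```shell
# Read SpamAssassin debug output on stdin and print each Bayes DB file
# that was opened, once. bayes_db_paths is a hypothetical helper.
bayes_db_paths() {
    # The file path is the last field of every "tie-ing to DB file" line.
    awk '/bayes: tie-ing to DB file/ { print $NF }' | sort -u
}

# Usage (debug output goes to stderr, hence 2>&1):
#   sa-learn --dump magic -D 2>&1 | bayes_db_paths
```

Running this once as root and once as a regular user should show whether both end up on the same files.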

If you run spamassassin in test mode with -D, you can also see useful details about exactly which database is used. In any case, if you try to make spamassassin use the Bayes database located in /root/.spamassassin, a regular user won't be able to access it as long as the /root/ directory is chmoded to 700 or 710. But there is a way out, and here we should read this: http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options and this: http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#administrator_settings

bayes_path /path/filename (default: ~/.spamassassin/bayes)

This is the directory and filename for Bayes databases. Several databases will be created, with this as the base directory and filename, with _toks, _seen, etc. appended to the base. The default setting results in files called ~/.spamassassin/bayes_seen, ~/.spamassassin/bayes_toks, etc.

By default, each user has their own in their ~/.spamassassin directory with mode 0700/0600. For system-wide SpamAssassin use, you may want to reduce disk space usage by sharing this across all users. However, Bayes appears to be more effective with individual user databases.

bayes_file_mode (default: 0700)

The file mode bits used for the Bayesian filtering database files.

Make sure you specify this using the 'x' mode bits set, as it may also be used to create directories. However, if a file is created, the resulting file will not have any execute bits set (the umask is set to 111). The argument is a string of octal digits, it is converted to a numeric value internally.
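
The effect of that umask can be worked out with plain shell arithmetic: the resulting file mode is the requested mode with the umask bits cleared. A small sketch, where effective_mode is a hypothetical helper used only for illustration:

```shell
# Print the mode a newly created file ends up with: the requested mode
# with the umask bits cleared. effective_mode is a hypothetical helper.
effective_mode() {
    # $1 = requested mode, $2 = umask, both as octal digits.
    # The leading 0 makes the shell read them as octal numbers.
    printf '%04o\n' $(( 0$1 & ~0$2 ))
}

effective_mode 700 111   # prints 0600, matching the default 0700/0600 above
effective_mode 777 111   # prints 0666
```

So a directory created with the default 0700 stays closed to other users, and the files inside it end up as 0600.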

I guess it should be clear now; I don't have anything else to add.

As you probably saw, my Bayes database is located in /etc/mail/spamassassin/bayes-shared/, and anybody can read it and write to it if needed. Of course there is also a way to store the Bayes database in a MySQL table, but if you want details, please Google it, as Directadmin has very little to do with SA.
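
For reference, a shared setup like this could look roughly as follows in /etc/mail/spamassassin/local.cf. Treat it as a sketch: the bayes_path value is inferred from the file names in the debug output above, the 0777 mode is an assumption (pick something stricter if all scanning runs under one account), and the directory has to exist and be writable by every user that runs SpamAssassin.

```
# Base path: SpamAssassin appends _toks, _seen, etc. to it.
bayes_path /etc/mail/spamassassin/bayes-shared/bayes

# Assumed value; keep the 'x' bits set, as the mode may also be used for directories.
bayes_file_mode 0777
```

After changing this, sa-learn with -D should report the new path in its "tie-ing to DB file" lines.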


p.s. I hope it's now clear why I gave you the examples in my previous posts and why I wanted you to read the docs. And I hope you'll read the docs.
 
bayes_path /path/filename (default: ~/.spamassassin/bayes)

This is the directory and filename for Bayes databases. Several databases will be created, with this as the base directory and filename, with _toks, _seen, etc. appended to the base. The default setting results in files called ~/.spamassassin/bayes_seen, ~/.spamassassin/bayes_toks, etc.

By default, each user has their own in their ~/.spamassassin directory with mode 0700/0600. For system-wide SpamAssassin use, you may want to reduce disk space usage by sharing this across all users. However, Bayes appears to be more effective with individual user databases.

bayes_file_mode (default: 0700)

The file mode bits used for the Bayesian filtering database files.

Make sure you specify this using the 'x' mode bits set, as it may also be used to create directories. However, if a file is created, the resulting file will not have any execute bits set (the umask is set to 111). The argument is a string of octal digits, it is converted to a numeric value internally.
This is really the only part that is relevant to me, and I thank you for pointing it out. Over the past decade I have administered several servers, both with cPanel and with no control panel at all, and the bayes database was always global, so I was certain that it was due to DirectAdmin that it's NOT. Apparently, I was wrong. Thanks for your help, and also for the comments on why I can't just go ahead and specify the files under the root path as the new global bayes.
 