Bayesian Learning tips and tricks

As a spam-fighting tool, Bayesian filtering 'learns' to distinguish junk mail from legitimate mail by analyzing the headers, subject and content of received messages that are known for certain to be either spam or non-spam. The Bayesian process assigns a spam probability to each word, domain name, HTML code or other 'token' in each message, and then uses this data to determine whether new incoming messages are likely to be spam or non-spam. Because the analysis is based on the messages received at each email server, the token probabilities are site-specific. As part of the Bayesian process, MDaemon provides tools for setting up separate folders to receive copies of messages known to be spam and messages known to be legitimate. Bayesian filtering obtains its data by analyzing the messages in these folders. As new known spam and non-spam are regularly added to the Bayesian system, the filter becomes more reliable over time at distinguishing between the two on that server. MDaemon uses the Bayesian results to further refine the 'scores' it assigns to messages.
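To make the idea of per-token spam probabilities concrete, here is a minimal Python sketch of a naive Bayes style scorer. It is an illustration only and not MDaemon's actual implementation: the function names, the simple word tokenizer and the add-one smoothing are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into a set of lowercase tokens (a simplification;
    real filters also tokenize headers, domain names and HTML)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def train(spam_msgs, ham_msgs):
    """Count in how many known spam and known non-spam messages each token appears."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    return spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)

def token_spam_probability(token, model):
    """Estimate the probability that a message containing this token is spam."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Add-one smoothing (an assumption of this sketch) keeps unseen tokens neutral.
    p_in_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_in_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_in_spam / (p_in_spam + p_in_ham)

def message_spam_probability(text, model):
    """Combine the token probabilities of a new message into one spam probability."""
    log_odds = 0.0
    for token in tokenize(text):
        p = token_spam_probability(token, model)
        log_odds += math.log(p) - math.log(1 - p)
    # Clamp so that math.exp cannot overflow on very long messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1 / (1 + math.exp(-log_odds))

# Tiny hand-made training sets, standing in for the known spam / non-spam folders.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["meeting agenda for monday", "lunch on monday"]
model = train(spam, ham)
```

With this toy model, `message_spam_probability("buy cheap pills", model)` comes out above 0.5 and `message_spam_probability("monday meeting", model)` below it. Because the counts come entirely from the mail supplied for training, a different server with different mail would produce different token probabilities, which is why the data is site-specific.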

The Bayesian filter will start scoring messages after it has been fed 200 spam and 200 non-spam messages. Its accuracy depends on how it is fed. Here are some tips:

  1. The more the merrier - continue to feed the spam and non-spam folders even after scoring has started.  Try to feed the filter similar amounts of each type of mail.  (If anything, feed it more non-spam than spam.)
  2. Feed it a variety of messages - it's important to feed it both spam and non-spam from varied sources and with varied content.  Also, don't feed it only mail addressed to a single user.
  3. Manually review messages before they are used in the learning process - someone needs to review each message to make sure it is in the right folder.  Unsupervised learning is one of the easiest ways to ruin all your hard work.
  4. Feed it mistakes as well as mail that wasn't scored - if the Bayesian filter (or the regular non-Bayes scoring) scores a message incorrectly, feed that message to the filter.  You should also feed it messages that Bayes did not score at all.
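
The sample-count and balance advice above can be checked with a small script. The following Python sketch is hypothetical: it assumes the learning folders are ordinary directories holding one file per message, that the 200-message threshold applies to each class separately, and that a spam to non-spam ratio above 1.5 counts as "too much spam". None of these details come from MDaemon itself, so adjust them to your installation.

```python
from pathlib import Path

MIN_SAMPLES = 200  # threshold mentioned in the text; assumed here to apply per class

def count_messages(folder):
    """Count the files in a learning folder (assumes one file per message)."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())

def training_status(n_spam, n_ham, min_samples=MIN_SAMPLES, max_spam_ratio=1.5):
    """Return a list of warnings about the training set; an empty list means OK.

    max_spam_ratio is an arbitrary illustrative cut-off, not an MDaemon setting.
    """
    warnings = []
    if n_spam < min_samples:
        warnings.append(f"only {n_spam} spam samples; need {min_samples}")
    if n_ham < min_samples:
        warnings.append(f"only {n_ham} non-spam samples; need {min_samples}")
    if n_spam > max_spam_ratio * max(n_ham, 1):
        warnings.append("spam heavily outnumbers non-spam; add more non-spam")
    return warnings

# Example usage (the folder paths are placeholders, not real MDaemon paths):
#   n_spam = count_messages("/path/to/spam-learning-folder")
#   n_ham = count_messages("/path/to/non-spam-learning-folder")
#   for warning in training_status(n_spam, n_ham):
#       print(warning)
```

A script like this only checks quantity and balance. It cannot replace the manual review in tip 3, since it has no way of knowing whether a message was filed in the wrong folder.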