Abstract:
SPAM is sometimes expanded as Short Pointless Annoying Messages, although such messages
are not always short. Unsolicited bulk email (UBE) or unsolicited commercial email (UCE) is
the practice of sending unwanted email messages, frequently with commercial content,
in large quantities to an indiscriminate set of recipients. With the exponential
growth of the Internet, it becomes harder every day to separate useful information from
spam mail. Spammers are becoming smarter and constantly update their techniques to
outsmart conventional 'static' methods such as keyword filters, blacklists,
whitelists, and collaborative filtering. These technologies are not obsolete, but they
cannot be relied upon alone to solve the email spam problem. In these circumstances, a
spam filter based on Bayesian filtering has been developed. This
filter continuously adapts and learns by itself, resulting in a dynamic, adaptive,
'statistical intelligence' technique. Hence the filter can be personalized, that is,
tailored to a particular user or organization, and it becomes more closely fitted to
that user's mail day by day.
To implement the project, the first effort went into finding a way to apply
Bayes' conditional probability theory to this problem. Next, several tokenization
approaches were examined to find the one that makes content-based spam filtering
most successful. After tokenization, the SPAM and HAM training databases were
created, which was one of the most challenging and time-consuming parts
of the project. Finally, spam filtering was carried out with the filter, which was
successfully made more personalized day by day, as its name implies.
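The steps above (tokenization, SPAM and HAM training data, and Bayes' rule) can be
sketched as follows. This is only a minimal illustration under stated assumptions, not
the project's actual implementation: the names tokenize and PersonalBayesFilter, the
simple word tokenizer, the add-one smoothing, and the naive independence assumption
between tokens are all choices made here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (one simple, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PersonalBayesFilter:
    """Hypothetical minimal Bayesian spam filter that keeps learning from new mail."""

    def __init__(self):
        # Number of training messages of each class that contain a given token.
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        # Number of training messages seen for each class.
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labeled message ('spam' or 'ham') to the training data."""
        self.token_counts[label].update(set(tokenize(text)))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | tokens) with Bayes' rule, treating tokens as independent."""
        total = sum(self.message_counts.values())
        tokens = set(tokenize(text))
        scores = {}
        for label in ("spam", "ham"):
            n = self.message_counts[label]
            # Log prior with add-one smoothing so an untrained filter returns 0.5.
            score = math.log((n + 1) / (total + 2))
            for token in tokens:
                # Smoothed estimate of P(token appears | class).
                score += math.log((self.token_counts[label][token] + 1) / (n + 2))
            scores[label] = score
        # Convert the log-odds to a probability; clamp to avoid overflow in exp().
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    mail_filter = PersonalBayesFilter()
    mail_filter.train("win cash now", "spam")
    mail_filter.train("cheap cash offer", "spam")
    mail_filter.train("meeting agenda for monday", "ham")
    mail_filter.train("lunch on monday", "ham")
    print(mail_filter.spam_probability("win cash offer"))
    print(mail_filter.spam_probability("monday meeting"))
```

Because train can be called at any time, every message the user marks as SPAM or HAM
shifts the token statistics, which is how a filter of this kind becomes specific to
one user or organization over time.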