Re: spam filtering



 > From: http://www.gnosis.cx/~mertz (David Mertz, Ph.D.)
 > Date: Tue, 01 Oct 2002 12:14:59 -0400
 >
 > http://dummy.us.eu.org/robert (Robert) wrote:
 > |Your article might have also mentioned ifile
 > |http://www.ai.mit.edu/~jrennie/ifile
 > 
 > Yeah, maybe.  But it didn't look as easy to setup in the context of my
 > corpora statistics as were the little tools I used.  I guess I could
 > have listed ifile in the Resources, but there are literally dozens of
 > different implementations of Bayesian filtering, and I didn't want to
 > list them all.

Fair enough.  ifile has been around for 4 years and works specifically on
e-mail.  It's more general than any spam-filtering tool, though less general
than other Bayesian filtering tools.

 > I included Bogofilter in Resources because, from its description, it did
 > some work on improving the lexing algorithm.  But even there, I -tested-
 > with a quick Python program.  I don't think the results would differ
 > that much... and my point was to look at the conceptual area, not the
 > individual tools.

bogofilter is still in its infancy, but it has potential.  (I admit that
ifile ends up filtering on some funky "words": after training on my 5000
spam messages, tokens like "<table" and "<font" turn out to be indicators
of spam.  bogofilter may eventually be better at recognizing real words as
indicators of spam.)
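
To show how a token like "<table" can end up as a spam indicator, here is a
minimal sketch.  It is my own illustration, not ifile's or bogofilter's
actual algorithm: the whitespace tokenizer, the add-one smoothing, and the
function names are all assumptions, and it treats spam and non-spam as
equally likely beforehand.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace split (an assumption): HTML fragments such as
    # "<table" survive as tokens. Each token counts once per message.
    return set(text.lower().split())

def spam_scores(spam_msgs, ham_msgs):
    """Return {token: score in (0, 1)}; higher means more spam-like."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    scores = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Add-one smoothing keeps unseen tokens away from exactly 0 or 1.
        s = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        h = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        scores[tok] = s / (s + h)
    return scores
```

With a split like this, markup that appears in most spam and almost no
ordinary mail scores high, whether or not it is a real word.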

 > |don't know about Pyzor, but Razor uses "fuzzy matching" (nilsimsa) using a
 > |moving hash (sort of like rsync & the gdiff format).  Although I haven't
 > |looked at your trigram code, I imagine that it might be a similar idea
 > |(probably the fuzzy matching code is far more complex than the trigram
 > |stuff, 'though).
 > 
 > Not really the same.  Nilsimsa is interesting, as is the general concept
 > of a fuzzy hash.  But my use of trigrams has nothing to do with hashing.
 > The set of trigrams in a message is simply a bunch of data points for my
 > tool.  Just trigrams instead of words, like I wrote.  There's nothing
 > particularly complex or interesting about my trigram tool... except the
 > fact it actually *works* rather well.

That's interesting; you've piqued my interest.  I tried downloading your
code, but it wouldn't work.  What version of Python do you use?
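
For anyone following along, this is roughly what I understand "trigrams
instead of words" to mean.  It is a sketch under my own assumptions
(overlapping three-character substrings, with whitespace collapsed), not
your code, which I have not been able to run:

```python
def trigrams(text):
    """Return the set of overlapping three-character substrings of text."""
    # Collapse runs of whitespace so that line wrapping does not
    # produce extra, meaningless trigrams.
    normalized = " ".join(text.split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}
```

Each trigram would then be counted across the spam and non-spam corpora in
the same way a word would be.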

Finally, just a note: your message was marked as spam by my filter because
there are spaces at the end of your Message-ID: line.  Luckily, I caught
this "false positive" and fixed my filter.  (The RFC 822 standard says
that spaces are allowed anywhere around the values of header fields, so
my filter was indeed wrong.)
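
The fix amounts to stripping whitespace around the header value before
testing it.  Here is a minimal sketch of the idea; the pattern and the
function name are hypothetical and are not my actual filter rule:

```python
import re

# Hypothetical rule: a plausible Message-ID looks like <local@domain>.
MSGID_RE = re.compile(r"<[^<>\s@]+@[^<>\s@]+>")

def msgid_looks_valid(header_line):
    """Check a raw 'Message-ID: ...' line, ignoring surrounding whitespace."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "message-id":
        return False
    # Strip spaces around the value first, so that trailing whitespace
    # does not count against the message.
    return MSGID_RE.fullmatch(value.strip()) is not None
```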

 > --
 >     _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
 >    _/_/    ~~~~~~~~~~~~~~~~~~~~[http://www.gnosis.cx/~mertz]~~~~~~~~~~~~~~~~~~~~~  _/_/
 >   _/_/  The opinions expressed here must be those of my employer...   _/_/
 >  _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them!  _/_/



