Re: spam filtering
- To: http://www.gnosis.cx/~mertz
- Subject: Re: spam filtering
- From: http://dummy.us.eu.org/robert (Robert)
- Date: Tue, 1 Oct 2002 17:31:03 -0400
- In-Reply-To: <http://www.gnosis.cx/~Docm9kKkX8be092yn>
> From: http://www.gnosis.cx/~mertz (David Mertz, Ph.D.)
> Date: Tue, 01 Oct 2002 12:14:59 -0400
>
> http://dummy.us.eu.org/robert (Robert) wrote:
> |Your article might have also mentioned ifile
> |http://www.ai.mit.edu/~jrennie/ifile
>
> Yeah, maybe. But it didn't look as easy to set up in the context of my
> corpora statistics as were the little tools I used. I guess I could
> have listed ifile in the Resources, but there are literally dozens of
> different implementations of Bayesian filtering, and I didn't want to
> list them all.
Fair enough. ifile has been around for four years and works specifically on
e-mail. It is more general than any spam-filtering tool, though less general
than other Bayesian filtering tools.
> I included Bogofilter in Resources because, from its description, it did
> some work on improving the lexing algorithm. But even there, I -tested-
> with a quick Python program. I don't think the results would differ
> that much... and my point was to look at the conceptual area, not the
> individual tools.
bogofilter is still in its infancy, but it has potential. (I admit that
ifile ends up filtering on some funky "words": tokens like "<table" and
"<font" turn out to be indicators of spam in my training set of 5000 spam
messages. bogofilter may eventually be better at recognizing real words as
indicators of spam.)
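To illustrate how a fragment like "<table" can end up as a spam indicator,
here is a minimal Python sketch of per-token scoring. It is not ifile's or
bogofilter's actual algorithm; the tokenizer, the add-one smoothing and the
names tokens() and spam_indicators() are all assumptions made for the
example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: break on whitespace and ">" so that HTML tag
# openers such as "<table" and "<font" survive as tokens of their own.
TOKEN_RE = re.compile(r"[^\s>]+")


def tokens(text):
    """Return the set of lowercased tokens that occur in one message."""
    return {t.lower() for t in TOKEN_RE.findall(text)}


def spam_indicators(spam_msgs, ham_msgs):
    """Map each token to a rough score in (0, 1); higher means spammier."""
    spam_counts = Counter(t for m in spam_msgs for t in tokens(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokens(m))
    scores = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Add-one smoothing (an assumption) keeps unseen tokens off 0 and 1.
        p_spam = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        p_ham = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        scores[tok] = p_spam / (p_spam + p_ham)
    return scores


if __name__ == "__main__":
    spam = ["<table border=0><font size=2>Buy now",
            "<table><font color=red>Cheap pills"]
    ham = ["Lunch at noon tomorrow?", "The patch is attached"]
    ranked = sorted(spam_indicators(spam, ham).items(),
                    key=lambda kv: kv[1], reverse=True)
    for tok, score in ranked[:5]:
        print(f"{score:.2f}  {tok}")
```

With a training set dominated by HTML mail, markup tokens float to the top
simply because they appear in nearly every spam message and in few
legitimate ones.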
> |don't know about Pyzor, but Razor uses "fuzzy matching" (nilsimsa) using a
> |moving hash (sort of like rsync & the gdiff format). Although I haven't
> |looked at your trigram code, I imagine that it might be a similar idea
> |(probably the fuzzy matching code is far more complex than the trigram
> |stuff, 'though).
>
> Not really the same. Nilsimsa is interesting, as is the general concept
> of a fuzzy hash. But my use of trigrams has nothing to do with hashing.
> The set of trigrams in a message is simply a bunch of data points for my
> tool. Just trigrams instead of words, like I wrote. There's nothing
> particularly complex or interesting about my trigram tool... except the
> fact it actually *works* rather well.
That's interesting; you've piqued my interest. I tried downloading your
code, but it wouldn't work. What version of Python do you use?
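As a rough illustration of "trigrams instead of words", the sketch below
collects the overlapping three-character sequences in a message. It is not
the code from the article; the function name trigrams() and the
normalization step (lowercasing, collapsing whitespace) are assumptions.

```python
def trigrams(text):
    """Return the set of overlapping three-character sequences in text."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    norm = " ".join(text.lower().split())
    return {norm[i:i + 3] for i in range(len(norm) - 2)}


if __name__ == "__main__":
    print(sorted(trigrams("Free  money")))
```

Each trigram can then be counted in spam and non-spam mail exactly as a
word would be, with no hashing involved.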
Finally, just a note: your message was marked as spam by my filter because
there are spaces at the end of your Message-ID: line. Luckily, I caught
this false positive and fixed my filter. (RFC 822 says that spaces are
allowed anywhere around the values of header fields, so my filter was
indeed wrong.)
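For the curious, a whitespace-tolerant check might look like the Python
sketch below. This is not my actual filter rule; the function name
message_id_looks_valid() and the deliberately loose pattern for a
Message-ID value are assumptions for illustration.

```python
import re

# Assumed, deliberately loose shape for a Message-ID value: <left@right>
MSGID_RE = re.compile(r"^<[^<>\s]+@[^<>\s]+>$")


def message_id_looks_valid(header_line):
    """Check a raw 'Message-ID: ...' line, ignoring surrounding whitespace."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "message-id":
        return False
    # Strip leading and trailing whitespace before matching, so trailing
    # spaces no longer cause a false positive.
    return bool(MSGID_RE.match(value.strip()))


if __name__ == "__main__":
    print(message_id_looks_valid("Message-ID: <abc123@example.com>   "))
```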
> --
> _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
> _/_/ ~~~~~~~~~~~~~~~~~~~~[http://www.gnosis.cx/~mertz]~~~~~~~~~~~~~~~~~~~~~ _/_/
> _/_/ The opinions expressed here must be those of my employer... _/_/
> _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/