> From: http://www.gnosis.cx/~mertz (David Mertz, Ph.D.)
> Date: Tue, 01 Oct 2002 12:14:59 -0400
>
> http://dummy.us.eu.org/robert (Robert) wrote:
> |Your article might have also mentioned ifile
> |http://www.ai.mit.edu/~jrennie/ifile
>
> Yeah, maybe. But it didn't look as easy to setup in the context of my
> corpora statistics as were the little tools I used. I guess I could
> have listed ifile in the Resources, but there are literally dozens of
> different implementations of Bayesian filtering, and I didn't want to
> list them all.

Fair enough. ifile has been around for 4 years and works specifically on
e-mail. It's more general than any spam filtering tool, though less
general than other Bayesian filtering tools.

> I included Bogofilter in Resources because, from its description, it did
> some work on improving the lexing algorithm. But even there, I -tested-
> with a quick Python program. I don't think the results would differ
> that much... and my point was to look at the conceptual area, not the
> individual tools.

bogofilter is still in its infancy. It has potential. (I admit that
ifile ends up filtering on some funky "words": it turns out that tokens
like "<table" and "<font" are indicators of spam in my training set of
5000 spam messages. bogofilter may eventually be better at recognizing
real words as indicators of spam.)

> |don't know about Pyzor, but Razor uses "fuzzy matching" (nilsimsa) using a
> |moving hash (sort of like rsync & the gdiff format). Although I haven't
> |looked at your trigram code, I imagine that it might be a similar idea
> |(probably the fuzzy matching code is far more complex than the trigram
> |stuff, 'though).
>
> Not really the same. Nilsimsa is interesting, as is the general concept
> of a fuzzy hash. But my use of trigrams has nothing to do with hashing.
> The set of trigrams in a message is simply a bunch of data points for my
> tool. Just trigrams instead of words, like I wrote. There's nothing
> particularly complex or interesting about my trigram tool... except the
> fact it actually *works* rather well.

That's interesting. You've piqued my interest. I tried downloading your
code, but it wouldn't work. What version of Python do you use?

Finally, just a note: your message was marked as spam by my filter
because there are spaces at the end of your Message-ID: line. Luckily, I
caught this "false positive" and fixed my filter. (The RFC 822 standard
says that spaces are allowed anywhere around the values of header fields,
so my filter is indeed wrong.)

> --
> _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
> _/_/ ~~~~~~~~~~~~~~~~~~~~[http://www.gnosis.cx/~mertz]~~~~~~~~~~~~~~~~~~~~~ _/_/
> _/_/ The opinions expressed here must be those of my employer... _/_/
> _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/
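
For anyone following the trigram-versus-word point above, here is a minimal sketch of what "trigrams instead of words" could look like. This is not the code from the article; the function names `trigrams` and `words` are made up for illustration, and collapsing whitespace before slicing is my own assumption about how one might normalize a message.

```python
def trigrams(text):
    """Return the set of overlapping 3-character substrings in text.

    Hypothetical helper for illustration only; not taken from the article.
    """
    # Assumption: collapse runs of whitespace so that line wrapping
    # does not change which trigrams a message produces.
    normalized = " ".join(text.split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def words(text):
    """Return the set of whitespace-delimited tokens in text."""
    return set(text.split())


if __name__ == "__main__":
    sample = "<font color=red>Buy now</font>"
    print(sorted(words(sample)))
    print(sorted(trigrams(sample)))
```

Either set can then be fed to the same statistical filter as a bag of data points; no hashing is involved, which is the difference from the nilsimsa approach.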