http://dummy.us.eu.org/robert (Robert) wrote:
|Your article might have also mentioned ifile
|http://www.ai.mit.edu/~jrennie/ifile

Yeah, maybe.  But it didn't look as easy to set up in the context of my
corpora statistics as the little tools I used.  I guess I could have
listed ifile in the Resources, but there are literally dozens of
different implementations of Bayesian filtering, and I didn't want to
list them all.

I included Bogofilter in Resources because, from its description, it did
some work on improving the lexing algorithm.  But even there, I -tested-
with a quick Python program.  I don't think the results would differ
that much... and my point was to look at the conceptual area, not the
individual tools.

|don't know about Pyzor, but Razor uses "fuzzy matching" (nilsimsa) using a
|moving hash (sort of like rsync & the gdiff format).  Although I haven't
|looked at your trigram code, I imagine that it might be a similar idea
|(probably the fuzzy matching code is far more complex than the trigram
|stuff, 'though).

Not really the same.  Nilsimsa is interesting, as is the general concept
of a fuzzy hash.  But my use of trigrams has nothing to do with hashing.
The set of trigrams in a message is simply a bunch of data points for my
tool.  Just trigrams instead of words, like I wrote.

There's nothing particularly complex or interesting about my trigram
tool... except the fact it actually *works* rather well.

--
_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
_/_/ ~~~~~~~~~~~~~~~~~~~~[http://www.gnosis.cx/~mertz]~~~~~~~~~~~~~~~~~~~~~ _/_/
_/_/ The opinions expressed here must be those of my employer... _/_/
_/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/
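
To make "trigrams instead of words" concrete, here is a minimal sketch.
It is not the tool from the article: the function names (trigrams,
words, toy_score) and the scoring rule are hypothetical, chosen only to
show that the trigram set is plain feature data, with no hashing
involved.

```python
def trigrams(text):
    """Return the set of overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def words(text):
    """Return the set of whitespace-separated words, for comparison."""
    return set(text.split())


def toy_score(features, spam_features, ham_features):
    """Hypothetical toy score in [0, 1]; higher means more spam-like.

    Counts features of the message seen only in the spam corpus versus
    only in the ham corpus. A real filter would weight by frequency;
    this is an assumption made to keep the sketch short.
    """
    spam_only = features & (spam_features - ham_features)
    ham_only = features & (ham_features - spam_features)
    total = len(spam_only) + len(ham_only)
    if total == 0:
        return 0.5  # no evidence either way
    return len(spam_only) / total


# The same scoring function accepts either kind of feature set.
spam_corpus = trigrams("buy now")
ham_corpus = trigrams("hello there")
message_score = toy_score(trigrams("buy"), spam_corpus, ham_corpus)
```

Swapping trigrams() for words() changes only which data points feed
the score; nothing else in the sketch has to change.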