Robert's Implementation of a Fuzzy Hash Sum Program

This is a program which will generate an md5sum-like 128-bit checksum hex number which is the approximate, fuzzy "signature" of a message. It is very similar to nilsimsa (also here), but uses a completely different algorithm. fuzzysum depends upon dcdflib.c-1.1 (just for the student's (William T. Gosset's) t-score/t-test statistic function) which can be obtained here.

The fuzzysum.cc program may be obtained here. The algorithm works on alphanumeric-based words. I use this program to assure that I don't see the same or even a similar message twice when retrieving Usenet messages. The Message-ID field of the message is overwritten with the fuzzysum value and then using procmail's "formail -D" to cache the value in a file to see if the message has already been received. (People sometimes send the same message twice.)

Below is a snippet of how it might be used in a .procmailrc file:

:0hc
SUBJECT=|formail -x Subject:|cut -c2-71
:ac
HASH=|(echo "$SUBJECT";formail -I "")|fuzzysum
:ahf
|formail -i "Message-ID: $HASH"
:a
*?formail -D 122880 fuzzyxrefs
/dev/null

The goal, like md5sum, is to create a 128-bit number representing the [vague] "signature" of the message. Each word in the text is hashed and placed into one of 128 buckets. The count for that bucket is incremented for each "hit" upon that bucket. At the end, the mean of all the bucket counts is computed. For each bucket, any value above the confidence level (usually about 1 sigma, ~60%), that the bit representing that bucket gets a 1, otherwise it gets a 0. The resulting 128 bucket bits are joined together for a final hex output value.

An argument may be given to fuzzysum. This is the confidence level, a floating-point number. It defaults to 60 (for 60%). The lower this confidence level, the less "fuzzy" it becomes; the higher, the more so.

One improvement that may want to be made is to include multi-word hashing, as is done with the sparse binary polynomial matching algorithm in CRM114. I've experimented with string hashing (using the last 5 characters, 4 chars, 3 chars, etc.) (in the spirit of gdiff/ rsync) instead of the word-based tokenizing, but it performed worse. Jason Rennie has done some of this run-length hashing with successful results, so I may be doing something wrong.

Robert's Free Software

Date Last Modified: Wed Jun 21 12:30:17 UTC 2017