
Re: clustering



 > From: leonid leibman <http://profiles.yahoo.com/lleibman>
 > Date: Thu, 31 Mar 2005 07:43:52 -0800 (PST)
 >
 > Hi, Robert -- I was reading recently about clustering
 > search engines (like clusty). Do you know what
 > technology is behind it?

No, not specifically.  I think it clusters by words and may even use an
encyclopedia.  (Clusty used to be Vivisimo and, at the time, I was very
impressed.  It's still pretty neat, but I don't know exactly how it works.)

 > In general what's your opinion about the state of the
 > art in clustering (if any :)? Google says that it's
 > not using it (yet) since it is of limited use.

In the Clusty sense, I think it is of limited use.  But I think combining
recommendation-type systems (where users have certain interests) with
search engines could be really great.  http://www.directhit.com was
working in this direction before Ask Jeeves acquired it and quashed that
part completely.

 > links_2_links clustering (disambiguation) was kind of weak
 > and I was trying to work on some ideas to improve it
 > but I'm realizing that the state of the art may have
 > improved quite a bit since then.

Links_2_Links had almost no disambiguation.  It's an extremely hard problem and
there may be research written about it, but I know of no specific
technology which addresses disambiguation.

 > Also, do you know of any simple data
 > extraction/classification freeware/shareware?

I don't know what you mean by that.  My spam filter uses ifile, which uses
Naive Bayes for its classification.  It is word-based (although I augment
that by combining word pairs through a separate program).  There are a
number of open source machine learning libraries.  I think I remember that
I was most impressed with Torch.
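
For illustration only, here is a minimal sketch of word-based Naive Bayes
with word pairs added as extra features.  It is not ifile's code or the
separate word-pair program mentioned above; the names (features,
NaiveBayes), the underscore-joined pair format, and the add-one smoothing
are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Return single words plus adjacent word pairs (hypothetical format)."""
    words = text.lower().split()
    pairs = [a + "_" + b for a, b in zip(words, words[1:])]
    return words + pairs


class NaiveBayes:
    """Tiny multinomial Naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> training documents seen
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()                       # every feature seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for f in features(text):
            self.feat_counts[label][f] += 1
            self.vocab.add(f)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total = sum(self.feat_counts[label].values())
            for f in features(text):
                count = self.feat_counts[label][f]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would look like clf = NaiveBayes(), then clf.train(message, "spam")
for each labeled message, then clf.classify(new_message).  The pair
features let the classifier treat a phrase differently from its
individual words.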

 > On a different topic, as far as interfaces to search
 > go I'm imagining that one can have drag and drop
 > boxes. Say 3 ("Good match", "Irrelevant match" and
 > "Undesirable match"). As a first step the user enters
 > keywords. Then he can place the results in the boxes
 > and the priorities of the results (and thus the
 > results you'll see) will change accordingly. This is
 > possible if there is a good clustering software behind
 > the scenes. Does this sound like a good idea to you?

Sure, though Amazon, Netflix, and MovieLens use a simple star-based
rating system, which may be faster for most users than dragging items.  I
don't know.
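
As a rough sketch of the three-box idea, the feedback could be turned into
per-word weights that re-order the remaining results.  This uses simple
word overlap rather than real clustering, and the function name (rerank)
and the weights (+1.0, -0.5, -1.0) are assumptions chosen only for
illustration.

```python
from collections import Counter


def rerank(results, good, irrelevant, undesirable):
    """Re-order result strings using words from the user's three boxes."""
    weights = Counter()
    # Assumed weights: reward "good" words, penalize the other two boxes.
    for box, delta in ((good, 1.0), (irrelevant, -0.5), (undesirable, -1.0)):
        for text in box:
            for word in set(text.lower().split()):
                weights[word] += delta

    def score(text):
        # Sum the learned weights of the distinct words in this result.
        return sum(weights[w] for w in set(text.lower().split()))

    # sorted() is stable, so ties keep their original search order.
    return sorted(results, key=score, reverse=True)
```

Each time the user drops a result into a box, the list would be re-sorted
with the updated boxes.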

 > Leonid
 > 
 > Leonid Leibman



