Wednesday, March 7, 2007

Levenshtein Distance

The Levenshtein distance algorithm is used in the FuzzyQuery class.

What is Levenshtein Distance? [http://www.merriampark.com/ld.htm]

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example, if s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t. The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
The Levenshtein distance algorithm has been used in:
Spell checking, speech recognition, DNA analysis, plagiarism detection
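
To make the definition concrete, here is a minimal Python sketch of the usual dynamic-programming way to compute the distance. It is an illustration only, not the code FuzzyQuery uses; the function name levenshtein is my own choice.

```python
def levenshtein(s: str, t: str) -> int:
    """Return the minimum number of single-character edits that turn s into t."""
    # previous[j] is the distance between the processed prefix of s
    # and the first j characters of t. For an empty prefix of s,
    # that distance is simply j insertions.
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        # Turning the first i characters of s into an empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "tent"))  # 1
```

Only two rows of the table are kept at a time, so memory grows with the length of t rather than with the product of both lengths.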


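FuzzyQuery turns the distance into a similarity score: similarity = 1 - distance / min(textlen, targetlen), where distance is the Levenshtein distance between the two terms and textlen and targetlen are their lengths. A score of 1 means the terms are identical.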

FuzzyQuery enumerates all terms in an index to find terms within the allowable threshold. Use this type of query sparingly, or at least with the knowledge of how it works and the effect it may have on performance.
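
The following Python sketch shows why that enumeration can be costly: every term is scored with the formula above and compared against a threshold. This is a simplified model, not Lucene's code. The names edit_distance, similarity, and fuzzy_terms, the 0.5 threshold, the strict "greater than" comparison, and the handling of empty strings are all assumptions made for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between s and t (row-by-row dynamic programming)."""
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(text: str, target: str) -> float:
    """Score from the formula 1 - distance / min(textlen, targetlen)."""
    shortest = min(len(text), len(target))
    if shortest == 0:
        # Assumption: avoid dividing by zero when either string is empty.
        return 1.0 if len(text) == len(target) else 0.0
    return 1.0 - edit_distance(text, target) / shortest


def fuzzy_terms(index_terms, target, min_similarity=0.5):
    """Scan every term and keep those scoring above the threshold."""
    return [term for term in index_terms
            if similarity(term, target) > min_similarity]


terms = ["test", "tent", "toast", "lucene"]
print(fuzzy_terms(terms, "test"))  # ['test', 'tent']
```

Each call runs one distance computation per term, so the work grows with the number of distinct terms in the index, which is the performance cost the warning above refers to.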
