Edit distance

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Different definitions of an edit distance use different sets of like operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance:

{| class="wikitable"

|+ Algorithms and the operations allowed

!Algorithm

!Insertions

!Deletions

!Substitutions

!Transposition

|Levenshtein distance

|✓

|Longest common subsequence (LCS)

|✓

|Hamming distance

|✓

|Damerau–Levenshtein distance

|✓

|Jaro distance

|✓

Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings and on an alphabet (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance is the minimum-weight series of edit operations that transforms into . One of the simplest sets of edit operations is that defined by Levenshtein in 1966:

Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes into .

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from <math>a = a_1\ldots a_m</math> to <math>b = b_1\ldots b_n</math> is given by <math>d_{mn}</math>, defined by the recurrence although it has a history of multiple invention. A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.

Improved algorithms

Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants, one of which takes two strings and a maximum edit distance , and returns . It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time , where and are the lengths of the strings. Space complexity is or , depending on whether the edit sequence needs to be read off.

For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson having worst case runtime of O(nm/logn).

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.

Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string , called the pattern, and a constant ; it then builds a deterministic finite state automaton that finds, in an arbitrary string , a substring whose edit distance to is at most (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.<math>d(L,x) = \min_{y \in L} d(x,y)

</math>, where <math>d(x,y)</math> is the string edit distance. When the language L is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance. For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.

Types of edit distance

Formal definition and properties

Common algorithm

Improved algorithms

Applications

See also

References