The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.
Introduction
This algorithm can be used for any two strings. This guide will use two small DNA sequences as examples as shown in Figure 1:
GCATGCG
GATTACA
Constructing the grid
First construct a grid such as one shown in Figure 1 above. Start the first string in the top of the third column and start the other string at the start of the third row. Fill out the rest of the column and row headers as in Figure 1. There should be no numbers in the grid yet.
{| class="wikitable"
|-
!| || || G || C || A || T || G || C || G
|-
! scope="row" |
| || || || || || || ||
|-
! scope="row" | G
| || || || || || || ||
|-
! scope="row" | A
| || || || || || || ||
|-
! scope="row" | T
| || || || || || || ||
|-
! scope="row" | T
| || || || || || || ||
|-
! scope="row" | A
| || || || || || || ||
|-
! scope="row" | C
| || || || || || || ||
|-
! scope="row" | A
| || || || || || || ||
|}
Choosing a scoring system
Next, decide how to score each individual pair of letters. Using the example above, one possible alignment candidate might be:
12345678
The letters may match, mismatch, or be matched to a gap (a deletion or insertion (indel)):
- Match: The two letters at the current index are the same.
- Mismatch: The two letters at the current index are different.
- Indel (Insertion or Deletion): The best alignment involves one letter aligning to a gap in the other string.
Each of these scenarios is assigned a score and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores; some have been outlined in the Scoring systems section below. For now, the system used by Needleman and Wunsch It has been shown that it is possible to improve the running time to <math>O(mn/ \log n)</math> using the Method of Four Russians. Since the algorithm fills an <math>n \times m</math> table the space complexity is <math>O(mn).</math> by David Sankoff in 1972.
Similar quadratic-time algorithms were discovered independently
by T. K. Vintsyuk in 1968 for speech processing
("time warping"), and by Robert A. Wagner and Michael J. Fischer in 1974 for string matching.
Needleman and Wunsch formulated their problem in terms of maximizing similarity. Another possibility is to minimize the edit distance between sequences, introduced by Vladimir Levenshtein. Peter H. Sellers showed in 1974 that the two problems are equivalent.
The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. However, the algorithm is expensive with respect to time and space, proportional to the product of the length of two sequences and hence is not suitable for long sequences.
Recent development has focused on improving the time and space cost of the algorithm while maintaining quality. For example, in 2013, a Fast Optimal Global Sequence Alignment Algorithm (FOGSAA), suggested alignment of nucleotide/protein sequences faster than other optimal global alignment methods, including the Needleman–Wunsch algorithm. The paper claims that when compared to the Needleman–Wunsch algorithm, FOGSAA achieves a time gain of 70–90% for highly similar nucleotide sequences (with > 80% similarity), and 54–70% for sequences having 30–80% similarity.
Applications outside bioinformatics
Computer stereo vision
Stereo matching is an essential step in the process of 3D reconstruction from a pair of stereo images. When images have been rectified, an analogy can be drawn between aligning nucleotide and protein sequences and matching pixels belonging to scan lines, since both tasks aim at establishing optimal correspondence between two strings of characters.
Although in many applications image rectification can be performed, e.g. by camera resectioning or calibration, it is sometimes impossible or impractical since the computational cost of accurate rectification models prohibit their usage in real-time applications. Moreover, none of these models is suitable when a camera lens displays unexpected distortions, such as those generated by raindrops, weatherproof covers or dust. By extending the Needleman–Wunsch algorithm, a line in the 'left' image can be associated to a curve in the 'right' image by finding the alignment with the highest score in a three-dimensional array (or matrix). Experiments demonstrated that such extension allows dense pixel matching between unrectified or distorted images.
Artificial neural networks
The Needleman–Wunsch algorithm has also been applied to the comparison of artificial neural network architectures. Inspired by how global sequence alignment identifies similarities in the aforementioned biological sequences, this approach represents neural networks as sequences of computational layers or blocks. By aligning these sequences, architectural similarity can be quantified in a principled manner.
This method has been successfully used in neural architecture search (NAS), particularly in approaches based on evolutionary algorithms. Architectural similarity can be assessed without requiring model training, enabling more efficient guidance of diversity during population-based search, preventing premature convergence.
See also
- Wagner–Fischer algorithm
- Smith–Waterman algorithm
- Sequence mining
- Levenshtein distance
- Dynamic time warping
- Sequence alignment
References
External links
- NW-align: A protein sequence-to-sequence alignment program by Needleman-Wunsch algorithm (online server and source code)
- A live demonstration of the algorithm
