In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated sequence of the most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It summarizes the result of a multiple sequence alignment, in which related sequences are compared with one another to identify shared sequence motifs. Such information is important when considering sequence-dependent enzymes such as RNA polymerase.
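The per-position calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `consensus`, the alphabetical tie-breaking rule, and the toy alignment are all assumptions made for the example, and real tools also handle gaps, ambiguity codes, and frequency thresholds.

```python
from collections import Counter


def consensus(alignment):
    """Return the most frequent residue at each column of an alignment.

    `alignment` is a list of equal-length strings (already aligned).
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        # Assumption: ties are broken alphabetically so the output is deterministic.
        best = min(counts, key=lambda residue: (-counts[residue], residue))
        result.append(best)
    return "".join(result)


# Toy alignment invented for illustration only.
aligned = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATACT"]
print(consensus(aligned))  # prints TATAAT
```

Note that the output keeps only one residue per column, so the variation visible in the third, fourth and fifth columns of the toy alignment is lost.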
Consensus sequences reduce the variability at each position to a single residue. To address this limitation, sequence logos provide a richer visual representation of aligned sequences. A logo displays each position as a stack of letters (nucleotides or amino acids), where the height of each letter within the stack is proportional to its frequency at that position, and the total stack height reflects the information content of the position (measured in bits). The most frequent residue appears at the top of the stack, which preserves the consensus while also revealing subtler patterns, such as less frequent but functionally important residues (for example, in alternative start codons or transcription factor binding sites).
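The stack and letter heights of a logo can be computed directly from the column frequencies. The sketch below assumes a DNA alphabet of four letters, defines information content as the maximum entropy (2 bits for DNA) minus the observed Shannon entropy of the column, and omits the small-sample correction that logo software usually applies. The function name `logo_heights` and its return format are hypothetical.

```python
import math
from collections import Counter


def logo_heights(alignment, alphabet="ACGT"):
    """Return, per column, (information content in bits, {residue: letter height}).

    Assumption: no small-sample correction; characters outside `alphabet`
    (such as gaps) are ignored when computing frequencies.
    """
    max_bits = math.log2(len(alphabet))
    stacks = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = sum(counts[r] for r in alphabet)
        if total == 0:
            # Column contains no residue from the alphabet (e.g. all gaps).
            stacks.append((0.0, {}))
            continue
        freqs = {r: counts[r] / total for r in alphabet if counts[r] > 0}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = max_bits - entropy
        # Each letter's height is its frequency times the stack height.
        heights = {r: f * info for r, f in freqs.items()}
        stacks.append((info, heights))
    return stacks


# Toy alignment invented for illustration only.
for info, heights in logo_heights(["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATACT"]):
    print(round(info, 2), {r: round(h, 2) for r, h in heights.items()})
```

Under these assumptions a fully conserved DNA column has a stack height of 2 bits, while a column split evenly between two bases has a stack height of 1 bit, with each letter 0.5 bits tall.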
[Figure: Example of a consensus sequence of nucleotides]
Biological significance
A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations. For example, many transcription factors recognize particular patterns in the promoters of the genes they regulate. In the same way, restriction enzymes usually have palindromic consensus sequences, usually corresponding to the site where they cut the DNA. Transposons act in much the same manner in their identification of target sequences for transposition. Finally, splice sites (sequences immediately surrounding the exon-intron boundaries) can also be considered as consensus sequences.
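As a small illustration of a palindromic recognition site, the sketch below checks whether a site equals its own reverse complement and lists where it occurs in a sequence. It uses GAATTC, the recognition site of the restriction enzyme EcoRI; the helper names and the short test sequence are made up for the example, and only the four unambiguous DNA bases are handled.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if it reads the same on both strands."""
    return site == reverse_complement(site)


def find_sites(sequence, site):
    """Return the 0-based start positions of every match, including overlaps."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


print(is_palindromic("GAATTC"))                    # prints True
print(find_sites("TTGAATTCAAGGAATTCG", "GAATTC"))  # prints [2, 11]
```

Because the site is palindromic, searching one strand is enough: every match on the forward strand is also a match at the same location on the reverse strand.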
Thus a consensus sequence is a model for a putative DNA binding site: it is obtained by aligning all known examples of a certain recognition site and is defined as the idealized sequence that represents the predominant base at each position. Actual examples should not differ from the consensus by more than a few substitutions, but counting mismatches in this way can lead to inconsistencies, because it treats every position and every substitution as equally important.
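Mismatch counting against a consensus amounts to a Hamming distance with a cutoff. The sketch below is illustrative only: the function names and the `max_mismatches` threshold are hypothetical, and the candidate sites are invented.

```python
def mismatches(site, consensus):
    """Count the positions at which a candidate site differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a != b for a, b in zip(site, consensus))


def matches_consensus(site, consensus, max_mismatches=1):
    """Accept a site if it differs at no more than `max_mismatches` positions.

    Assumption: the threshold is an arbitrary choice made for this example.
    """
    return mismatches(site, consensus) <= max_mismatches


consensus = "TATAAT"
for site in ["TATAAT", "TATGAT", "GATACT", "CATAAT", "TATAAC"]:
    print(site, mismatches(site, consensus), matches_consensus(site, consensus))
```

In this sketch CATAAT and TATAAC both score one mismatch and are accepted alike, even though one may alter a strongly conserved position and the other a variable one. Position-specific scoring matrices avoid this by weighting each position separately.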
Software
Bioinformatics tools can calculate and visualize consensus sequences. Examples include Jalview and UGENE.
See also
- Position-specific scoring matrix
- Regular expression — denoting multiple sequences of symbols in formal language theory
- Sequence motif
- Sequence logo
