Serial analysis of gene expression

thumb|500x500px|Summary of SAGE. Within the organisms, genes are [[Transcription (biology)|transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism, and reverse transcriptase is used to copy the mRNA into stable double-stranded–cDNA (ds-cDNA; blue). In SAGE, the ds-cDNA is digested by restriction enzymes (at location 'X' and 'X'+11) to produce 11-nucleotide 'tag' fragments. These tags are concatenated and sequenced using long-read Sanger sequencing (different shades of blue indicate tags from different genes). The sequences are deconvoluted to find the frequency of each tag. The tag frequency can be used to report on transcription of the gene that the tag came from.]]

Serial Analysis of Gene Expression (SAGE) is a transcriptomic technique used by molecular biologists to produce a snapshot of the messenger RNA population in a sample of interest in the form of small tags that correspond to fragments of those transcripts. Several variants have been developed since, most notably a more robust version, LongSAGE, RL-SAGE and the most recent SuperSAGE. Many of these have improved the technique with the capture of longer tags, enabling more confident identification of a source gene.

Overview

Briefly, SAGE experiments proceed as follows:

The mRNA of an input sample (e.g. a tumour) is isolated and a reverse transcriptase and biotinylated primers are used to synthesize cDNA from mRNA.
The cDNA is bound to Streptavidin beads via interaction with the biotin attached to the primers, and is then cleaved using a restriction endonuclease called an anchoring enzyme (AE). The location of the cleavage site and thus the length of the remaining cDNA bound to the bead will vary for each individual cDNA (mRNA).
The cleaved cDNA downstream from the cleavage site is then discarded, and the remaining immobile cDNA fragments upstream from cleavage sites are divided in half and exposed to one of two adaptor oligonucleotides (A or B) containing several components in the following order upstream from the attachment site: 1) Sticky ends with the AE cut site to allow for attachment to cleaved cDNA; 2) A recognition site for a restriction endonuclease known as the tagging enzyme (TE), which cuts about 15 nucleotides downstream of its recognition site (within the original cDNA/mRNA sequence); 3) A short primer sequence unique to either adaptor A or B, which will later be used for further amplification via PCR.
After adaptor ligation, cDNA are cleaved using TE to remove them from the beads, leaving only a short "tag" of about 11 nucleotides of original cDNA (15 nucleotides minus the 4 corresponding to the AE recognition site).
The cleaved cDNA tags are then repaired with DNA polymerase to produce blunt end cDNA fragments.
These cDNA tag fragments (with adaptor primers and AE and TE recognition sites attached) are ligated, sandwiching the two tag sequences together, and flanking adaptors A and B at either end. These new constructs, called ditags, are then PCR amplified using anchor A and B specific primers.
The ditags are then cleaved using the original AE, and allowed to link together with other ditags, which will be ligated to create a cDNA concatemer with each ditag being separated by the AE recognition site.
These concatemers are then transformed into bacteria for amplification through bacterial replication.
The cDNA concatemers can then be isolated and sequenced using modern high-throughput DNA sequencers, and these sequences can be analysed with computer programs which quantify the recurrence of individual tags.

Analysis

The output of SAGE is a list of short sequence tags and the number of times it is observed. Using sequence databases a researcher can usually determine, with some confidence, from which original mRNA (and therefore which gene) the tag was extracted.

Statistical methods can be applied to tag and count lists from different samples in order to determine which genes are more highly expressed. For example, a normal tissue sample can be compared against a corresponding tumor to determine which genes tend to be more (or less) active.

History

In 1979 teams at Harvard and Caltech extended the basic idea of making DNA copies of mRNAs in vitro to amplifying a library of such in bacterial plasmids. In 1982–1983, the idea of selecting random or semi-random clones from such a cDNA library for sequencing was explored by Greg Sutcliffe and coworkers. and Putney et al. who sequenced 178 clones from a rabbit muscle cDNA library. In 1991 Adams and co-workers coined the term expressed sequence tag (EST) and initiated more systematic sequencing of cDNAs as a project (starting with 600 brain cDNAs). The identification of ESTs proceeded rapidly, millions of ESTs now available in public databases (e.g. GenBank).

In 1995, the idea of reducing the tag length from 100 to 800 bp down to tag length of 10 to 22 bp helped reduce the cost of mRNA surveys. In this year, the original SAGE protocol was published by Victor Velculescu at the Oncology Center of Johns Hopkins University. Robust LongSage (RL-SAGE) Further improved on the LongSAGE protocol with the ability to generate a library with an insert size of 50 ng mRNA, much smaller than previous LongSAGE insert size of 2 μg mRNA

SuperSAGE

SuperSAGE is a derivative of SAGE that uses the type III-endonuclease EcoP15I of phage P1, to cut 26 bp long sequence tags from each transcript's cDNA, expanding the tag-size by at least 6 bp as compared to the predecessor techniques SAGE and LongSAGE. The longer tag-size allows for a more precise allocation of the tag to the corresponding transcript, because each additional base increases the precision of the annotation considerably.

Like in the original SAGE protocol, so-called ditags are formed, using blunt-ended tags. However, SuperSAGE avoids the bias observed during the less random LongSAGE 20 bp ditag-ligation. By direct sequencing with high-throughput sequencing techniques (next-generation sequencing, i.e. pyrosequencing), hundred thousands or millions of tags can be analyzed simultaneously, producing very precise and quantitative gene expression profiles. Therefore, tag-based gene expression profiling also called "digital gene expression profiling" (DGE) can today provide most accurate transcription profiles that overcome the limitations of microarrays.

3'end mRNA sequencing, massive analysis of cDNA ends

In the mid 2010s several techniques combined with Next Generation Sequencing were developed that employ the "tag" principle for "digital gene expression profiling" but without the use of the tagging enzyme. The "MACE" approach, (=Massive Analysis of cDNA Ends) generates tags somewhere in the last 1500 bps of a transcript. The technique does not depend on restriction enzymes anymore and thereby circumvents bias that is related to the absence or location of the restriction site within the cDNA. Instead, the cDNA is randomly fragmented and the 3'ends are sequenced from the 5' end of the cDNA molecule that carries the poly-A tail. The sequencing length of the tag can be freely chosen. Because of this, the tags can be assembled into contigs and the annotation of the tags can be drastically improved. Therefore, MACE is also use for the analyses of non-model organisms. In addition, the longer contigs can be screened for polymorphisms. As UTRs show a large number of polymorphisms between individuals, the MACE approach can be applied for allele determination, allele specific gene expression profiling and the search for molecular markers for breeding. In addition, the approach allows determining alternative polyadenylation of the transcripts. Because MACE does only require 3’ ends of transcripts, even partly degraded RNA can be analyzed with less degradation dependent bias. The MACE approach uses unique molecular identifiers to allow for identification of PCR bias.