<!-- GenBank statistics sourced from gbrel.txt (updated every release):

https://ftp.ncbi.nih.gov/genbank/gbrel.txt

Release-specific notes follow pattern:

https://ftp.ncbi.nih.gov/genbank/release.notes/gbXXX.release.notes

-->

The GenBank sequence database is an open access, annotated collection of publicly available nucleotide sequences and their protein translations. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is produced and maintained by the National Center for Biotechnology Information (NCBI), a division of the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH).

As of GenBank release 271.0 (April 2026), the database contained 53.90 trillion bases and 6.27 billion sequence records, including 261,460,182 GenBank entries containing 7,289,942,983,522 base pairs of sequence data. The database includes sequences from more than 581,000 formally described species.

The database was established in 1982 by Walter Goad and the Los Alamos National Laboratory and has become a central resource for biological research. GenBank is built from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing projects. Data can be submitted through the NCBI Submission Portal, which provides guided workflows for data submission, or programmatically using tools such as table2asn. The legacy BankIt submission system is being phased out in favor of the Submission Portal.

Upon receipt of a submission, GenBank staff review the data for completeness, biological context, and consistency, assign an accession number, and perform quality assurance checks before release to the public database. Submitted sequences are accessible through Entrez and are available for download via FTP.
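As a minimal sketch of programmatic access through Entrez, the following Python builds an E-utilities EFetch request for a single nucleotide record in GenBank flat-file format. The helper names `build_efetch_url` and `fetch_record` are hypothetical, and the accession U49845 is used purely as an illustration; heavier use would normally also involve an API key and rate limiting, which are omitted here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return an EFetch URL for one nucleotide record (no request is made).

    rettype="gb" asks for the GenBank flat file; "fasta" asks for FASTA.
    """
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb", timeout: float = 30.0) -> str:
    """Download the record as text. Requires network access."""
    with urlopen(build_efetch_url(accession, rettype), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Illustrative accession only; printing the URL does not contact NCBI.
    print(build_efetch_url("U49845"))
```

Calling `fetch_record("U49845")` would return the flat-file text of that record, assuming the service is reachable; building the URL separately keeps the request easy to inspect before anything is sent.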

GenBank supports a variety of submission types, including whole genome shotgun (WGS) assemblies, transcriptome shotgun assemblies (TSA), targeted locus studies (TLS), and high-throughput genomic sequences (HTGS). Third Party Annotation (TPA) records allow the publication of annotations based on sequences already present in GenBank. Raw sequence reads generated by next-generation sequencing technologies are deposited in the Sequence Read Archive (SRA) rather than in GenBank itself.

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and colleagues established the Los Alamos Sequence Database in 1979, which culminated in the creation of the public GenBank database in 1982. Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences had been collected. An early description of the database was published in 1985, when GenBank contained over five million bases across approximately 6,000 sequence entries.

During the late 1980s and early 1990s, responsibility for GenBank transitioned from Los Alamos National Laboratory to the newly established National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH). Contemporary GenBank release documentation from this period reflects a shift from joint contributions by LANL-based database staff and NLM-based indexing teams toward full management by NCBI, indicating a phased transfer of data curation and operational responsibilities.

Figure: GenBank and EMBL: Nucleotide Sequences 1986/1987, Volumes I to VII.

Figure: CD-ROM of GenBank v100.

Growth

Figure: Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale.
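On a semi-log plot, steady exponential growth appears as a straight line, and its slope corresponds to a doubling time. As a rough illustration rather than an official NCBI calculation, the Python sketch below estimates the average doubling time implied by two size figures quoted in this article: a little over five million bases in 1985 and 7,289,942,983,522 base pairs in April 2026. The function name `doubling_time_years` is hypothetical, the 1985 figure is rounded down to five million, and the two figures may not count exactly the same categories of records.

```python
import math


def doubling_time_years(bases_start: float, bases_end: float,
                        year_start: float, year_end: float) -> float:
    """Average doubling time, assuming constant exponential growth between two points."""
    if bases_start <= 0 or bases_end <= bases_start:
        raise ValueError("expected 0 < bases_start < bases_end")
    if year_end <= year_start:
        raise ValueError("expected year_start < year_end")
    # Number of times the database size doubled over the interval.
    doublings = math.log2(bases_end / bases_start)
    return (year_end - year_start) / doublings


# Inputs taken from the article; both are approximations (see the text above).
estimate = doubling_time_years(5_000_000, 7_289_942_983_522, 1985, 2026)
print(f"Average doubling time: {estimate:.2f} years ({estimate * 12:.0f} months)")
```

With these inputs the long-run average comes out at roughly two years. A single average of this kind hides any change in growth rate within the interval, so it says little about any particular period.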

GenBank has grown substantially since its inception. Early analyses and release notes have described this growth as approximating a doubling in the number of bases every 18 months, although growth rates have varied over time with changes in sequencing technologies and data submission practices.

Limitations

One study of sequence-based identification showed that such analyses were more discriminative when GenBank was combined with other services such as EzTaxon-e and the BIBI databases.

GenBank may contain sequences wrongly assigned to a particular species because the initial identification of the organism was wrong. A recent study showed that 75% of the mitochondrial cytochrome c oxidase subunit I sequences assigned to the fish Nemipterus mesoprion were misassigned, a result of the continued use of sequences from initially misidentified individuals. The authors provide recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names.

Numerous published manuscripts have identified erroneous sequences in GenBank. These include not only incorrect species assignments (which can have different causes) but also chimeras and accession records with sequencing errors. A recent manuscript on the quality of all cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents reassessment of the species identification.

Another problem is that sequence records are often submitted as anonymous sequences without species names (e.g. as "Pelomedusa sp. A CK-2014") because the species are either unknown or withheld until publication. However, even after the species have been identified or published, these sequence records are often not updated and may therefore cause ongoing confusion.

See also

  • International Nucleotide Sequence Database Collaboration (INSDC)
  • European Nucleotide Archive (ENA)
  • DNA Data Bank of Japan (DDBJ)
  • RefSeq — curated reference sequences derived from GenBank
  • UniProt — protein sequence and annotation database
  • Sequence analysis
  • Open science data
