Figure: Schematic representation of the three top levels of the CATH classification scheme.

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, including Janet Thornton and David Jones.

Hierarchical organization

Experimentally determined protein three-dimensional structures are obtained from the Protein Data Bank (PDB) and split into their consecutive polypeptide chains, where applicable. Protein domains are identified ("chopped") within these chains using a mixture of automatic methods and manual curation.

The domains are then classified within the CATH structural hierarchy. At the Class (C) level, domains are assigned according to their secondary structure content: all alpha, all beta, a mixture of alpha and beta, or little secondary structure. At the Architecture (A) level, the arrangement of the secondary structures in three-dimensional space is used for assignment. At the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used. Assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution.
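The four levels described above can be illustrated with a short Python sketch. It assumes the conventional notation in which a CATH classification is written as four dot-separated numbers, one per level (C.A.T.H). The `CathCode` class, the `deepest_shared_level` helper, and the example codes are hypothetical names chosen for illustration and are not part of any CATH software.

```python
from dataclasses import dataclass
from typing import Optional

# Names for the Class level, following the four categories in the text.
# The numeric keys 1-4 are an assumption made for this sketch.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


@dataclass(frozen=True)
class CathCode:
    """A four-level classification code: Class, Architecture, Topology, Homologous superfamily."""
    class_: int
    architecture: int
    topology: int
    superfamily: int

    @classmethod
    def parse(cls, text: str) -> "CathCode":
        parts = text.strip().split(".")
        if len(parts) != 4 or not all(p.isdigit() for p in parts):
            raise ValueError(f"expected four dot-separated integers, got {text!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, level: str) -> str:
        """Return the dotted code truncated at level 'C', 'A', 'T' or 'H'."""
        depth = "CATH".index(level) + 1
        values = (self.class_, self.architecture, self.topology, self.superfamily)
        return ".".join(str(v) for v in values[:depth])


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return the deepest level ('C', 'A', 'T' or 'H') at which two codes agree, or None."""
    shared = None
    for level in "CATH":
        if a.prefix(level) != b.prefix(level):
            break
        shared = level
    return shared


if __name__ == "__main__":
    first = CathCode.parse("3.40.50.300")
    second = CathCode.parse("3.40.50.720")
    print(CLASS_NAMES[first.class_])             # class name for code 3
    print(first.prefix("T"))                     # code truncated at the Topology level
    print(deepest_shared_level(first, second))   # deepest level shared by both codes
```

Two domains whose codes agree down to the T level but differ at the H level share a fold without being assigned to the same homologous superfamily, which is the distinction the hierarchy is designed to capture.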

  • The Encyclopedia of Domains (TED) applies the automated CATH methodology to 188 million unique structures from the AlphaFold Protein Structure Database, identifying nearly 365 million domains, about 100 million more than Gene3D could identify. Using structural comparison, 194 million domains were matched to the CATH database at the superfamily (H) level, with an extra 46 million matched at the topology (T) level. The remaining domains have structures entirely new to CATH.
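As a rough check on the TED figures quoted above, the following Python sketch works out how many domains are left unmatched. It uses only the rounded numbers from the text (in millions), so the results are approximate; the variable names are chosen for illustration.

```python
# Rounded figures quoted in the text, in millions of domains.
total_domains = 365           # "nearly 365 million" domains identified
matched_superfamily = 194     # matched to CATH at the H level
matched_topology_only = 46    # additionally matched at the T level

# Domains matched to CATH at either level, and those left over.
matched = matched_superfamily + matched_topology_only
unmatched = total_domains - matched


def share(part: int, whole: int = total_domains) -> float:
    """Percentage of all identified domains, rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(f"matched at H or T level: {matched} million ({share(matched)}%)")
    print(f"not matched to CATH:     {unmatched} million ({share(unmatched)}%)")
```

On these rounded inputs, about 240 million domains are matched at the H or T level, leaving roughly 125 million, about a third of the total, with no structural match in CATH.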

Releases

The CATH team releases new data both as daily snapshots and as official releases, which appear approximately annually. The latest release of CATH-Gene3D (v4.3) was published in December 2020 and consists of:

  • 500,238 structural protein domain entries
  • 151 million non-structural protein domain entries
  • 5,481 homologous superfamily entries
  • 212,872 functional family entries

Open-source software

CATH is an open-source software project whose developers build and maintain a number of open-source tools, which are publicly available on GitHub.
