The International Chemical Identifier (InChI) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. Initially developed by the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST) from 2000 to 2005, the format and algorithms are non-proprietary. Since May 2009, it has been developed by the InChI Trust, a nonprofit charity from the United Kingdom which works to implement and promote the use of InChI.

The identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application. The InChI algorithm converts input structural information into a unique InChI identifier in a three-step process: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom), and serialization (to give a string of characters).

InChIs differ from the widely used CAS registry numbers in three respects: firstly, they are freely usable and non-proprietary; secondly, they can be computed from structural information and do not have to be assigned by some organization; and thirdly, most of the information in an InChI is human readable (with practice). InChIs can thus be seen as akin to a general and extremely formalized version of IUPAC names. They can express more information than the simpler SMILES notation and, in contrast to SMILES strings, every structure has a unique InChI string, which is important in database applications. Information about the 3-dimensional coordinates of atoms is not represented in InChI; for this purpose a format such as PDB can be used.

The InChIKey, sometimes referred to as a hashed InChI, is a fixed length (27 character) condensed digital representation of the InChI that is not human-understandable. The InChIKey specification was released in September 2007 in order to facilitate web searches for chemical compounds, since these were problematic with the full-length InChI. Unlike the InChI, the InChIKey is not unique: though collisions are expected to be extremely rare, there are known collisions.

InChI was first released in 2005. A major milestone was version 1.02 of January 2009, which provided a means to generate the so-called standard InChI, a version of the InChI with a fixed level of detail and a fixed collection of layers. The standard InChIKey is the hashed version of the standard InChI string. The standard InChI simplifies comparison of InChI strings and keys generated by different groups and subsequently accessed via diverse sources such as databases and web resources. Since version 1.07.1 (August 2024), the software has used the MIT license and may be downloaded from the InChI GitHub site. Besides the implementations in molecule editors, stand-alone executables have been packaged for multiple Linux distributions, including Debian.

Generation

In order to avoid generating different InChIs for tautomeric structures, before generating the InChI, an input chemical structure is normalized to reduce it to its so-called core parent structure. This may involve changing bond orders, rearranging formal charges and possibly adding and removing protons. Different input structures may give the same result; for example, acetic acid and acetate would both give the same core parent structure, that of acetic acid. A core parent structure may be disconnected, consisting of more than one component, in which case the sublayers in the InChI usually consist of sublayers for each component, separated by semicolons (periods for the chemical formula sublayer). One way this can happen is that all metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead will have five components, one for lead and four for the ethyl groups.

  1. Main layer (always present)
     • Chemical formula (no prefix). This is the only sublayer that must occur in every InChI. Numbers used throughout the InChI are given in the formula's element order excluding hydrogen atoms. For example, <code>/C10H16N5O13P3</code> implies that atoms numbered 1–10 are carbons, 11–15 are nitrogens, 16–28 are oxygens, and 29–31 are phosphorus.
     • Atom connections (<code>/c</code>). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones. The type of those bonds is later specified in the stereochemical layer (<code>/b</code>).
     • Hydrogen atoms (<code>/h</code>). Describes how many hydrogen atoms are connected to each of the other atoms.
  2. Charge layer
     • Charge sublayer (<code>/q</code>).
     • Proton sublayer (<code>/p</code> for protons).
  3. Stereochemical layer
     • Double bonds and cumulenes (<code>/b</code>).
     • Tetrahedral stereochemistry of atoms and allenes. First <code>/t</code> describes the relative configuration, which implies a preference for one of the mirror forms. Then <code>/m</code> is used to choose whether to mirror the molecule described by <code>/t</code>, if an absolute configuration is requested.
     • Type of stereochemistry information (<code>/s</code>): <code>/s1</code> for absolute, <code>/s2</code> for relative (unspecified mix of chiralities), <code>/s3</code> for racemic (equal mix of both chiralities).
  4. Isotopic layer (<code>/i</code>), may include sublayers:
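
The layer prefixes and the atom-numbering rule listed above can be illustrated with a short Python sketch. It is a minimal illustration, not the reference parser: the helper names <code>split_layers</code> and <code>atom_numbering</code> are hypothetical, the code assumes a single-component InChI whose first segment after the version is the chemical formula, and it does not handle multi-component formulas (periods or multipliers).

```python
import re

def split_layers(inchi):
    """Split an InChI into (version, formula, {prefix: [bodies]}).

    Assumes a single-component InChI whose first segment is the formula.
    A prefix can occur more than once (e.g. /h again after /i), so each
    prefix maps to a list of bodies in order of appearance.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len("InChI="):].split("/")
    formula = segments[0] if segments else ""
    layers = {}
    for segment in segments[1:]:
        layers.setdefault(segment[0], []).append(segment[1:])
    return version, formula, layers

def atom_numbering(formula):
    """Map each non-hydrogen element to its (first, last) atom number."""
    ranges, start = {}, 1
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "H":          # hydrogens are not numbered
            continue
        n = int(count or 1)
        ranges[element] = (start, start + n - 1)
        start += n
    return ranges

version, formula, layers = split_layers("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)")
print(version, formula, layers)
print(atom_numbering("C10H16N5O13P3"))
```

For the formula <code>C10H16N5O13P3</code> the second call reproduces the ranges given in the list: carbons 1–10, nitrogens 11–15, oxygens 16–28 and phosphorus 29–31.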

The InChIKey currently consists of three parts separated by hyphens, of 14, 10 and one character(s), respectively, like <code>xxxxxxxxxxxxxx-yyyyyyyyfv-p</code>.
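
The three-part layout can be checked mechanically. The following Python sketch validates the 14-10-1 shape and splits the second block into its hash letters and the two trailing characters (<code>f</code> and <code>v</code> in the pattern above). The function name <code>parse_inchikey</code> is hypothetical, and reading <code>S</code> in the flag position as "standard" is an interpretation based on the <code>SA</code>/<code>NA</code> pairs shown in the examples rather than a full implementation of the specification.

```python
import re

# 14 letters, then 8 letters + flag + version character, then 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")

def parse_inchikey(key):
    """Split an InChIKey into its parts or raise ValueError."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    major, minor, flag, version, last = match.groups()
    return {
        "major_hash": major,        # first block, 14 characters
        "minor_hash": minor,        # first 8 characters of second block
        "flag": flag,               # 'f' in the pattern
        "version": version,         # 'v' in the pattern
        "last": last,               # 'p' in the pattern
        "standard": flag == "S",    # assumption: S marks a standard key
    }

print(parse_inchikey("BQJCRHHNABKAKU-KBQPJGBKSA-N"))
```

A check of this kind only confirms that a string has the right shape; it cannot tell whether the key corresponds to any real InChI.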

| style=font-family:monospace |

  • BQJCRHHNABKAKU-KBQPJGBKSA-N
  • BQJCRHHNABKAKU-KBQPJGBKNA-N

| Standard.

|-

| H[<sup>2</sup>H]O

| Semiheavy water

| style=font-family:monospace | InChI=1S/H2O/h1H2/i/hD

|

  • XLYOFNOQVPJJNP-DYCDLGHISA-N
  • XLYOFNOQVPJJNP-DYCDLGHINA-N

| Isotopic information is part of the standard.

|-

| [<sup>2</sup>H]<sub>2</sub>O

| Heavy water

| style=font-family:monospace | InChI=1S/H2O/h1H2/i/hD2

|

  • XLYOFNOQVPJJNP-ZSJDYOACSA-N
  • XLYOFNOQVPJJNP-ZSJDYOACNA-N

| <code>D2</code> for two deuteriums.

|-

| [<sup>3</sup>H]<sub>2</sub>O

| Superheavy water

| style=font-family:monospace | InChI=1S/H2O/h1H2/i/hT2

|

  • XLYOFNOQVPJJNP-PWCQTSIFSA-N
  • XLYOFNOQVPJJNP-PWCQTSIFNA-N

| <code>T</code> for tritium.

|-

| H<sub>2</sub>[<sup>18</sup>O]

| Heavy-oxygen water

| style=font-family:monospace | InChI=1S/H2O/h1H2/i1+2

|

  • XLYOFNOQVPJJNP-NJFSPNSNSA-N
  • XLYOFNOQVPJJNP-NJFSPNSNNA-N

| <code>/i1+2</code> means that atom number 1 is an isotope whose mass number is 2 greater than that of the normal one (oxygen-16).

|}

Base 26 encoding

InChIKey uses a base 26 encoding to represent (parts of) SHA-256 hashes. The input is chopped into 14-bit segments, each of which corresponds to three letters (a triplet). A remaining group of up to 9 bits corresponds to two characters (a doublet). In InChIKey, inputs can only be of two lengths: 65 bits for the "major" hash (divided into 14 × 4 + 9 bits for 3 × 4 + 2 = 14 characters) and 37 bits for the "minor" hash (14 × 2 + 9 bits for 3 × 2 + 2 = 8 characters). A few additional lengths are used in RInChI:

  • 28 (14 × 2) bits yield a 6-character hash; only the truncated 4-character form is used.
  • 56 (14 × 4) bits yield a 12-character hash, the truncated form being 10 characters.
  • 78 (65 + 14 - 1) bits yield a 17-character hash, with one bit used twice.

The first 80 bits of the SHA-256 hash of an empty string are <code>e3 b0 c4 42 98 fc 1c 14 9a fb</code>. This results in the following base 26 strings for this hash: <code>UHFF</code>, <code>UHFFFAOY</code>, <code>UHFFFADPSC</code>, <code>UHFFFADPSCTJ</code>, <code>UHFFFADPSCTJAU</code>, <code>UHFFFADPSCTJAUYIS</code>.
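
The empty-string example can be reproduced with a short Python sketch. It rests on three assumptions that are not stated in the text above and should be checked against the InChI reference source: the hash bytes are read as a little-endian bit stream (bit 0 is the lowest bit of the first byte); the triplet table skips all triplets beginning with <code>E</code> and those from <code>TAA</code> to <code>TTV</code>, which leaves exactly 2<sup>14</sup> triplets; and the doublet table is simply the first 512 plain base 26 pairs. The function names are hypothetical. Under these assumptions the sketch yields the six strings listed above.

```python
import hashlib
import string

LETTERS = string.ascii_uppercase

def bits(digest, start, width):
    """Return `width` bits starting at bit `start` (little-endian stream)."""
    number = int.from_bytes(digest, "little")
    return (number >> start) & ((1 << width) - 1)

def triplet(value):
    """Map a 14-bit value to three letters (assumed skip table)."""
    if value >= 4 * 676:        # assumption: skip triplets starting with E
        value += 676
    if value >= 19 * 676:       # assumption: skip TAA..TTV (516 triplets)
        value += 516
    first, rest = divmod(value, 676)
    second, third = divmod(rest, 26)
    return LETTERS[first] + LETTERS[second] + LETTERS[third]

def doublet(value):
    """Map a 9-bit value to two letters (assumed plain base 26)."""
    return LETTERS[value // 26] + LETTERS[value % 26]

def base26(digest, nbits):
    """Encode the first `nbits` bits (28, 37, 56, 65 or 78) of a digest."""
    if nbits == 78:             # 65 bits, then a triplet reusing bit 64
        return base26(digest, 65) + triplet(bits(digest, 64, 14))
    out, pos = [], 0
    while nbits - pos >= 14:
        out.append(triplet(bits(digest, pos, 14)))
        pos += 14
    if nbits - pos:             # the remaining 9 bits form a doublet
        out.append(doublet(bits(digest, pos, 9)))
    return "".join(out)

digest = hashlib.sha256(b"").digest()
for nbits in (28, 37, 56, 65, 78):
    print(nbits, base26(digest, nbits))
```

The truncated forms mentioned in the list are obtained by slicing, for example <code>base26(digest, 28)[:4]</code> and <code>base26(digest, 56)[:10]</code>.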

AuxInfo

The auxiliary information (<code>AuxInfo</code>) string is produced by InChI software alongside the InChI string. For example, the (±)-borneol <code>/s2</code> example produces:

AuxInfo=1/0/N:1,2,3,4,5,6,7,8,9,10,11/E:(1,2)/rA:13cCCCCCCCCCCOHH/rB:;;;s4;;s4s6;s6;s1s2s7;n3s5s8s9;P8;P7;s8;/rC:2.0857,-1.1788,0;3.0905,.273,0;2.6864,-1.7772,0;4.5619,-2.283,0;3.6719,-2.2295,0;5.2528,-.9411,0;4.5862,-1.4963,0;4.4381,-.864,0;3.0628,-.7814,0;3.6539,-1.3571,0;3.6343,-.1809,0;5.5343,-1.9585,0;4.8482,.1078,0;

"AuxInfo contains, in particular, atom non-stereo equivalence information, mapping input atom positions to output positions, and 'reversibility' information for re-drawing the structure." The reversibility information can be used to regenerate the source structure (such as a MOLFILE with 2D or 3D coordinates) without needing an InChI. The InChI user guide describes the format in detail. The parts seen here are:

  • <code>1/0</code> refers to InChI version 1, normalization type 0.
  • <code>/N:</code> maps InChI's atom numbering to the input's atom numbering.
  • <code>/E:</code> describes the equivalence between atoms.
  • <code>/rA:</code> describes reversibility information for atoms.
  • <code>/rB:</code> describes reversibility information for bonds.
  • <code>/rC:</code> describes reversibility information for coordinates. Here 2D coordinates are used; a more realistic depiction for this molecule would be 3D.

The full complement of tags is: <code>1/0/N/E/gE/it/iN/I/E/gE/it/iN/CRV/rA/rB/rC</code>.
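
The tag structure described above can be explored with a small Python sketch that splits the borneol AuxInfo string into its tags and reads the coordinates from <code>/rC:</code>. This is a minimal illustration under the assumption that no tag body contains a slash, which holds for this example but is not guaranteed in general; the function names <code>parse_auxinfo</code> and <code>coordinates</code> are hypothetical.

```python
AUXINFO = (
    "AuxInfo=1/0/N:1,2,3,4,5,6,7,8,9,10,11/E:(1,2)"
    "/rA:13cCCCCCCCCCCOHH"
    "/rB:;;;s4;;s4s6;s6;s1s2s7;n3s5s8s9;P8;P7;s8;"
    "/rC:2.0857,-1.1788,0;3.0905,.273,0;2.6864,-1.7772,0;4.5619,-2.283,0;"
    "3.6719,-2.2295,0;5.2528,-.9411,0;4.5862,-1.4963,0;4.4381,-.864,0;"
    "3.0628,-.7814,0;3.6539,-1.3571,0;3.6343,-.1809,0;5.5343,-1.9585,0;"
    "4.8482,.1078,0;"
)

def parse_auxinfo(text):
    """Return (version, normalization_type, {tag: [bodies]})."""
    if not text.startswith("AuxInfo="):
        raise ValueError("not an AuxInfo string")
    version, norm_type, *fields = text[len("AuxInfo="):].split("/")
    tags = {}
    for field in fields:
        tag, _, body = field.partition(":")
        # Some tags (E, gE, it, iN) may occur twice, so keep lists.
        tags.setdefault(tag, []).append(body)
    return version, norm_type, tags

def coordinates(rc_body):
    """Parse an /rC: body into a list of (x, y, z) float tuples."""
    return [
        tuple(float(value) for value in atom.split(","))
        for atom in rc_body.split(";")
        if atom                     # drop the empty piece after the last ';'
    ]

version, norm_type, tags = parse_auxinfo(AUXINFO)
# /N: lists, for each InChI atom number in order, the input atom number.
numbering = {i: int(n) for i, n in enumerate(tags["N"][0].split(","), start=1)}
print(version, norm_type, numbering)
print(coordinates(tags["rC"][0])[:2])
```

In this example the <code>/N:</code> mapping covers the 11 non-hydrogen atoms, while <code>/rC:</code> holds 13 coordinate triples, matching the atom count at the start of <code>/rA:</code>.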

Derived formats

RInChI

RInChI (Reaction InChI, International chemical identifier for reactions) is a standard method for using InChI to describe chemical reactions. An RInChI string consists of several sets of InChI strings for the reactants, products, and agents as well as information required to tag them as such. Example string and breakdown:

{|class=wikitable

|+Example RInChI

!Part !! Layer # !! Description

|-

| style=font-family:monospace | RInChI=1.00.1S/

|1

|Version of RInChI (1.00), version of InChI used within (1S, version 1 standard)

|-

| style=font-family:monospace | C2H4O2/c1-2(3)4/h1H3,(H,3,4)!C2H6O/c1-2-3/h3H,2H2,1H3&lt;&gt;

|2

| Left side of reaction (acetic acid and ethanol), version 1 standard InChI without the <code>InChI=1S/</code> header separated by <code>!</code>

|-

| style=font-family:monospace | C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3!H2O/h1H2&lt;&gt;

|3

| Right side of reaction (ethyl acetate and water), same format

|-

| style=font-family:monospace | H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)/

| 4

| Agents (sulfuric acid), same format

|-

| style=font-family:monospace | d=

|5

|Direction of reaction (<code>d</code>). <code>d=</code> means equilibrium, <code>d+</code> means left to right, <code>d-</code> means right to left.

|}

As shown above, layers that do not involve InChI parts are separated with <code>/</code> as in InChI. Layers that do are separated with <code>&lt;&gt;</code>. Multiple InChI parts are separated with <code>!</code>.
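
The separator rules can be applied in reverse to take the example apart. The Python sketch below joins the table's parts into one string and splits it back into full InChI strings. It is only an illustration of the layout shown in the table: it assumes that both sides and the agents are present and that the direction layer comes last, and it does not cover other RInChI layers. The function name <code>parse_rinchi</code> is hypothetical.

```python
RINCHI = (
    "RInChI=1.00.1S/"
    "C2H4O2/c1-2(3)4/h1H3,(H,3,4)!C2H6O/c1-2-3/h3H,2H2,1H3<>"
    "C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3!H2O/h1H2<>"
    "H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)/d="
)

def parse_rinchi(text):
    """Split an RInChI of the layout shown in the table into its parts."""
    if not text.startswith("RInChI="):
        raise ValueError("not an RInChI string")
    version, _, rest = text[len("RInChI="):].partition("/")
    # The direction layer, if present, is the trailing "/d" segment.
    body, separator, direction = rest.rpartition("/d")
    if not separator:
        body, direction = rest, ""
    groups = body.split("<>")
    if len(groups) != 3:
        raise ValueError("expected left side, right side and agents")
    # "1.00.1S" -> "1S": the InChI version used inside the RInChI.
    header = "InChI=" + version.rsplit(".", 1)[-1] + "/"
    left, right, agents = (
        [header + part for part in group.split("!")] for group in groups
    )
    return {
        "version": version,
        "left": left,
        "right": right,
        "agents": agents,
        "direction": direction,     # "=", "+" or "-"
    }

for key, value in parse_rinchi(RINCHI).items():
    print(key, value)
```

Each recovered component is a complete standard InChI, because the <code>InChI=1S/</code> header that RInChI omits is restored from the version layer.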

An example of a relatively complex (nested) Mixfile is provided below.

<syntaxhighlight lang=json>
{
  "mixfileVersion": 1,
  "name": "37% wt. Formaldehyde in Water with 10-15% Methanol",
  "contents": [
    {
      "contents": [
        {
          "name": "formaldehyde",
          "quantity": 37,
          "units": "w/w%",
          "inchi": "InChI=1S/CH2O/c1-2/h1H2"
        },
        {
          "name": "water",
          "inchi": "InChI=1S/H2O/h1H2"
        }
      ]
    },
    {
      "name": "methanol",
      "quantity": [10, 15],
      "units": "%",
      "inchi": "InChI=1S/CH4O/c1-2/h2H,1H3"
    }
  ]
}
</syntaxhighlight>

The corresponding MInChI is: <code>MInChI=0.00.1S/CH2O/c1-2/h1H2&CH4O/c1-2/h2H,1H3&H2O/h1H2/n