Digraphs and trigraphs (programming)

In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and, according to a programming language's specification, should be treated as if they were single characters.

Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as <code>{</code> and <code>}</code>.

History

The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support one or more of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any national version of the ISO 646 character set.

With the widespread adoption of ASCII and Unicode/UTF-8, trigraph use is limited today, and trigraph support has been removed from C as of C23.

Implementations

Trigraphs are not commonly encountered outside compiler test suites. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation <code>"...?""?..."</code> or an escape sequence <code>"...?\?..."</code>.
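Both idioms can be illustrated with a short C sketch. The function names and the sample text are assumptions chosen for illustration only.

```c
/* Hypothetical helpers: each returns the text Really followed by
   two question marks and an exclamation mark. */

/* Adjacent literals are joined only after trigraph replacement has run,
   so the two question marks never sit side by side in the source. */
const char *by_concatenation(void)
{
    return "Really?" "?!";
}

/* The escape sequence \? denotes a plain question mark. */
const char *by_escape(void)
{
    return "Really?\?!";
}
```

Written directly, the same text would contain the <code>??!</code> trigraph, which a compiler with trigraph support replaces with <code>|</code>.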

<code>???</code> is not itself a trigraph sequence, but when followed by a character such as <code>-</code> it will be interpreted as <code>?</code> + <code>??-</code>, which becomes <code>?~</code>.
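This left-to-right behaviour can be modelled with a small replacement routine. The following is a minimal sketch of the replacement step only, not taken from any real compiler; the names <code>trigraph_value</code> and <code>replace_trigraphs</code> are hypothetical, and the mapping is the set of nine trigraphs defined by standard C.

```c
/* Maps the third character of a trigraph to its replacement,
   or returns 0 if the character does not complete a trigraph. */
static char trigraph_value(char c)
{
    switch (c) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copies src to dst, replacing trigraphs from left to right.
   dst must have room for strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    while (*src != '\0') {
        char r = 0;
        /* src[1] and src[2] are read only when the preceding
           character was a question mark, so the terminator is
           never passed. */
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_value(src[2]);
        if (r != 0) {
            *dst++ = r;
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Applied to the four characters <code>???-</code>, the routine copies the first question mark unchanged, because the three characters starting there do not form a trigraph, and then replaces <code>??-</code>, giving <code>?~</code>.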

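The <code>??/</code> trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example, a <code>//</code> comment that ends in <code>??/</code> absorbs the following source line, and the two together form a single logical comment line (<code>//</code> comments are used in C++ and C99). Similarly, a block comment in which the <code>/*</code> opener and the <code>*/</code> closer are each split by <code>??/</code> followed by a newline is still a correctly formed block comment.

The same effect can be used to check whether trigraphs are processed: in a C99 function where a <code>//</code> comment ending in <code>??/</code> is followed by two return statements on separate lines, only one return statement will be executed.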
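A minimal sketch of such a check is shown below. The function name <code>trigraphs_available</code> is an assumption, and the result depends entirely on how the compiler is configured.

```c
/* Hypothetical check: returns 1 if the compiler replaced trigraphs,
   otherwise 0. Needs C99 or later for the line comment. */
int trigraphs_available(void)
{
    // are trigraphs available??/
    return 0;
    return 1;
}
```

When trigraphs are replaced, the comment line ends in an escaped newline, so <code>return 0;</code> becomes part of the comment and the function returns 1. Otherwise the comment ends normally and the function returns 0. In GCC, for instance, trigraph replacement is off in the default GNU dialect modes and can be turned on with the <code>-trigraphs</code> option.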

{|class="wikitable floatright" style="margin-left: 1.5em;"
|+ Alternative digraphs introduced in the C standard in 1994
|-
! Digraph !! Equivalent
|-
| <code><:</code> || <code>[</code>
|-
| <code>:></code> || <code>]</code>
|-
| <code><%</code> || <code>{</code>
|-
| <code>%></code> || <code>}</code>
|-
| <code>%:</code> || <code>#</code>
|}

In 1994, a normative amendment to the C standard (known as C95 and later incorporated into C99) supplied digraphs as more readable alternatives to five of the trigraphs.

Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token <code>%:%:</code> replacing the preprocessor concatenation token <code>##</code>. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.
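The following sketch shows digraphs standing in for their punctuator equivalents, and the same character sequences left untouched inside a string literal. The function and macro names are hypothetical, and the code assumes a compiler that accepts digraphs (C95 or later; the loop declaration needs C99).

```c
/* Hypothetical example written with digraphs. */
%:define COUNT 3

/* Sums a small array; the brackets and braces are spelled as digraphs. */
int sum_values(void)
<%
    int v<:COUNT:> = <% 1, 2, 3 %>;
    int total = 0;
    for (int i = 0; i < COUNT; i++)
        total += v<:i:>;
    return total;
%>

/* Inside a string literal the sequences are ordinary characters,
   so this returns a four-character string. */
const char *bracket_text(void)
<%
    return "<:%>";
%>
```

Here <code>sum_values()</code> returns 6, while <code>bracket_text()</code> returns the four characters <code><:%></code> rather than <code>[}</code>, because the digraph sequences inside the literal are part of a larger token.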

C++

C++ (through C++14) behaves like C, including the C99 additions. Trigraphs were removed from the language in C++17, while digraphs remain.
