ISO/IEC 2022 Information technology – Character code structure and extension techniques is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35 and the ANSI standard ANSI X3.41.
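ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes (0x00–1F and 0x7F–9F) to be used for non-printing control codes. For the graphical and control character sets combined within that structure, the standard specifies: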
- A format for encoding these sets, assuming that 8 bits are available per byte,
- A format for encoding these sets in the same encoding system when only 7 bits are available per byte, and a method for transforming any conformant character data to pass through such a 7-bit environment,
- The general structure of ANSI escape codes, and
- Specific escape code formats for identifying individual character sets, for announcing the use of particular encoding features or subsets, and for interacting with or switching to other encoding systems.
7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP (or JIS encoding), which has primarily been used in Japanese-language e-mail. 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873 (ECMA-43), which is in turn conformed to by ISO/IEC 8859. More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.
The escape sequences declare not only which character set is being used, but also whether the set is single-byte or multi-byte (although not how many bytes it uses if it is multi-byte), and whether each byte has 94 or 96 permitted values.
Code structure
Notation and nomenclature
ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.
Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash. Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself. They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters, although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
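The column-line notation described above is a direct split of a byte into its two hexadecimal digits, each written in decimal. A minimal sketch of the conversion follows; the function names `to_column_line` and `from_column_line` are illustrative and do not come from the standard.

```python
def to_column_line(byte: int) -> str:
    """Format a byte value in column/line notation, e.g. 0x1B -> '1/11'."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    # The high nibble is the column, the low nibble is the line.
    return f"{byte >> 4}/{byte & 0x0F}"


def from_column_line(text: str) -> int:
    """Parse column/line notation such as '2/15' (or '02/15') into a byte."""
    column, line = (int(part) for part in text.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must each be in 0..15")
    return (column << 4) | line
```

For instance, `to_column_line(0x2F)` gives `'2/15'`, the last code of column 02.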
Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes (with "GL" standing for "graphics left") while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ("graphics right"). The terms "CL" (0x00–0x1F) and "CR" (0x80–0x9F) are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas the CR range always either invokes the secondary (C1) controls or is unused. The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are fixed coded characters and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.
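As a small illustration of the four regions named above, the following sketch classifies a byte value using the ranges given in the text. The function name `region` is hypothetical.

```python
def region(byte: int) -> str:
    """Return the code-table region ('CL', 'GL', 'CR' or 'GR') of a byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    if byte <= 0x1F:
        return "CL"  # primary (C0) controls
    if byte <= 0x7F:
        return "GL"  # graphics left, including SP (0x20) and DEL (0x7F)
    if byte <= 0x9F:
        return "CR"  # secondary (C1) controls, 8-bit environments only
    return "GR"      # graphics right, 8-bit environments only
```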
General syntax of escape sequences
Sequences using the ESC (escape) character take the form <code>ESC [I...] F</code>, where the ESC character is followed by zero or more intermediate bytes (I) from the range 0x20–0x2F, and one final byte (F) from the range 0x30–0x7E.
The first intermediate byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, final bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.
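The syntax above can be sketched as a small parser that splits an escape sequence into its intermediate and final bytes and labels its type. The type labels used here (Fp, Fe, Fs for sequences without intermediate bytes, and nFp / nFt otherwise, with n taken from the low four bits of the first intermediate byte) follow the naming used in ECMA-35 as I understand it; the function name `parse_escape` is an assumption for illustration.

```python
ESC = 0x1B


def parse_escape(data: bytes, start: int = 0):
    """Parse one escape sequence beginning at data[start].

    Returns (intermediate_bytes, final_byte, type_label, next_index).
    """
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESC at the given position")
    i = start + 1
    # Zero or more intermediate bytes from 0x20-0x2F.
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # Exactly one final byte from 0x30-0x7E.
    if i >= len(data) or not 0x30 <= data[i] <= 0x7E:
        raise ValueError("incomplete or malformed escape sequence")
    intermediates = data[start + 1:i]
    final = data[i]
    if not intermediates:
        if final <= 0x3F:
            kind = "Fp"   # private use
        elif final <= 0x5F:
            kind = "Fe"   # C1 control function
        else:
            kind = "Fs"   # standardised single control function
    else:
        n = intermediates[0] & 0x0F
        kind = f"{n}F" + ("p" if final <= 0x3F else "t")
    return intermediates, final, kind, i + 1
```

Final bytes in 0x30–0x3F are reported as private use in every type, matching the reservation described above.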
Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function "Control Sequence Introducer" (CSI), which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in the range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence".
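The byte ranges of a control sequence given above map naturally onto a regular expression. The sketch below assumes that the introducer is either the 7-bit escape sequence `ESC [` or the 8-bit C1 byte 0x9B, which is how ISO 6429 codes CSI; the names `CONTROL_SEQUENCE` and `parse_control_sequence` are illustrative.

```python
import re

# CSI, then parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and a single final byte (0x40-0x7E).
CONTROL_SEQUENCE = re.compile(
    rb"(?:\x1b\[|\x9b)([\x30-\x3f]*)([\x20-\x2f]*)([\x40-\x7e])"
)


def parse_control_sequence(data: bytes):
    """Return (parameters, intermediates, final, length) or None."""
    match = CONTROL_SEQUENCE.match(data)
    if match is None:
        return None
    parameters, intermediates, final = match.groups()
    return parameters, intermediates, final, match.end()
```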
Graphical character sets
Each of the four working sets G0 through G3 may be a 94-character set or a 94<sup>n</sup>-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96<sup>n</sup>-character set.
In a 96- or 96<sup>n</sup>-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94<sup>n</sup>-character set, the bytes 0x20 and 0x7F are not used. Sets registered as 96-sets include the box drawing set from ISO/IEC 10367 and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT).
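The difference between 94- and 96-character sets comes down to whether the first and last positions of the region are usable. A minimal sketch, with the hypothetical helper `usable_bytes`, assuming that a GR-invoked 94-set omits 0xA0 and 0xFF in the same way that a GL-invoked one omits 0x20 and 0x7F:

```python
def usable_bytes(set_size: int, region: str) -> range:
    """Byte values that each byte of a character may take.

    set_size is 94 or 96; region is 'GL' or 'GR'.
    """
    if set_size not in (94, 96):
        raise ValueError("set_size must be 94 or 96")
    base = {"GL": 0x20, "GR": 0xA0}[region]
    if set_size == 96:
        return range(base, base + 0x60)      # 0x20-0x7F or 0xA0-0xFF
    return range(base + 1, base + 0x5F)      # 0x21-0x7E or 0xA1-0xFE
```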
Combining characters
Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question. ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC).
Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43 and by ISO/IEC 8859, on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.
Control character sets
Control character sets are classified as "primary" or "secondary" control code sets, respectively also called "C0" and "C1" control code sets.
A C0 control set must contain the ESC (escape) control character at 0x1B (a C0 set containing only ESC is registered as ISO-IR-104), whereas a C1 control set may not contain the escape control whatsoever. Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set. Inclusion in the C0 set of transmission control characters other than the ten included in the ISO 6429 / ECMA-48 C0 set, or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.
A C0 control set is invoked over the CL range 0x00 through 0x1F, whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in a 7-bit or 8-bit environment). For example, ISO/IEC 4873 specifies CR bytes for the C1 controls which it uses (SS2 and SS3). If necessary, which invocation is used may be communicated using announcer sequences.
When escape sequences are used, single control functions from the C1 control code set are invoked using "type Fe" escape sequences, in which ESC is followed by a single byte in the range 0x40–0x5F.
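The two ways of invoking a C1 control function correspond byte for byte: the 8-bit form 0x80–0x9F maps onto ESC followed by 0x40–0x5F. A short sketch of that correspondence follows; the function names are illustrative, and the fixed offset of 0x40 is the conventional mapping rather than something stated in the text above.

```python
ESC = 0x1B


def c1_to_escape(byte: int) -> bytes:
    """Convert an 8-bit C1 control (0x80-0x9F) to its type Fe escape sequence."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a CR-range byte")
    return bytes([ESC, byte - 0x40])


def escape_to_c1(sequence: bytes) -> int:
    """Convert a type Fe escape sequence back to the 8-bit C1 control byte."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return sequence[1] + 0x40
```

For example, SS2 at 0x8E becomes `ESC N` and SS3 at 0x8F becomes `ESC O`.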
Other control functions
Additional control functions are assigned to "type Fs" escape sequences (in the range <code>ESC 0x60 (`)</code> through <code>ESC 0x7E (~)</code>); these have permanently assigned meanings rather than depending on the C0 or C1 designations. Registration of control functions to type "Fs" sequences must be approved by ISO/IEC JTC 1/SC 2. Some of these are specified in ECMA-35 (ISO 2022 / ANSI X3.41), others in ECMA-48 (ISO 6429 / ANSI X3.64). ECMA-48 refers to these as "independent control functions". Other single control functions may be registered to type "3Ft" escape sequences, although no "3Ft" sequences are currently assigned (as of 2019).
{| class="wikitable"
|-
! Code !! Hex !! Abbr. !! Name !! Effect
|-
| <code>ESC d</code> || <code>1B 64</code> || CMD || Coding method delimiter || Used when interacting with an outer coding / representation system, see below.
|-
| <code>ESC n</code> || <code>1B 6E</code> || LS2 || Locking shift two || Shift function, see below.
|-
| <code>ESC o</code> || <code>1B 6F</code> || LS3 || Locking shift three || Shift function, see below.
|-
| <code>ESC |</code> || <code>1B 7C</code> || LS3R || Locking shift three right || Shift function, see below.
|-
| <code>ESC }</code> || <code>1B 7D</code> || LS2R || Locking shift two right || Shift function, see below.
|-
| <code>ESC ~</code> || <code>1B 7E</code> || LS1R || Locking shift one right || Shift function, see below.
|}
Escape sequences of type "Fp" (<code>ESC 0x30 (0)</code> through <code>ESC 0x3F (?)</code>) or of type "3Fp" (<code>ESC 0x23 (#) [...] 0x30 (0)</code> through <code>ESC 0x23 (#) [...] 0x3F (?)</code>) are reserved for single private use control codes, by prior agreement between parties. Several such sequences of both types are used by DEC terminals such as the VT100, and are thus supported by terminal emulators.
An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using Shift In and Shift Out to switch between the sets (e.g. JIS X 0201), although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set (e.g. T.51).
The codes shown in the table below are the most common encodings of these control codes, conforming to ISO/IEC 6429. The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below. A single-byte coding of the single shifts within the C0 set is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3, and may also be used for SS2 only, although older code sets with SS2 at 0x1C also exist, and were mentioned as such in an earlier edition of the standard. The 0x8E and 0x8F coding of the single shifts is mandatory for ISO/IEC 4873 levels 2 and 3.
{| class="wikitable"
|-
! Code !! Hex !! Abbr. !! Name !! Effect
|- id="SI"
| <code>SI</code> || <code>0F</code> || SI<br>LS0 || Shift In<br>Locking shift zero || GL encodes G0 from now on
|- id="SO"
| <code>SO</code> || <code>0E</code> || SO<br>LS1 || Shift Out<br>Locking shift one || GL encodes G1 from now on
|- id="LS2R"
| <code>ESC }</code> || <code>1B 7D</code> || LS2R || Locking shift two right || GR encodes G2 from now on
|}
The region over which single shifts invoke (the single-shift area) may be either GL or GR, depending on the code version. For instance, ISO/IEC 4873 specifies GL, whereas packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area. If necessary, which single-shift area is used may be communicated using announcer sequences.
The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.
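The interaction between locking shifts and single shifts can be sketched as a small state tracker. This is an illustration under stated assumptions rather than a full decoder: the class name `ShiftState` is hypothetical, the initial state (G0 over GL, G1 over GR) is assumed, and the effects of LS2, LS3, LS1R and LS3R are inferred from their names by analogy with the documented effects of SI, SO and LS2R. The escape-sequence forms `ESC N` and `ESC O` are the 7-bit equivalents of SS2 (0x8E) and SS3 (0x8F).

```python
class ShiftState:
    """Track which working set (0-3) is invoked over GL and GR."""

    # Locking shifts: control bytes -> (region attribute, working set).
    LOCKING = {
        b"\x0f": ("gl", 0),   # SI / LS0
        b"\x0e": ("gl", 1),   # SO / LS1
        b"\x1bn": ("gl", 2),  # LS2
        b"\x1bo": ("gl", 3),  # LS3
        b"\x1b~": ("gr", 1),  # LS1R
        b"\x1b}": ("gr", 2),  # LS2R
        b"\x1b|": ("gr", 3),  # LS3R
    }
    # Single shifts affect only the next character.
    SINGLE = {b"\x8e": 2, b"\x8f": 3, b"\x1bN": 2, b"\x1bO": 3}

    def __init__(self):
        self.gl = 0                 # assumed initial state: G0 over GL
        self.gr = 1                 # assumed initial state: G1 over GR
        self.pending_single = None  # set by SS2 / SS3

    def apply(self, control: bytes) -> None:
        """Apply one shift function given as its coded representation."""
        if control in self.LOCKING:
            region, working_set = self.LOCKING[control]
            setattr(self, region, working_set)
        elif control in self.SINGLE:
            self.pending_single = self.SINGLE[control]
        else:
            raise ValueError("not a shift function")

    def working_set_for(self, byte: int) -> int:
        """Return the working set that interprets the next graphic byte."""
        if self.pending_single is not None:
            working_set, self.pending_single = self.pending_single, None
            return working_set
        return self.gl if byte < 0x80 else self.gr
```

A single shift is consumed by the next character, after which the locking-shift state applies again.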
Registration of graphical and control code sets
The ISO International register of coded character sets to be used with escape sequences (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence, and a unique registry entry number to identify it. For example, the CCITT character set for Simplified Chinese is known as ISO-IR-165.
Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the Universal Coded Character Set.
ISO-IR registered escape sequences are also used encapsulated in a Formal Public Identifier to identify character sets used for numeric character references in SGML (ISO 8879). For example, such an identifier can be used to identify the International Reference Version of ISO 646-1983, and the HTML 4.01 specification uses one to identify Unicode. The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.
No multibyte 96-sets have been registered, so the designation sequences for them are strictly theoretical.
As with other escape sequence types, the range 0x30–0x3F is reserved for private-use final bytes, in this case for private-use character set definitions such as those of MARC-8. However, in a graphical set designation sequence, if the second intermediate byte (for a single-byte set) or the third intermediate byte (for a double-byte set) is 0x20 (space), the set denoted is a "dynamically redefinable character set" (DRCS) defined by prior agreement, which is also considered private use. The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with byte 0x40 (<code>@</code>); however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.
There are also three special cases for multi-byte codes. The code sequences <code>ESC $ @</code>, <code>ESC $ A</code>, and <code>ESC $ B</code> were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences <code>ESC $ ( @</code> through <code>ESC $ ( B</code> to designate to the G0 character set.
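A designation sequence, including the three legacy forms just described, can be sketched as a small parser. The mapping of intermediate bytes to working sets used here (`(`, `)`, `*`, `+` for 94-sets in G0 to G3; `-`, `.`, `/` for 96-sets in G1 to G3; a leading `$` for multi-byte sets) is my reading of ECMA-35, and the function name `parse_designation` is hypothetical.

```python
ESC = 0x1B
SETS_94 = {0x28: 0, 0x29: 1, 0x2A: 2, 0x2B: 3}  # ( ) * +
SETS_96 = {0x2D: 1, 0x2E: 2, 0x2F: 3}           # - . /


def parse_designation(sequence: bytes):
    """Return (working_set, set_size, multibyte, identifier_bytes)."""
    if len(sequence) < 3 or sequence[0] != ESC:
        raise ValueError("not a designation sequence")
    body = sequence[1:]
    multibyte = body[0] == 0x24  # '$' marks a multi-byte set
    if multibyte:
        body = body[1:]
        # Legacy forms: ESC $ @, ESC $ A and ESC $ B designate to G0.
        if body in (b"@", b"A", b"B"):
            return 0, 94, True, body
    if len(body) < 2:
        raise ValueError("truncated designation sequence")
    if body[0] in SETS_94:
        working_set, size = SETS_94[body[0]], 94
    elif body[0] in SETS_96:
        working_set, size = SETS_96[body[0]], 96
    else:
        raise ValueError("not a graphical set designation")
    # Whatever follows identifies the set (final byte, possibly preceded
    # by further intermediate bytes).
    return working_set, size, multibyte, body[1:]
```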
There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within ISO/IEC 10646 (UCS/Unicode), in contexts where processing ANSI escape codes is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.
A table of escape sequence bytes and the designation or other function which they perform is below.
{| class="wikitable"
|-
! Code !! Hex !! Abbr. !! Name !! Effect !! Example
|-
| <code>ESC SP </code> || <code>1B 20 </code> || ACS || Announce code structure || Specifies code features used, e.g. working sets (see below). || <code>ESC SP L</code> <br/>(ISO 4873 level 1)
|- id="CZD"
| <code>ESC ! </code> || <code>1B 21 </code> || CZD || C0-designate || selects a C0 control character set to be used. || <code>ESC ! @</code> <br/>(ASCII C0 codes)
|- id="C1D"
| <code>ESC " </code> || <code>1B 22 </code> || C1D || C1-designate || selects a C1 control character set to be used. || <code>ESC " C</code> <br/>(ISO 6429 C1 codes)
|-
| <code>ESC # </code> || <code>1B 23 </code> || - || (Single control function) || (Reserved for sequences for control functions, see above.) || <code>ESC # 6</code> <br/>(private use: DEC Double Width Line)
|- id="GZDM4"
| <code>ESC $ </code>, <br/><code>ESC $ ( </code> || <code>1B 24 </code>, <br/><code>1B 24 28 </code> || GZDM4 || G0-designate multibyte 94-set || selects a 94<sup>n</sup>-character set to be used for G0. || <code>ESC & @ ESC $ B</code> <br/>(JIS X 0208:1990 in G0)
|-
| <code>ESC ' </code> || <code>1B 27 </code> || - || (not used) || (not used) || -
|- id="GZD4"
| <code>ESC ( </code> || <code>1B 28 </code> || GZD4 || G0-designate 94-set || selects a 94-character set to be used for G0. || <code>ESC ( B</code> <br/>(ASCII in G0)
|-
| <code>ESC , </code> || <code>1B 2C </code> || - || (not used) || Since 96-character sets cannot be designated to G0, this first intermediate byte is not used by the current edition of the standard. However, it is still listed by MARC-8. || -
|- id="G1D6"
| <code>ESC - </code> || <code>1B 2D </code> || G1D6 || G1-designate 96-set || selects a 96-character set to be used for G1. || -
|-
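| <code>ESC % </code> || <code>1B 25 </code> || DOCS || Designate other coding system ("with standard return") || selects an 8-bit code; use <code>ESC % @</code> to return. || <code>ESC % G</code> <br/>(UTF-8)
|-
| <code>ESC % / </code> || <code>1B 25 2F </code> || DOCS || Designate other coding system ("without standard return") || selects an 8-bit code; there is no standard way to return. || <code>ESC % / L</code> <br/>(UTF-16)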
|-
|}
Of particular interest are the sequences which switch to ISO/IEC 10646 (Unicode) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8 (which does not reserve the range 0x80–0x9F for control characters), its predecessor UTF-1 (which mixes GR and GL bytes in multi-byte codes), and UTF-16 and UTF-32 (which use wider coding units).
{|class=wikitable
|-
!Unicode Format!!Code(s)!!Hex!!Deprecated codes!!Deprecated hex
|-
|UTF-8||<code>ESC % G</code>, <br /><code>ESC % / I</code>||<code>1B 25 47</code>, <br /><code>1B 25 2F 49</code>||<code>ESC % / G</code>, <br /><code>ESC % / H</code>||<code>1B 25 2F 47</code>, <br /><code>1B 25 2F 48</code>
|-
|UTF-16||<code>ESC % / L</code>||<code>1B 25 2F 4C</code>||<code>ESC % / @</code>, <br /><code>ESC % / C</code>, <br /><code>ESC % / E</code>, <br /><code>ESC % / J</code>, <br /><code>ESC % / K</code>||<code>1B 25 2F 40</code>, <br /><code>1B 25 2F 43</code>, <br /><code>1B 25 2F 45</code>, <br /><code>1B 25 2F 4A</code>, <br /><code>1B 25 2F 4B</code>
|-
|UTF-32||<code>ESC % / F</code>||<code>1B 25 2F 46</code>||<code>ESC % / A</code>, <br /><code>ESC % / D</code>||<code>1B 25 2F 41</code>, <br /><code>1B 25 2F 44</code>
|}
Of the sequences switching to UTF-8, <code>ESC % G</code> is the one supported by, for example, xterm.
Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding (i.e. <code>001B 0025 0040</code> for UTF-16), i.e. the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.
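The padding described above amounts to widening each byte of the escape sequence to one code unit. A minimal sketch, assuming big-endian code units and using the hypothetical helper name `pad_escape`:

```python
def pad_escape(sequence: bytes, unit_size: int, byteorder: str = "big") -> bytes:
    """Widen each byte of an escape sequence to one code unit of unit_size bytes."""
    return b"".join(byte.to_bytes(unit_size, byteorder) for byte in sequence)


# The standard return sequence ESC % @ padded for UTF-16 (2-byte units)
# and UTF-32 (4-byte units).
utf16_return = pad_escape(b"\x1b%@", 2)
utf32_return = pad_escape(b"\x1b%@", 4)
```

Since all three bytes are ASCII, the result is the same as encoding the three characters in the target Unicode format.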
For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.
ISO-2022-JP, standardised in RFC 1468, has an advantage over other encodings for Japanese in that it does not require 8-bit clean transmission. Microsoft calls it Code page 50220. A number of other variants are defined by vendors, including IBM. ISO-2022-JP starts in ASCII and includes the following escape sequences:
- <code>ESC ( B</code> to switch to ASCII (1 byte per character)
- <code>ESC ( J</code> to switch to JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
- <code>ESC $ @</code> to switch to JIS X 0208-1978 (2 bytes per character)
- <code>ESC $ B</code> to switch to JIS X 0208-1983 (2 bytes per character)
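To illustrate how the four escape sequences above partition an ISO-2022-JP byte stream, the following sketch splits a stream into runs labelled by the active character set. It is not a decoder; the function name `split_runs` and the label strings are illustrative. Python's standard `iso2022_jp` codec is used afterwards only as a cross-check.

```python
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_runs(data: bytes):
    """Split ISO-2022-JP bytes into (character_set, raw_bytes) runs."""
    runs = []
    current = "ASCII"  # the stream starts in ASCII
    start = i = 0
    while i < len(data):
        if data[i] == 0x1B:
            escape = data[i:i + 3]
            if escape not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence")
            if i > start:
                runs.append((current, data[start:i]))
            current = DESIGNATIONS[escape]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        runs.append((current, data[start:]))
    return runs


sample = b"abc\x1b$B$3$s$K$A$O\x1b(Bxyz"
runs = split_runs(sample)
text = sample.decode("iso2022_jp")
```

Because every byte of a two-byte character lies in the range 0x21–0x7E, an ESC byte (0x1B) in the stream can only be the start of an escape sequence, which is what makes this simple scan possible.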
Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983. The RFC also notes that some past systems had made erroneous use of the sequence <code>ESC ( H</code> to switch away from JIS X 0208, which is actually registered for ISO-IR-11 (a Swedish variant of ISO 646 and World System Teletext).
Among vendor-defined variants, one is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with <code>ESC $ ( 0</code> to complete the coverage of Super DEC Kanji. The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.
Extended variants such as these are not widely used. ISO-2022-JP-2 adds the following escape sequences to those of ISO-2022-JP:
- <code>ESC $ A</code> to switch to GB 2312-1980 (2 bytes per character)
- <code>ESC $ ( C</code> to switch to KS X 1001-1992 (2 bytes per character)
- <code>ESC $ ( D</code> to switch to JIS X 0212-1990 (2 bytes per character)
- <code>ESC . A</code> to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character) [designated to G2]
- <code>ESC . F</code> to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character) [designated to G2]
ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by <nowiki>RFC 2237</nowiki>, dated 1997.
IBM Japanese TCP
IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets". CCSID 9148 is the standard (RFC 1468) ISO-2022-JP.
{| class="wikitable"
|-
! CCSID !! Name
|-
| 956 || TCP-01
|-
| 957 || TCP-02
|-
| 958 || TCP-03
|-
| 959 || TCP-04
|-
| 5052 || TCP-05
|-
| 5053 || TCP-06
|-
| 5054 || TCP-07
|-
| 5055 || TCP-08
|-
| 9148 || TCP-16
|}
JIS X 0213
The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:
- <code>ESC ( I</code> to switch to JIS X 0201-1976 Kana set (1 byte per character)
- <code>ESC $ ( O</code> to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
- <code>ESC $ ( P</code> to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
- <code>ESC $ ( Q</code> to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)
Other 7-bit versions
ISO-2022-KR is defined in <nowiki>RFC 1557</nowiki>, dated 1993. It encodes ASCII and the Korean double-byte KS X 1001-1992, previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including <code>ESC $ ) C</code> once at the start of a line to designate KS X 1001 to G1.
ISO-2022-CN and ISO-2022-CN-EXT are defined in <nowiki>RFC 1922</nowiki>. They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).
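The ISO-2022-KR scheme described above (designate KS X 1001 to G1 once, then switch with Shift Out and Shift In) can be sketched as a small decoder. This is a simplified illustration, not a conformant implementation: the function name `decode_iso2022_kr` is hypothetical, and the two-byte characters are decoded by setting the high bit of each byte and reusing Python's `euc_kr` codec, on the assumption that EUC-KR is the GR-invoked form of the same set.

```python
DESIGNATOR = b"\x1b$)C"  # designates KS X 1001 to G1
SO, SI = 0x0E, 0x0F      # Shift Out (G1 over GL), Shift In (G0 over GL)


def decode_iso2022_kr(data: bytes) -> str:
    """Decode a simplified ISO-2022-KR byte stream to text."""
    out = []
    in_g1 = False
    i = 0
    while i < len(data):
        if data.startswith(DESIGNATOR, i):
            i += len(DESIGNATOR)
        elif data[i] == SO:
            in_g1 = True
            i += 1
        elif data[i] == SI:
            in_g1 = False
            i += 1
        elif in_g1 and 0x21 <= data[i] <= 0x7E:
            # Two GL bytes form one KS X 1001 character; move them to GR.
            pair = bytes(b | 0x80 for b in data[i:i + 2])
            out.append(pair.decode("euc_kr"))
            i += 2
        else:
            # ASCII graphic characters and control characters pass through.
            out.append(chr(data[i]))
            i += 1
    return "".join(out)
```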
The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643 (due to these two planes being sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix).
In HTML5, 7-bit ISO 2022 encodings other than ISO-2022-JP are handled by a "replacement" decoder, which maps all input to the replacement character (U+FFFD), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server. Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.
In April 2024, a security flaw was found in the implementation of ISO-2022-CN-EXT in glibc, which led to recommendations to disable the encoding entirely on Linux systems.
ISO/IEC 4873
Figure: Relationship between ECMA-43 (ISO/IEC 4873) editions and levels, and EUC.
A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1.
ISO/IEC 4873 / ECMA-43 defines three levels of encoding:
- Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
- Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.
- Level 3, which additionally permits the GR-invoking locking shifts LS1R, LS2R and LS3R, but otherwise has the same restrictions as level 2.
Earlier editions of the standard were less restrictive about the G0 set; for instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set (ASCII).
In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in. For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.
ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873.
Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively (but omitting G2 and G3 designations for level 1), with an F-byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, which are as follows:
{| class="wikitable"
|-
! Code !! Hex !! Announcement
|-
| <code>ESC SP L</code> || <code>1B 20 4C</code> || ISO 4873 Level 1
|-
| <code>ESC SP M</code> || <code>1B 20 4D</code> || ISO 4873 Level 2
|-
| <code>ESC SP N</code> || <code>1B 20 4E</code> || ISO 4873 Level 3
|-
|}
Extended Unix Code
Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are (if present) invoked using the single shifts SS2 and SS3, which are used as CR bytes (i.e. 0x8E and 0x8F respectively) and invoke over GR (not GL).
The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman (KS X 1003) or JIS-Roman (the lower half of JIS X 0201).
{|class=wikitable
|-
!Individual sequence!!Hexadecimal!!Feature of EUC denoted
|-
|<code>ESC SP C</code>||<code>1B 20 43</code>||ISO-8 (8-bit, G0 in GL, G1 in GR)
|-
|<code>ESC SP Z</code>||<code>1B 20 5A</code>||G2 accessed using SS2
|-
|<code>ESC SP [</code>||<code>1B 20 5B</code>||G3 accessed using SS3
|-
|<code>ESC SP \</code>||<code>1B 20 5C</code>||Single-shifts invoke over GR
|}
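The EUC structure described above can be illustrated by splitting a byte stream into units and labelling each with the working set it belongs to. The sketch below assumes the widths used by EUC-JP specifically (two bytes for G1, one byte after SS2 for G2, two bytes after SS3 for G3); other EUC forms differ, and the function name `euc_jp_units` is hypothetical.

```python
SS2, SS3 = 0x8E, 0x8F


def euc_jp_units(data: bytes):
    """Split EUC-JP bytes into (working_set, raw_bytes) units."""
    units = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SS2:
            label, width = "G2", 2   # SS2 plus one GR byte
        elif byte == SS3:
            label, width = "G3", 3   # SS3 plus two GR bytes
        elif byte >= 0xA1:
            label, width = "G1", 2   # two GR bytes
        elif byte < 0x80:
            label, width = "G0", 1   # one GL byte (or a C0 control)
        else:
            label, width = "C1", 1   # any other CR byte
        unit = data[i:i + width]
        if len(unit) < width:
            raise ValueError("truncated multi-byte character")
        units.append((label, unit))
        i += width
    return units


sample = b"a\xa4\xa2\x8e\xb1"
units = euc_jp_units(sample)
```

Here the single shifts are followed by GR bytes, reflecting the fact that EUC invokes G2 and G3 over GR rather than GL.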
Compound Text (X11)
The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989. This uses only four control codes: HT (0x09), NL (newline, coded as LF, 0x0A), ESC (0x1B), and CSI (in its 8-bit representation 0x9B), with the SDS CSI sequence being used for bidirectional text control. It is an 8-bit code using G0 and G1 for GL and GR, and follows ISO-8859-1 in its initial state. The following F-bytes are used:
{|class="wikitable collapsible"
|+ISO 2022 designation sequences used in X11 Compound Text
|-
!Escape sequence type!!Final byte!!Graphical set
|-
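|rowspan=3|GZD4, G1D4 (for 94-character sets)||0x42 (<code>B</code>)||ASCII
|-
|0x49 (<code>I</code>)||JIS X 0201 katakana
|-
|0x4A (<code>J</code>)||JIS X 0201 Roman
|-
|rowspan=9|G1D6 (for 96-character sets)||0x41 (<code>A</code>)||ISO-8859-1 high part
|-
|0x42 (<code>B</code>)||ISO-8859-2 high part
|-
|0x43 (<code>C</code>)||ISO-8859-3 high part
|-
|0x44 (<code>D</code>)||ISO-8859-4 high part
|-
|0x46 (<code>F</code>)||ISO-8859-7 high part
|-
|0x47 (<code>G</code>)||ISO-8859-6 high part
|-
|0x48 (<code>H</code>)||ISO-8859-8 high part
|-
|0x4C (<code>L</code>)||ISO-8859-5 high part
|-
|0x4D (<code>M</code>)||ISO-8859-9 high part
|-
|rowspan=3|GZDM4, G1DM4 (for 2-byte sets)||0x41 (<code>A</code>)||GB 2312
|-
|0x42 (<code>B</code>)||JIS X 0208
|-
|0x43 (<code>C</code>)||KS C 5601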
|}
For specifying encodings by labels, X11 Compound Text defines five private-use DOCS sequences: <code>ESC % / 0</code> (<code>1B 25 2F 30</code>) for variable-length encodings, and <code>ESC % / 1</code> through <code>ESC % / 4</code> for fixed-length encodings using one through four bytes respectively. Rather than using another escape sequence to return to ISO 2022, the two bytes following the initial escape sequence specify the remaining length in bytes, coded in base-128 using bytes 0x80–0xFF. The encoding label is included in ISO 8859-1 before the encoded text, and terminated with 0x02 (STX).
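The length-prefixed layout described above can be sketched as follows. The details are assumptions based on my reading of the X11 Compound Text specification: the introducer is `ESC % /` followed by a final byte `0` to `4`, the two length bytes each have their high bit set and combine as (M − 128) × 128 + (L − 128), and the encoding label is terminated by STX (0x02). The function name `parse_extended_segment` and the label `koi8-r` in the usage example are purely illustrative.

```python
def parse_extended_segment(data: bytes):
    """Parse one Compound Text extended segment at the start of data.

    Returns (octets_per_char, encoding_label, payload, next_index), where
    octets_per_char is 0 for variable-length encodings.
    """
    if len(data) < 6 or data[:3] != b"\x1b%/":
        raise ValueError("not an extended segment")
    final = data[3]
    if not 0x30 <= final <= 0x34:
        raise ValueError("unsupported final byte")
    m, l = data[4], data[5]
    if m < 0x80 or l < 0x80:
        raise ValueError("length bytes must have the high bit set")
    # Base-128 length of everything after the two length bytes.
    length = (m - 0x80) * 128 + (l - 0x80)
    body = data[6:6 + length]
    if len(body) < length:
        raise ValueError("truncated segment")
    label, separator, payload = body.partition(b"\x02")
    if not separator:
        raise ValueError("missing STX after encoding label")
    return final - 0x30, label.decode("latin-1"), payload, 6 + length


segment = b"\x1b%/1" + bytes([0x80, 0x8B]) + b"koi8-r\x02" + b"\xc1\xc2\xc3\xc4" + b"tail"
parsed = parse_extended_segment(segment)
```

Because the length is explicit, the payload may contain any byte values, including ones that would otherwise be read as escape sequences.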
Comparison with other encodings
Advantages
- As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. In practice, this 7-bit compatibility matters mainly for backwards compatibility with older systems, since the vast majority of modern computers use 8 bits for each byte.
- As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.
Disadvantages
- Since ISO/IEC 2022 is a stateful encoding, a program cannot jump in the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump in the middle of the text may require a backup to the previous escape sequence before the bytes following the escape sequence can be interpreted.
- Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
- Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 (e.g. "ISO 2022 IR 100") in addition to supporting several other encodings. This type of variation makes it difficult to portably transfer text between computer systems.
- UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
- Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string (such as cross-site scripting) is masked until it is decoded to Unicode, which may allow it to bypass sanitisation. For this reason, 7-bit ISO 2022 data (except for ISO-2022-JP) is mapped in its entirety to the replacement character in HTML5 to prevent attacks. Implementing related security measures, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected characters being generated where two ISO-2022-JP streams have been concatenated.
See also
- ISO 2709
- ISO/IEC 646
- ISO-IR-102
- C0 and C1 control codes
- CJK characters
- MARC standards
- Mojibake
- luit
- ISO/IEC JTC 1/SC 2
External links
- ISO/IEC 2022:1994
- ISO/IEC 2022:1994/Cor 1:1999
- ECMA-35, equivalent to ISO/IEC 2022 and freely downloadable.
- International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences
- History of Character Codes in North America, Europe, and East Asia from 1999, rev. 2004
- Ken Lunde's CJK.INF: a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022.
