UTF-32 (32-bit Unicode Transformation Format), sometimes called UCS-4, is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 2<sup>32</sup> Unicode code points, needing actually only 21 bits).
The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except, for example, in the case of texts with some popular emojis), and can typically be ignored for sizing estimates. This makes UTF-32 close to twice the size of UTF-16. It can be up to four times the size of UTF-8 depending on how many of the characters are in the ASCII subset. Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF, and 0x60000000 to 0x7FFFFFFF these areas were removed in later versions. Because the Principles and Procedures document of ISO/IEC JTC 1/SC 2 Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points and UTF-32 and UCS-4 are identical.
Utility of fixed width
A fixed number of bytes per code point has theoretical advantages, but each of these has problems in reality:
- Truncation becomes easier, but not significantly so compared to UTF-8 and UTF-16 (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).
- Finding the Nth "character" in a string. Finding the Nth code point is a O(1) problem, while it is O(n) problem in a variable-width encoding. However what a user might call a "character" is still variable-width, for instance the combining character sequence is two code points, the emoji is three, and the ligature is one.
- Quickly knowing the "width" of a string. However even "fixed width" fonts have varying width, often CJK ideographs are twice as wide,
Programming languages
Python versions up to 3.2 can be compiled to use UTF-32 strings instead of UTF-16; from version 3.3 onward, Unicode strings are stored in UTF-32 if there is at least 1 non-BMP character in the string, but with leading zero bytes optimized away "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)" to make all code points that size. <!-- In previous versions of Python, "\U0001F51F" (UTF-32) was equivalent to "\ud83d\udd1f" (UCS-2).
However, this is true in languages like JavaScript. This also means that a non-BMP character is not equivalent to its surrogate pair (example: <code>"\U0001F51F" != "\ud83d\udd1f"</code>) unlike most programming languages. -->
The Julia programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to having only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to package) following the "UTF-8 Everywhere Manifesto".
C++11 has 2 built-in data types that use UTF-32. The <code>char32_t</code> data type stores 1 character in UTF-32. The <code>u32string</code> data type stores a string of UTF-32-encoded characters. A UTF-32-encoded character or string literal is marked with <code>U</code> before the character or string literal.
<syntaxhighlight lang="c++">
- include <string>
char32_t UTF32_character = U'🔟'; // also written as U'\U0001F51F'
std::u32string UTF32_string = U"UTF–32-encoded string"; // defined as `const char32_t*´
</syntaxhighlight>C# has a <code>UTF32Encoding</code> class which represents Unicode characters as bytes, rather than as a string.
Variants
Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the WTF-8 variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to CESU-8. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
UTF-32 has 2 versions for big-endian and little-endian: UTF-32-BE and UTF-32-LE.
See also
- Comparison of Unicode encodings
- UTF-16
Notes
References
External links
- The Unicode Standard 5.0.0, chapter 3 formally defines UTF-32 in § 3.9, D90 (PDF page 40) and § 3.10, D99-D101 (PDF page 45)
- Unicode Standard Annex #19 formally defined UTF-32 for Unicode 3.x (March 2001; last updated March 2002)
- Registration of new charsets: UTF-32, UTF-32BE, UTF-32LE announcement of UTF-32 being added to the IANA charset registry (April 2002)
