GBK is an extension of the GB 2312 character set for Simplified Chinese characters, used in the People's Republic of China. It includes all unified CJK characters found in GB 13000.1-93, i.e. ISO/IEC 10646:1993, or Unicode 1.1. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the IANA-registered internet name for the Microsoft mapping, which differs from other implementations primarily by the single-byte euro sign at 0x80.
GB abbreviates Guójiā Biāozhǔn, which means "national standard" in Chinese, while K stands for Extension (扩展 kuòzhǎn). GBK extended the old standard not only with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB 2312 in 1981. With the arrival of GBK, names containing characters that were formerly unrepresentable, such as the 镕 (róng) in former Chinese Premier Zhu Rongji's name, became representable.
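The representability gap can be checked with a short sketch. It assumes that Python's built-in `gb2312` and `gbk` codecs are a fair stand-in for the two character sets; the helper name `can_encode` is made up for this illustration.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Zhu Rongji's name; the middle character is the one missing from GB 2312.
name = "朱镕基"
for codec in ("gb2312", "gbk"):
    print(codec, can_encode(name, codec))
```

Under that assumption, the name as a whole is rejected by the `gb2312` codec and accepted by the `gbk` codec, while the first and last characters alone are accepted by both.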
GBK is the third-most declared encoding served from China and territories (after UTF-8 and the subset GB 2312), with 1.3% of web servers serving a page that declares GBK. However, all major web browsers decode GB2312-marked documents as if they were marked GBK, i.e. not as a subset, so GBK is in effect the second-most popular encoding. The one exception is the label <code>GB_2312</code> in Safari and Edge; both do, however, decode <code>GB_2312-80</code> and <code>GB2312</code> as the superset GBK. Together, GBK and GB 2312 have a combined 3.5% presence in China and territories.
History
In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB 13000.1-93, the Guobiao standard equivalent of Unicode 1.1.
The GBK character set was defined in 1993 as an extension of GB 2312-80, while also including the characters of GB 13000.1-93 through the unused code points available in GB 2312. Hence GBK is backward compatible with GB 2312. GBK was defined in a normative annex to GB 13000.1-93.
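Backward compatibility means that text limited to the GB 2312 repertoire has the same byte representation in both encodings. The following is a minimal sketch of that property, assuming Python's `gb2312` and `gbk` codecs and using ordinary Chinese characters as sample data; `same_bytes` is a hypothetical helper name.

```python
def same_bytes(text):
    """Return True if text encodes to identical bytes in GB 2312 and GBK."""
    return text.encode("gb2312") == text.encode("gbk")


sample = "汉字编码"  # four common characters that are all in GB 2312
print(same_bytes(sample))

# Bytes produced by a GB 2312 encoder can be read back by a GBK decoder.
print(sample.encode("gb2312").decode("gbk") == sample)
```

The sample deliberately sticks to Chinese characters; a few punctuation marks are mapped differently by some implementations and are not covered by this sketch.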
Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. While GBK was never an official standard, widespread usage of Windows 95 led to GBK becoming the de facto standard. While GBK included all the Chinese characters defined in Unicode 1.1 and GB 13000.1-93, these standards used different code tables. The primary reason for its existence was simply to bridge the gap between GB 2312-80 and GB 13000.1-93.
In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Extension Specification, Version 1.0, known as GBK 1.0, which is a slight extension of Code page 936. The 95 newly added characters were not found in GB 13000.1-1993, and were provisionally assigned Unicode PUA code points.
Microsoft later added the euro sign to Code page 936 and assigned the code 0x80 to it. This is not a valid code point in GBK 1.0.
In 2000, the GB 18030-2000 standard was released, superseding GBK 1.0 while maintaining compatibility with it. It increased the number of definitions of Chinese characters and extended the number of possible characters through the implementation of four-byte character spaces. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. The mapping to Unicode has been slightly changed, though, as some characters are now defined in Unicode. In the most up-to-date form of the standard, GB 18030-2005, only 24 characters are still mapped to the Unicode PUA (see GB 18030).
In 2002, GBK was registered as an IANA charset; the registration uses the code page 936 mapping as well as the CP936/MS936 aliases, but refers to the GBK 1.0 specification. The WHATWG Encoding Standard defines a GBK encoder as a GB 18030 encoder with a single-byte euro sign and without four-byte sequences. The W3C's GBK decoder specification has no such limitation: it decodes as GB 18030, i.e. with the same range of characters as all of Unicode.
Encoding
A character is encoded as 1 or 2 bytes. A byte in the range <code>00</code>–<code>7F</code> is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.
A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range <code>81</code>–<code>FE</code> (that is, never <code>80</code> or <code>FF</code>), and the second byte is <code>40</code>–<code>A0</code> except <code>7F</code> for some areas and <code>A1</code>–<code>FE</code> for others.
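The byte rules described so far can be sketched as a small splitter that cuts a byte string into one-byte and two-byte units. This is an illustration only: it checks ranges, not whether a given two-byte code is actually assigned, and it treats <code>80</code> as invalid (as in GBK 1.0, unlike Code page 936). The function name `split_gbk` is made up for this example.

```python
def split_gbk(data):
    """Split bytes into 1- and 2-byte GBK units using range checks only."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # 00-7F: a single byte with its ASCII meaning
            units.append(data[i:i + 1])
            i += 1
            continue
        if not 0x81 <= lead <= 0xFE:
            # 80 and FF are never lead bytes under these rules
            raise ValueError(f"invalid lead byte {lead:02X} at offset {i}")
        if i + 1 >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        trail = data[i + 1]
        if not 0x40 <= trail <= 0xFE or trail == 0x7F:
            raise ValueError(f"invalid second byte {trail:02X} at offset {i + 1}")
        units.append(data[i:i + 2])
        i += 2
    return units


print(split_gbk(b"A\xd6\xd0B"))
```

Because a second byte may fall in the ASCII range (<code>40</code>–<code>7E</code>), a scanner has to walk the data from the start as above; it cannot classify a byte in isolation.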
More specifically, the following ranges of bytes are defined:
{|class="wikitable"
|+GBK Encoding Ranges
|-
!rowspan="2"|range || rowspan="2"|byte 1 || rowspan="2"|byte 2 || rowspan="2"|code points || colspan="4"|characters
|-
!GB 18030 || GBK 1.0 || Codepage 936 || GB 2312
|-
|Level GBK/1 || <code>A1</code>–<code>A9</code> || <code>A1</code>–<code>FE</code>
|align="right"|846 || style="text-align:right;"|718
|}
GBK's successor, GB 18030, uses the remaining range available to the second byte (<code>30</code>–<code>39</code>) to further expand the number of possibilities while retaining GBK as a subset.
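The relationship between the two encodings can be observed with a brief sketch, assuming Python's `gbk` and `gb18030` codecs follow the respective specifications. The sample characters are arbitrary choices for illustration: a common Chinese character and U+20000, which lies outside the Basic Multilingual Plane.

```python
common = "中"            # present in GB 2312, and therefore in GBK
beyond = "\U00020000"    # a CJK Extension B character, outside GBK

# A character shared by both encodings has the same two-byte form.
print(common.encode("gbk") == common.encode("gb18030"))

# A character outside GBK takes four bytes in GB 18030, and the second
# byte falls in a range that GBK never uses for a second byte.
four = beyond.encode("gb18030")
print(len(four), f"{four[1]:02X}")

# The GBK codec has no representation for it.
try:
    beyond.encode("gbk")
except UnicodeEncodeError as err:
    print("gbk:", err.reason)
```

Since the second byte of a GBK pair is never below <code>40</code>, a decoder can tell a four-byte sequence from a two-byte one as soon as it reads the second byte.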
External links
- A scan of the GBK 1.0 specification provided by the Ideographic Research Group
- ICU's Authoritative GBK mapping - part of GB18030 data
- Microsoft Reference page for GBK
- Mapping of GBK to Unicode N.B.: this is Microsoft code page 936, which contains entries for 21791 double-byte code points, 96 single-byte graphic characters, and 33 control characters. This is not exactly the same as GBK which has 21886 characters.
- GBK Code Table N.B. This gbk-encoded page shows the available coding space totally populated except for 2 places, for a total of 32256 glyphs (32352 with the implied single-byte ASCII codes not illustrated), which is more than 23940 or 21886. Actual rendering of this table depends on your browser's GBK decoder.
