Character Sets


Windows NT and all later Windows versions use Unicode with UTF-16 encoding. In Delphi, UTF-16 encoded characters are represented as wide characters.

Old versions (Windows 95/98/Me) of the Western European Windows editions (including English, French, and German) use the ANSI Latin-1 character set. However, other editions of these (pre-Windows NT) Windows versions use different character sets. For example, the Japanese version of Windows 95 uses the Shift-JIS character set, which represents Japanese characters as multibyte character codes.

There are generally three types of character sets:

  • Single-byte
  • Multibyte
  • Wide (Unicode)

We strongly recommend using Unicode in your applications, but many applications or data files still depend on the legacy code pages.

Single-byte Characters

In a single-byte character set, each character is represented by a single byte. That is, single-byte character sets use character encoding schemes that map each character to a numeric value (or code point) that fits in one byte (8 bits). A code page is a specific mapping (encoding scheme) of the characters in a character set to code points.

In Delphi, single-byte characters can be represented with the AnsiChar type.

Single-byte character sets (or code pages) were used by old (pre-Windows NT) Windows versions. For example, the ANSI Latin-1 character set (1252 code page) is the single-byte character set for the old Western European Windows editions. ANSI is the acronym for the American National Standards Institute.

Characters numbered 0 through 127 (the lower 7 bits) are the same in each of these single-byte character sets and form the 7-bit set called ASCII; the values 32 through 126 are its printable characters. ASCII is the acronym for the American Standard Code for Information Interchange.

Windows code pages, commonly called ANSI code pages, are single-byte code pages for which non-ASCII values represent international characters. Characters in the non-ASCII range (numbered 128 through 255) are called extended characters and vary from code page to code page. The set of extended characters determines which languages the code page can support. These ANSI code pages were used natively in old Windows versions and are still available on Windows NT and later.
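The effect of code-page-specific extended characters can be seen by decoding one byte under several code pages. The following is a minimal sketch in Python (not Delphi), chosen because its standard codecs include the Windows code pages; the helper name decode_in_code_pages and the selection of code pages are illustrative assumptions.

```python
# One extended byte (0xE9) interpreted under three Windows (ANSI) code pages.
# The helper name and the choice of code pages are illustrative only.
EXTENDED_BYTE = bytes([0xE9])
CODE_PAGES = ["cp1252", "cp1251", "cp1253"]  # Western European, Cyrillic, Greek


def decode_in_code_pages(data: bytes, code_pages: list) -> dict:
    """Decode the same bytes once per code page and return the results."""
    return {cp: data.decode(cp) for cp in code_pages}


if __name__ == "__main__":
    # The extended byte yields a different character in each code page.
    for cp, text in decode_in_code_pages(EXTENDED_BYTE, CODE_PAGES).items():
        print(cp, repr(text))

    # Bytes in the ASCII range decode identically everywhere.
    print(decode_in_code_pages(b"ABC", CODE_PAGES))
```

The byte value never changes; only the code page chosen for decoding decides which character it stands for. This is why single-byte text must always travel together with knowledge of its code page.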

Original equipment manufacturer (OEM) code pages are code pages for which non-ASCII values represent line drawing and punctuation characters. These code pages were originally used for MS-DOS and are still used as the default code pages for console applications. OEM code pages are also used for non-extended filenames in the old FAT12, FAT16, and FAT32 file systems. In contrast, the NTFS file system stores filenames in Unicode.

Your application can convert between Windows and OEM code pages using the Windows API functions CharToOem, CharToOemBuf, OemToChar, and OemToCharBuf. However, using these functions risks data loss, because the sets of characters that a Windows code page and the corresponding OEM code page can represent do not match exactly.

In US English Windows editions, the usual OEM code page is 437; most other Western European editions use OEM code page 850. Both differ from code page 1252 (ANSI Latin-1). In some other Windows locales, notably the Far East editions, the OEM code page is the same as the ANSI code page.
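The data-loss risk can be illustrated without the Windows API. The sketch below, in Python rather than Delphi, converts bytes from code page 1252 to code page 437 and back. The function names are hypothetical, and the "?" substitution is the behavior of Python's "replace" error handler, which is an assumption of this sketch and not necessarily what CharToOem produces for an unmappable character.

```python
# Round trip between an ANSI code page (1252) and an OEM code page (437).
# Function names are hypothetical; unmappable characters become "?" here,
# which is Python's "replace" behavior, not necessarily that of CharToOem.

def ansi_to_oem(data: bytes, ansi: str = "cp1252", oem: str = "cp437") -> bytes:
    """Reinterpret ANSI-encoded bytes as text and re-encode them as OEM."""
    return data.decode(ansi).encode(oem, errors="replace")


def oem_to_ansi(data: bytes, ansi: str = "cp1252", oem: str = "cp437") -> bytes:
    """Reinterpret OEM-encoded bytes as text and re-encode them as ANSI."""
    return data.decode(oem).encode(ansi, errors="replace")


if __name__ == "__main__":
    original = "Café €5".encode("cp1252")
    oem = ansi_to_oem(original)
    restored = oem_to_ansi(oem)

    print(original)  # the accented letter and the euro sign are single bytes
    print(oem)       # the letter gets a different byte; the euro sign is lost
    print(restored == original)
```

The accented letter survives because both code pages contain it, although at different byte values. The euro sign exists only in code page 1252, so the round trip cannot restore it.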

Multibyte Characters

The ideographic character sets used in Asia cannot use the simple 1:1 mapping between characters in the language and the one-byte AnsiChar type characters. These languages have too many characters to be represented using single-byte characters.

One approach to working with ideographic character sets is to use a character encoding scheme that maps each character to a numeric value (or code point) that is larger than one byte. Such characters are referred to as multibyte characters. Multibyte character sets provide a way to encode characters outside the standard ANSI range.

In a multibyte character set, some characters are represented by one byte and others by more than one byte (2, 3, or 4 bytes).

Many systems for encoding multibyte character sets have been devised. For instance, the Shift-JIS character set (code page 932) is a character encoding for Japanese. In Shift-JIS, characters are represented by one- or two-byte code points. A string of such characters is a string of single-byte elements in which each character has a variable width.

Multibyte character set, or MBCS, is a term used to describe code pages whose characters are encoded as sequences of one or more bytes and stored in single-byte strings. Such encodings typically retain single-byte characters for backward compatibility. The first byte of a multibyte character is called the lead byte. In general, the lower 128 values of a multibyte character set map to the 7-bit ASCII characters, and lead bytes have ordinal values greater than 127. The bytes that follow a lead byte (trail bytes) can have values that also occur as single-byte characters, so the interpretation of each byte in a multibyte string depends on the bytes that precede it. Thus, the only way to tell whether a particular byte represents a single-byte character or is part of a multibyte character is to read the string from the beginning, consuming two or more bytes as one character whenever a lead byte is encountered.

For these reasons, you cannot process multibyte character strings the way you process single-byte character strings. When writing code for Asian locales, handle all string manipulation with functions that can parse strings into multibyte characters, and use a string type appropriate for multibyte character data, such as the AnsiString type.

The following units provide functions for handling string manipulation and parsing strings into multibyte characters:

Remember that the length of such strings in bytes does not necessarily correspond to the length of the strings in characters. Be careful not to truncate strings by cutting a multibyte character in half. Do not pass multibyte characters as a parameter to a function or procedure, since the size of a multibyte character can't be known up front. Instead, always pass a pointer to a character or a string.
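The difference between byte length and character length, and the danger of cutting a character in half, can be sketched as follows. The example is in Python rather than Delphi, the names is_lead_byte and truncate_bytes are hypothetical, and the lead-byte ranges are those of Shift-JIS.

```python
# Truncate a Shift-JIS byte string without splitting a two-byte character.
# Hypothetical helper names; lead-byte ranges are those of Shift-JIS only.

def is_lead_byte(value: int) -> bool:
    """Return True if the byte starts a two-byte Shift-JIS character."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def truncate_bytes(data: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of at most max_bytes that ends on a
    character boundary."""
    end = 0
    while end < len(data):
        width = 2 if is_lead_byte(data[end]) else 1
        if end + width > max_bytes:
            break  # the next character would not fit completely
        end += width
    return data[:end]


if __name__ == "__main__":
    text = "日本語"
    data = text.encode("shift_jis")
    print(len(text), len(data))  # character count versus byte count

    try:
        data[:3].decode("shift_jis")  # naive cut lands inside a character
    except UnicodeDecodeError as exc:
        print("naive truncation failed:", exc.reason)

    print(truncate_bytes(data, 3).decode("shift_jis"))
```

The safe version gives up one byte of the allowed length rather than keep a dangling lead byte.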

Multibyte character sets - especially double-byte character sets (DBCS) - were widely used for Asian languages in old (pre-Windows NT) Windows operating systems.

Wide Characters - Unicode

Ideographic character sets can also be represented in Unicode. Unicode consists of two features:

  • The Universal Character Set, which provides a repertoire of more than 100,000 characters (code points).
  • The Unicode Transformation Format (UTF) encoding schemes. Each UTF can encode the entire Unicode character repertoire, so Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16, both of which are variable-length.

Since Windows NT, all Windows versions are built from the ground up using Unicode. That is, all the Windows core functions for creating windows, displaying text, performing string manipulations, and so forth require Unicode strings. If you call any Windows core function passing it an ANSI string (a string of 1-byte characters), the function first converts the string to Unicode and then passes the Unicode string to the operating system. All these conversions occur invisibly to you.

The Unicode standard describes a system for representing characters used in all the world's writing systems. Unicode's Universal Character Set has a codespace of 1,114,112 code points. Depending on the encoding, a Unicode character may be encoded as 1, 2, 3, or 4 bytes. The Unicode Transformation Format (UTF) encoding systems are equivalent representations of characters and can be converted to one another without loss:

  • UTF-8 uses one to four bytes per code point. (It uses 1 byte for all ASCII characters and up to 4 bytes for other characters).
  • UTF-32 represents each character as four bytes.
  • UTF-16 uses either two or four bytes per code point. Characters in the Basic Multilingual Plane (BMP), which contains most of the world's characters in current use, are represented in two bytes (16 bits). Characters in the other planes are encoded as a pair of 16-bit words, together called a surrogate pair. The first 256 Unicode code points match the ISO 8859-1 (Latin-1) character set; code page 1252 differs from it only in the range 128 through 159.
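The sizes listed above, and the way a surrogate pair is derived, can be checked with a short sketch. It is written in Python rather than Delphi, and the names encoded_sizes and to_surrogate_pair are hypothetical. The surrogate arithmetic follows the UTF-16 definition: subtract $10000, then split the remaining 20 bits into two 10-bit halves.

```python
# Byte counts per encoding, and the arithmetic behind a surrogate pair.
# Function names are hypothetical helpers for this illustration.

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each UTF."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # explicit byte order, no BOM
        "utf-32": len(ch.encode("utf-32-le")),
    }


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000               # 20 significant bits remain
    high = 0xD800 + (offset >> 10)              # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)             # lower 10 bits
    return high, low


if __name__ == "__main__":
    for ch in ["A", "é", "€", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))

    high, low = to_surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")
```

The "-le" codec variants are used so that the byte counts exclude the byte order mark that Python's plain "utf-16" and "utf-32" codecs prepend.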

We strongly recommend using Unicode in your applications.

The Windows operating system supports UTF-16 (which maps each character to a sequence of 16-bit numeric values). In Delphi, Unicode character strings can be represented with the UnicodeString or WideString types.

  • The WideString type represents a string of two-byte character elements. Such 16-bit characters are referred to as wide characters. A WideChar is a two-byte element, and a PWideChar is a pointer to a null-terminated string of two-byte character elements. The WideString type is essentially the same as a Windows BSTR; therefore, WideString should be used in COM applications. A WideString typically contains UTF-16 encoded characters. Since a code point may be represented by 2 or 4 bytes, the number of two-byte elements in a WideString is not necessarily the number of characters in the string.
  • The UnicodeString type represents Unicode character strings. The UnicodeString type is reference counted.
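The gap between the number of 16-bit elements and the number of characters can be sketched as follows. The example is in Python rather than Delphi, and the names utf16_units and count_code_points are hypothetical. It assumes well-formed UTF-16, in which every low surrogate is the second half of a pair.

```python
# Count code points in a sequence of UTF-16 code units (16-bit elements).
# Hypothetical helper names; assumes well-formed UTF-16 input.

def utf16_units(text: str) -> list:
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def count_code_points(units: list) -> int:
    """Count code points by skipping low (trailing) surrogates."""
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)


if __name__ == "__main__":
    units = utf16_units("a\U0001F600b")
    print([f"{u:04X}" for u in units])
    print(len(units), count_code_points(units))  # elements versus code points
```

Code that indexes or truncates such a string by element position needs the same care as MBCS code: an index can land on the second half of a surrogate pair.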

Although the WideString type is appropriate for use in COM applications, it is not reference counted. UnicodeString is more flexible and efficient in other types of applications. In addition, more functions are available for handling UnicodeString than WideString, so UnicodeString is generally preferred.

The AnsiString type is used to represent single-byte character strings and can also be used for MBCS. AnsiString is not used for Unicode, and the term MBCS is not used to refer to Unicode.

See Also