Unicode in RAD Studio

RAD Studio uses Unicode-based strings: that is, the type string is a Unicode string (System.UnicodeString) instead of an ANSI string. This topic describes what you need to know to handle strings properly.

If you want to use ANSI strings or wide strings, use the AnsiString and WideString types.

RAD Studio is fully Unicode-compliant, so some changes might be required in the parts of your code that handle strings. However, every effort has been made to keep these changes to a minimum. Although new data types are introduced, existing data types remain and function as they always have. Based on in-house experience with Unicode conversion, existing applications should migrate fairly smoothly.

Existing String Types

The pre-existing data types AnsiString and System.WideString function the same way as before.

Short strings also function the same as before. Note that short strings are limited to 255 characters and contain only a character count and single-byte character data. They do not contain code page information. A short string could contain UTF-8 data for a particular application, but this is not generally true.

AnsiString

Previously, string was an alias for AnsiString. This table shows the location of the fields in AnsiString's previous format:

Previous format of AnsiString Data Type

Field                       Offset
Reference Count             -8
Length                      -4
String Data (byte sized)    0
Null Term                   Length

For RAD Studio, the format of AnsiString has changed. Two new fields (CodePage and ElemSize) have been added. This makes the format of AnsiString identical to that of the new UnicodeString type. (See Long String Types for more information about the new format.)

WideString

System.WideString was previously used for Unicode character data. Its format is essentially the same as the Windows BSTR. WideString is still appropriate for use in COM applications.

New String Type: UnicodeString

The type of string in RAD Studio is the UnicodeString type.

For Delphi, Char and PChar types are now WideChar and PWideChar, respectively.

Note: This differs from versions prior to 2009, in which string was an alias for AnsiString, and the Char and PChar types were AnsiChar and PAnsiChar, respectively.

For C++, the "_TCHAR maps to" option controls the floating definition of _TCHAR, which can be either wchar_t or char.

RAD Studio frameworks and libraries use the UnicodeString type; they do not represent string values as single-byte or MBCS strings.

Format of UnicodeString Data Type

Field                          Offset
CodePage                       -12
Element Size                   -10
Reference Count                -8
Length                         -4
String Data (element sized)    0
Null Term                      Length * elementsize


UnicodeString may be represented as the following Delphi structure:

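type
  StrRec = record
    CodePage: Word;     // code page of the string data
    ElemSize: Word;     // size of one element in bytes
    RefCount: Integer;  // reference count
    Len: Integer;       // length in elements
    // The variant field names below are illustrative only.
    case Integer of
      1: (AnsiData: array[0..0] of AnsiChar);
      2: (WideData: array[0..0] of WideChar);
  end;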

UnicodeString adds the CodePage code page and ElemSize element size fields that describe the string contents. UnicodeString is assignment-compatible with all other string types. However, assignments between AnsiString and UnicodeString still do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.

Note that AnsiString also has CodePage and ElemSize fields.

UnicodeString data is in UTF-16 for the following reasons:

  • UTF-16 matches the underlying operating system format.
  • UTF-16 reduces extra explicit/implicit conversions.
  • It offers better performance when calling the Windows API.
  • There is no need to have the operating system do any conversions with UTF-16.
  • The Basic Multilingual Plane (BMP) already contains the vast majority of the world's active language glyphs and fits in a single UTF-16 Char (16 bits).
  • Unicode surrogate pairs are analogous to the multibyte character set (MBCS), but more predictable and standard.
  • UnicodeString can provide lossless implicit conversions to and from WideString for marshaling COM interfaces.

Characters in UTF-16 may be 2 or 4 bytes, so the number of elements in a string is not necessarily equal to the number of characters. If the string has only BMP characters, the number of characters and elements are equal.
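
The difference between elements and characters can be illustrated with a minimal sketch, assuming a Unicode-enabled Delphi compiler with 1-based string indexing. The CountCodePoints helper is hypothetical, not part of the RTL. It treats a high surrogate followed by a low surrogate as a single character.

```delphi
program CountCodePointsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts Unicode code points by treating a high
// surrogate immediately followed by a low surrogate as one character.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Length(S)) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)  // surrogate pair: two elements, one code point
    else
      Inc(I);    // BMP character (or unpaired surrogate): one element
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // U+1D11E lies outside the BMP. It is written as its surrogate pair
  // so that the source file encoding does not matter.
  S := 'A' + #$D834#$DD1E;
  Writeln('Elements:    ', Length(S));
  Writeln('Code points: ', CountCodePoints(S));
end.
```

For this input, Length reports 3 elements while the helper reports 2 code points. For a string containing only BMP characters, the two numbers are the same.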

UnicodeString offers the following benefits:

  • It is reference-counted.
  • It solves a legacy application problem in C++Builder.
  • Allowing AnsiString to carry encoding information (code page) reduces the potential data loss problem with implicit casts.
  • The compiler ensures the data is correct before mutating data.

WideString is not reference-counted, and so UnicodeString is more flexible and efficient in most types of applications (WideString is more appropriate for COM).

Indexing

Instances of UnicodeString can index characters. Indexing is 1-based, just as for AnsiString. Consider the following code:

var
  C: Char;
  S: string;
begin
  ...
  C := S[1];
  ...
end;

In a case such as shown above, the compiler needs to ensure that data in S is in the proper format. The compiler generates code to ensure that assignments to string elements are the proper type and that the instance is unique (that is, has a reference count of one) via a call to a UniqueString function. For the above code, since the string could contain Unicode data, the compiler needs to also call the appropriate UniqueString function before indexing into the character array.

Compiler Conditionals

In both Delphi and C++Builder, you can use conditionals to allow both Unicode and non-Unicode code in the same source.

Delphi

{$IFDEF UNICODE}

C++Builder

#ifdef _DELPHI_STRING_UNICODE 
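
As a rough Delphi sketch of how the UNICODE conditional might be used, the following program selects a buffer type at compile time. TNativeBuffer is a hypothetical type name introduced only for this illustration.

```delphi
program UnicodeConditionalDemo;

{$APPTYPE CONSOLE}

type
  // Hypothetical buffer type: element type follows the compiler's string model.
{$IFDEF UNICODE}
  TNativeBuffer = array[0..255] of WideChar;
{$ELSE}
  TNativeBuffer = array[0..255] of AnsiChar;
{$ENDIF}

begin
{$IFDEF UNICODE}
  Writeln('string is UnicodeString; SizeOf(Char) = ', SizeOf(Char));
{$ELSE}
  Writeln('string is AnsiString; SizeOf(Char) = ', SizeOf(Char));
{$ENDIF}
  Writeln('Buffer size in bytes: ', SizeOf(TNativeBuffer));
end.
```

The same source compiles with either string model; only the branch matching the compiler is included in the build.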

Summary of Changes

  • string now maps to UnicodeString, not to AnsiString.
  • Char now maps to WideChar (2 bytes, not 1 byte) and is a UTF-16 character.
  • PChar now maps to PWideChar.
  • In C++, System::String now maps to the UnicodeString class.

Summary of What Has Not Changed

  • AnsiString.
  • WideString.
  • AnsiChar, PAnsiChar.
  • WideChar, PWideChar
  • Implicit conversions still work.
  • AnsiString uses the user's active code page.

Code Constructs Independent of Character Size

The following operations do not depend on character size:

  • String concatenation:
    • <string var> + <string var>
    • <string var> + <literal>
    • <literal> + <literal>
    • Concat(<string> , <string>)
  • Standard string functions:
    • Length(<string>) returns the number of char elements, which might not be the same as the number of bytes. Note that the SizeOf function returns the number of bytes, which means that the return value for SizeOf might differ from Length.
    • Copy(<string>, <start>, <length>) returns a substring in Char elements.
    • Pos(<substr>, <string>) returns the index of the first Char element.
  • Operators:
    • <string> <comparison_operator> <string>
    • CompareStr()
    • CompareText()
    • ...
  • FillChar(<struct or memory>)
    • FillChar(Rect, SizeOf(Rect), #0)
    • FillChar(WndClassEx, SizeOf(TWndClassEx), #0). Note that WndClassEx.cbSize := SizeOf(TWndClassEx);
  • Windows API
    • API calls default to their wide ("W") versions.
    • The PChar(<string>) cast has identical semantics.


GetModuleFileName example:

function ModuleFileName(Handle: HMODULE): string;
var
  Buffer: array[0..MAX_PATH] of Char;
begin
  SetString(Result, Buffer,
    GetModuleFileName(Handle, Buffer, Length(Buffer)));
end;

GetWindowText example:

function WindowCaption(Handle: HWND): string;
begin
  SetLength(Result, 1024);
  SetLength(Result,
    GetWindowText(Handle, PChar(Result), Length(Result)));
end;

String character indexing example:

function StripHotKeys(const S: string): string;
var
  I, J: Integer;
  LastChar: Char;
begin
  SetLength(Result, Length(S));
  J := 0;
  LastChar := #0;
  for I := 1 to Length(S) do
  begin
    if (S[I] <> '&') or (LastChar = '&') then
    begin
      Inc(J);
      Result[J] := S[I];
    end;
    LastChar := S[I];
  end;
  SetLength(Result, J);
end;
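
The distinction between element counts and byte counts noted for Length and SizeOf in the list above can be checked with a small sketch. It assumes a Unicode-enabled compiler, where SizeOf(Char) is 2; ByteLength is the SysUtils function that returns the size of a string's data in bytes.

```delphi
program LengthVersusBytes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  Buffer: array[0..15] of Char;  // only its size is inspected here
begin
  S := 'Hello';
  Writeln('Length(S)                = ', Length(S));                 // elements
  Writeln('Length(S) * SizeOf(Char) = ', Length(S) * SizeOf(Char));  // bytes
  Writeln('ByteLength(S)            = ', ByteLength(S));             // bytes
  Writeln('Length(Buffer)           = ', Length(Buffer));            // elements
  Writeln('SizeOf(Buffer)           = ', SizeOf(Buffer));            // bytes
end.
```

Code that passes an element count (for example, to a Windows API buffer-size parameter counted in characters) should use Length; code that passes a byte count should multiply by SizeOf(Char).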

Code Constructs that Depend on Character Size

Some operations depend on character size. The functions and features in the following list also include a "portable" version, when possible. You can similarly rewrite your code to be portable, that is, the code works with both AnsiString and UnicodeString variables.

  • SizeOf(<Char array>) -- use the portable Length(<Char array>).
  • Move(<Char buffer>... CharCount) -- use the portable Move(<Char buffer> ... CharCount * SizeOf(Char)).
  • Stream Read/Write -- use the portable AnsiString, SizeOf(Char), or the TEncoding class.
  • FillChar(<Char array>, <size>, <AnsiChar>) -- use *SizeOf(Char) if filling with #0, or use the portable StringOfChar function.
  • GetProcAddress(<module>, <PAnsiChar>) -- use the provided overload function taking a PWideChar.
  • Casting or using PChar to do pointer arithmetic -- Place {IFDEF PByte = PChar} at the top of the file if you use PChar for pointer arithmetic, or use the {$POINTERMATH <ON|OFF>} Delphi compiler directive to turn on pointer arithmetic for all typed pointers, so that increment/decrement is by element size.
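
The portable forms of Move, FillChar, and StringOfChar from the list above might look like the following sketch. CopyChars is a hypothetical helper written for this illustration; it scales the character count to bytes because Move and FillChar always count bytes.

```delphi
program PortableMoveDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: copies CharCount characters from Source into Dest.
procedure CopyChars(const Source: string; var Dest: array of Char;
  CharCount: Integer);
begin
  if (CharCount < 0) or (CharCount > Length(Source)) or
     (CharCount > Length(Dest)) then
    raise EArgumentException.Create('CharCount out of range');
  if CharCount > 0 then
    // Move counts bytes, so scale the character count by SizeOf(Char).
    Move(PChar(Source)^, Dest[0], CharCount * SizeOf(Char));
end;

var
  Buffer: array[0..7] of Char;
  S: string;
begin
  // Zero-fill: the byte count is the element count times SizeOf(Char).
  FillChar(Buffer, Length(Buffer) * SizeOf(Char), 0);
  CopyChars('Unicode', Buffer, 7);
  S := Buffer;  // copies up to the null terminator left by FillChar
  Writeln(S);

  // StringOfChar is the portable way to fill with a non-zero character.
  S := StringOfChar('*', 5);
  Writeln(S);
end.
```

Because every byte count is derived from SizeOf(Char), the same code works whether Char is one byte or two.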

Set of Char Constructs

You may need to modify these constructs.

  • <Char> in <set of AnsiChar> -- code generation is correct (characters above #255 are never in the set). The compiler issues the warning "WideChar reduced to byte char in set expressions". Depending on your code, you can safely turn off the warning. Alternatively, use the CharInSet function.
  • <Char> in LeadBytes -- the global LeadBytes set is for MBCS ANSI locales. UTF-16 still has the notion of a "lead char" (#$D800 - #$DBFF are high surrogates, #$DC00 - #$DFFF are low surrogates). To handle both cases, use the overloaded function IsLeadChar. The ANSI version checks against LeadBytes. The WideChar version checks whether the character is a high or low surrogate.
  • Character classification -- use the TCharacter static class. The Character unit offers functions to classify characters: IsDigit, IsLetter, IsLetterOrDigit, IsSymbol, IsWhiteSpace, IsSurrogatePair, and so on. These are based on table data directly from Unicode.org.
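
As a minimal sketch of the first two items above, the following program uses CharInSet and the WideChar overload of IsLeadChar, both from SysUtils. CountDigits and CountSurrogates are hypothetical helpers introduced for illustration.

```delphi
program CharSetDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts ASCII digits. CharInSet replaces the
// expression "C in ['0'..'9']", which triggers warning W1050 for WideChar.
function CountDigits(const S: string): Integer;
var
  C: Char;
begin
  Result := 0;
  for C in S do
    if CharInSet(C, ['0'..'9']) then
      Inc(Result);
end;

// Hypothetical helper: counts elements that are one half of a surrogate
// pair, using the WideChar overload of IsLeadChar.
function CountSurrogates(const S: string): Integer;
var
  C: Char;
begin
  Result := 0;
  for C in S do
    if IsLeadChar(C) then
      Inc(Result);
end;

begin
  Writeln(CountDigits('Room 101'));
  // 'A' followed by the surrogate pair for U+1D11E
  Writeln(CountSurrogates('A' + #$D834#$DD1E));
end.
```

Characters above #255 are never reported as members of the set, which matches the behavior described for the in operator.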

Beware of these Constructs

You should examine the following problematic code constructs:

  • Casts that obscure the type:
    • AnsiString(Pointer(foo))
    • Review for correctness: what was intended?
  • Suspicious casts -- generate a warning:
    • PChar(<AnsiString var>)
    • PAnsiChar(<UnicodeString var>)
  • Directly constructing, manipulating, or accessing string internal structures. Some, such as AnsiString, have changed internally, so this is unsafe. Use the StringRefCount, StringCodePage, StringElementSize, and other functions to get string information.
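
Instead of reading the hidden header fields directly, the information functions named above can be called as in this sketch. It assumes a desktop compiler where AnsiString is available. The reported reference counts and the ANSI code page depend on the compiler and system settings, so no particular values are claimed for them.

```delphi
program StringInfoDemo;

{$APPTYPE CONSOLE}

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'Delphi';
  U := U + '!';  // concatenation with a variable yields a heap-allocated string
  A := 'Delphi';

  Writeln('UnicodeString element size: ', StringElementSize(U));
  Writeln('UnicodeString code page:    ', StringCodePage(U));
  Writeln('UnicodeString ref count:    ', StringRefCount(U));

  Writeln('AnsiString element size:    ', StringElementSize(A));
  Writeln('AnsiString code page:       ', StringCodePage(A));
  Writeln('AnsiString ref count:       ', StringRefCount(A));
end.
```

Using these functions keeps the code independent of the exact header layout, which has already changed once for AnsiString.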

Runtime Library

  • Overloads. For functions that took PChar, there are now PAnsiChar and PWideChar versions so the appropriate function gets called.
  • AnsiXXX functions are a consideration:
  • Write/Writeln and Read/Readln:
    • Continue to convert to/from ANSI/OEM code pages.
    • Console is mostly ANSI or OEM anyway.
    • Offer better compatibility with legacy applications.
    • TFDD (Text File Device Drivers):
    • Use TEncoding and TStrings for Unicode file I/O.
  • PByte - declared with $POINTERMATH ON. This allows array indexing and pointer math like PAnsiChar.
  • RTL provides helper functions that enable users to do explicit conversions between code pages and element size conversions. If developers are using the Move function on a character array, they cannot make assumptions about the element size. Much of this problem can be mitigated by making sure all RValue references generate the proper calls to RTL to ensure proper element sizes.

Components and Classes

  • TStrings: Store UnicodeString internally (remains declared as string).
  • TWideStrings (may get deprecated) is unchanged. Uses WideString (BSTR) internally.
  • TStringStream
    • Has been rewritten; it defaults to the default ANSI encoding for internal storage.
    • Encoding can be overridden.
    • Consider using TStringBuilder instead of TStringStream to construct a string from bits and pieces.
  • TEncoding
    • Defaults to users’ active code page.
    • Supports UTF-8.
    • Supports UTF-16, big and little endian.
    • Byte Order Mark (BOM) support.
    • You can create descendent classes for user-specific encodings.
  • Component streaming (Text DFM files):
    • Are fully backward-compatible.
    • Stream as UTF-8 only if component type, property, or name contains non-ASCII-7 characters.
    • String property values are still streamed in “#” escaped format.
    • May allow values as UTF-8 as well (open issue).
    • Only change in binary format is potential for UTF-8 data for component name, properties, and type name.
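
The TEncoding class mentioned above can be exercised with a short sketch that converts a string to bytes and back. It uses the TEncoding.UTF8 and TEncoding.Unicode (UTF-16 little endian) instances with their GetBytes and GetString methods. The sample text is an arbitrary choice for illustration.

```delphi
program EncodingDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Text, RoundTrip: string;
  Utf8Bytes, Utf16Bytes: TBytes;
begin
  // "caf" plus U+00E9, given by code point so that the source file
  // encoding does not matter.
  Text := 'caf' + #$00E9;

  Utf8Bytes := TEncoding.UTF8.GetBytes(Text);
  Utf16Bytes := TEncoding.Unicode.GetBytes(Text);
  Writeln('UTF-8 bytes:  ', Length(Utf8Bytes));
  Writeln('UTF-16 bytes: ', Length(Utf16Bytes));

  // Decoding with the same encoding restores the original text.
  RoundTrip := TEncoding.UTF8.GetString(Utf8Bytes);
  Writeln('Round trip preserved the text: ', RoundTrip = Text);
end.
```

The same TEncoding instances can be passed to the TStrings.SaveToFile and TStrings.LoadFromFile overloads that take an encoding parameter.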

Byte Order Mark

The Byte Order Mark (BOM) should be added to files to indicate their encoding:

  • UTF-8 uses EF BB BF.
  • UTF-16 Little Endian uses FF FE.
  • UTF-16 Big Endian uses FE FF.
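
A reader of such files might check the leading bytes along the lines of this sketch. DescribeBom is a hypothetical helper; it compares the start of a byte buffer against the three marks listed above. The preambles returned by TEncoding.GetPreamble are used as test input.

```delphi
program BomDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: names the encoding indicated by a leading BOM.
function DescribeBom(const Data: TBytes): string;
begin
  if (Length(Data) >= 3) and (Data[0] = $EF) and (Data[1] = $BB) and
     (Data[2] = $BF) then
    Result := 'UTF-8'
  else if (Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE) then
    Result := 'UTF-16 LE'
  else if (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF) then
    Result := 'UTF-16 BE'
  else
    Result := 'none';
end;

begin
  Writeln(DescribeBom(TBytes.Create($EF, $BB, $BF, $41)));
  Writeln(DescribeBom(TEncoding.Unicode.GetPreamble));
  Writeln(DescribeBom(TEncoding.BigEndianUnicode.GetPreamble));
  Writeln(DescribeBom(TBytes.Create($41, $42)));
end.
```

A file without a BOM yields 'none' here; its encoding then has to be known from context or assumed.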

Steps to Unicode-enable your applications

You need to perform these steps:

  1. Review char- and string-related functions.
  2. Rebuild the application.
  3. Review surrogate pairs.
  4. Review string payloads.

For more details, see Enabling Your Applications for Unicode.

New Delphi compiler warnings

New warnings have been added to the Delphi compiler related to possible errors in casting types (such as from a UnicodeString or a WideString down to an AnsiString or AnsiChar). When you are converting an application to Unicode, you should enable warnings 1057 and 1058 to assist in finding problem areas in your code.

  • W1050 WideChar reduced to byte char in set expressions (Delphi). "Set of char" in Win32 defines a set over the entire range of the Char type. Since Char is a byte-sized type in Win32, this defines a set of maximum size containing 256 elements. In .NET, Char is a word-sized type, and this range (0..65535) exceeds the capacity of the set type.
  • W1057 Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST) Emitted when the compiler detects a case where it must implicitly convert an AnsiString (or AnsiChar) to some form of Unicode (a UnicodeString or a WideString). (NOTE: This warning will eventually be enabled by default.)
  • W1058 Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where it must implicitly convert some form of Unicode (a UnicodeString or a WideString) down to an AnsiString (or AnsiChar). This is a potential lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will eventually be enabled by default.)
  • W1059 Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST) Emitted when the compiler detects a case where the programmer is explicitly casting an AnsiString (or AnsiChar) to some form of Unicode (UnicodeString or WideString). (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
  • W1060 Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where the programmer is explicitly casting some form of Unicode (UnicodeString or WideString) down to AnsiString (or AnsiChar). This is a potential lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
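
The warnings can also be switched on from source with the $WARN directive and the identifiers given in parentheses above. The following is a sketch for a desktop compiler; the comments mark where each warning would be expected, not verified compiler output.

```delphi
program WarningDemo;

{$APPTYPE CONSOLE}

{$WARN IMPLICIT_STRING_CAST ON}       // W1057
{$WARN IMPLICIT_STRING_CAST_LOSS ON}  // W1058

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'text';
  A := U;              // implicit Unicode -> ANSI: W1058 expected
  U := A;              // implicit ANSI -> Unicode: W1057 expected
  A := AnsiString(U);  // explicit cast: reported only if W1060 is enabled
  Writeln(U, ' ', A);
end.
```

Replacing ON with ERROR in a $WARN directive turns the corresponding warning into a compile error, which can help during a migration.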

Recommendations

  • Keep source files in UTF-8 format:
    • Files can remain ANSI as long as the source is compiled with the correct code page. Select Project > Options > C++ Compiler > Advanced and use the "Code page" option under Other Options to set the correct code page.
    • Write a UTF-8 BOM to source file. Make sure your source control management system supports these files (most do).
  • Perform IDE refactoring when code must be AnsiString or AnsiChar (code is still portable).
  • Static code review:
    • Is code merely passing the data along?
    • Is code doing simple character indexing?
  • Heed all warnings (elevate to errors):
    • Suspicious pointer casts.
    • Implicit/Explicit casts (coming).
  • Determine code intent
    • Is code using a string (AnsiString) as a dynamic array of bytes? If so, use the portable TBytes type (array of Byte) instead.
    • Is a PChar cast used to enable pointer arithmetic? If so, cast to PByte instead and turn $POINTERMATH ON.
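
The last two recommendations might be applied as in this sketch, which stores raw data in a TBytes array and walks it through a PByte pointer. SumBytes is a hypothetical helper for illustration.

```delphi
program BytesDemo;

{$APPTYPE CONSOLE}

{$POINTERMATH ON}

uses
  System.SysUtils;

// Hypothetical helper: sums a byte buffer using PByte indexing instead of
// PChar arithmetic, so the step is always exactly one byte.
function SumBytes(const Data: TBytes): Integer;
var
  P: PByte;
  I: Integer;
begin
  Result := 0;
  if Length(Data) = 0 then
    Exit;
  P := @Data[0];
  for I := 0 to Length(Data) - 1 do
    Inc(Result, P[I]);  // array-style indexing on a PByte
end;

var
  Data: TBytes;
begin
  Data := TBytes.Create(1, 2, 3);
  Writeln(SumBytes(Data));

  // Text that must travel as bytes is converted explicitly.
  Data := TEncoding.UTF8.GetBytes('abc');
  Writeln(SumBytes(Data));
end.
```

Keeping byte data in TBytes rather than in a string avoids any implicit code page conversion when the data is assigned or passed around.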
