Unicode in RAD Studio
RAD Studio uses Unicode-based strings: that is, the type string is a Unicode string (System.UnicodeString) instead of an ANSI string. This topic describes what you need to know to handle strings properly.
If you want to use ANSI strings or wide strings, use the AnsiString and WideString types.
RAD Studio is fully Unicode-compliant, and some changes might be required to those parts of your code that involve string handling. However, every effort has been made to keep these changes to a minimum. Although new data types are introduced, existing data types remain and function as they always have. Based on the in-house experience of Unicode conversion, existing developer applications should migrate fairly smoothly.
For additional resources:
- Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, by Cary Jensen
- Unicode Migration Resources for Delphi, C++Builder and RAD Studio
Contents
- 1 Existing String Types
- 2 New String Type: UnicodeString
- 3 Compiler Conditionals
- 4 Summary of Changes
- 5 Summary of What Has Not Changed
- 6 Code Constructs Independent of Character Size
- 7 Code Constructs that Depend on Character Size
- 8 Set of Char Constructs
- 9 Beware of these Constructs
- 10 Runtime Library
- 11 Components and Classes
- 12 Byte Order Mark
- 13 Steps to Unicode-enable your applications
- 14 New Delphi compiler warnings
- 15 Recommendations
- 16 See Also
Existing String Types
The pre-existing data types AnsiString and System.WideString function the same way as before.
Short strings also function the same as before. Note that short strings are limited to 255 characters and contain only a character count and single-byte character data. They do not contain code page information. A short string could contain UTF-8 data for a particular application, but this is not generally true.
AnsiString
Previously, string was an alias for AnsiString. This table shows the location of the fields in AnsiString's previous format:
Previous format of AnsiString Data Type
| Reference Count | Length | String Data (Byte sized) | Null Term |
|---|---|---|---|
| -8 | -4 | 0 | Length |
For RAD Studio, the format of AnsiString has changed. Two new fields (CodePage and ElemSize) have been added. This makes the format of AnsiString identical to that of the new UnicodeString type. (See Long String Types for more information about the new format.)
WideString
System.WideString was previously used for Unicode character data. Its format is essentially the same as the Windows BSTR. WideString is still appropriate for use in COM applications.
New String Type: UnicodeString
The type of string in RAD Studio is the UnicodeString type.
For Delphi, Char and PChar types are now WideChar and PWideChar, respectively.
Previously, string was an alias for AnsiString, and the Char and PChar types were AnsiChar and PAnsiChar, respectively.
For C++, the "_TCHAR maps to" option controls the floating definition of _TCHAR, which can be either wchar_t or char.
RAD Studio frameworks and libraries use the UnicodeString type; they do not represent string values as single-byte or MBCS strings.
Format of UnicodeString Data Type
| CodePage | Element Size | Reference Count | Length | String Data (element sized) | Null Term |
|---|---|---|---|---|---|
| -12 | -10 | -8 | -4 | 0 | Length * elementsize |
UnicodeString may be represented as the following Delphi structure:
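type
  StrRec = record
    CodePage: Word;
    ElemSize: Word;
    refCount: Integer;
    Len: Integer;
    // Variant part: the same character data viewed as byte-sized or
    // word-sized elements. The field names here are illustrative only.
    case Integer of
      1: (AnsiData: array[0..0] of AnsiChar);
      2: (WideData: array[0..0] of WideChar);
  end;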
UnicodeString adds the CodePage code page and ElemSize element size fields that describe the string contents. UnicodeString is assignment-compatible with all other string types. However, assignments between AnsiString and UnicodeString still do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.
Note that AnsiString also has CodePage and ElemSize fields.
UnicodeString data is in UTF-16 for the following reasons:
- UTF-16 matches the underlying operating system format.
- UTF-16 reduces extra explicit/implicit conversions.
- It offers better performance when calling the Windows API.
- There is no need to have the operating system do any conversions with UTF-16.
- The Basic Multilingual Plane (BMP) already contains the vast majority of the world's active language glyphs and fits in a single UTF-16 Char (16 bits).
- Unicode surrogate pairs are analogous to the multibyte character set (MBCS), but more predictable and standard.
- UnicodeString can provide lossless implicit conversions to and from WideString for marshaling COM interfaces.
Characters in UTF-16 may be 2 or 4 bytes, so the number of elements in a string is not necessarily equal to the number of characters. If the string has only BMP characters, the number of characters and elements are equal.
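The difference between elements and characters can be illustrated with a minimal sketch. The CodePointCount function below is a hypothetical helper written for this example, not an RTL routine. It assumes 1-based string indexing, as described in this topic, and treats each well-formed surrogate pair as one character.

```delphi
program CodePointCountDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts Unicode code points in a UTF-16 string
// by treating each high/low surrogate pair as a single character.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Length(S)) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)  // high + low surrogate: one character, two elements
    else
      Inc(I);    // BMP character (or an unpaired surrogate): one element
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // 'A' followed by U+1F600, written as its surrogate pair D83D DE00.
  S := 'A' + #$D83D + #$DE00;
  Writeln(Length(S));          // 3 elements
  Writeln(CodePointCount(S));  // 2 characters
end.
```

For a string that contains only BMP characters, both calls return the same value.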
UnicodeString offers the following benefits:
- It is reference-counted.
- It solves a legacy application problem in C++Builder.
- Allowing AnsiString to carry encoding information (code page) reduces the potential data loss problem with implicit casts.
- The compiler ensures the data is correct before mutating data.
WideString is not reference-counted, and so UnicodeString is more flexible and efficient in most types of applications (WideString is more appropriate for COM).
Indexing
Instances of UnicodeString can index characters. Indexing is 1-based, just as for AnsiString. Consider the following code:
var
  C: Char;
  S: string;
begin
  ...
  C := S[1];
  ...
end;
In a case such as shown above, the compiler needs to ensure that data in S is in the proper format. The compiler generates code to ensure that assignments to string elements are the proper type and that the instance is unique (that is, has a reference count of one) via a call to a UniqueString function. For the above code, since the string could contain Unicode data, the compiler needs to also call the appropriate UniqueString function before indexing into the character array.
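The uniqueness guarantee is easiest to observe when writing to an element of a string that shares its data with another variable. The following is a minimal sketch; the variable names are chosen for illustration.

```delphi
program CopyOnWriteDemo;

{$APPTYPE CONSOLE}

var
  A, B: string;
begin
  A := 'hello';
  B := A;        // B refers to the same character data as A; nothing is copied yet
  B[1] := 'J';   // the compiler makes B unique before the element is written
  Writeln(A);    // hello
  Writeln(B);    // Jello
end.
```

Because the write goes to a private copy, A keeps its original contents.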
Compiler Conditionals
In both Delphi and C++Builder, you can use conditionals to allow both Unicode and non-Unicode code in the same source.
Delphi
{$IFDEF UNICODE}
C++Builder
#ifdef _DELPHI_STRING_UNICODE
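As a minimal Delphi sketch of the conditional above, the following program compiles one of two branches depending on whether the UNICODE symbol is defined. The program name and messages are illustrative.

```delphi
program UnicodeConditionalDemo;

{$APPTYPE CONSOLE}

begin
{$IFDEF UNICODE}
  // Compiled when string maps to UnicodeString.
  Writeln('string is UnicodeString, SizeOf(Char) = ', SizeOf(Char));
{$ELSE}
  // Compiled by older, non-Unicode compilers.
  Writeln('string is AnsiString, SizeOf(Char) = ', SizeOf(Char));
{$ENDIF}
end.
```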
Summary of Changes
- string now maps to UnicodeString, not to AnsiString.
- Char now maps to WideChar (2 bytes, not 1 byte) and is a UTF-16 character.
- PChar now maps to PWideChar.
- In C++, System::String now maps to the UnicodeString class.
Summary of What Has Not Changed
- AnsiString.
- WideString.
- AnsiChar, PAnsiChar.
- WideChar, PWideChar.
- Implicit conversions still work.
- AnsiString uses the user's active code page.
Code Constructs Independent of Character Size
The following operations do not depend on character size:
- String concatenation:
  - <string var> + <string var>
  - <string var> + <literal>
  - <literal> + <literal>
  - Concat(<string> , <string>)
- Standard string functions:
  - Length(<string>) returns the number of char elements, which might not be the same as the number of bytes. Note that the SizeOf function returns the number of bytes, which means that the return value for SizeOf might differ from Length.
  - Copy(<string>, <start>, <length>) returns a substring in Char elements.
  - Pos(<substr>, <string>) returns the index of the first Char element.
- Operators:
  - <string> <comparison_operator> <string>
  - CompareStr()
  - CompareText()
  - ...
- FillChar(<struct or memory>)
  - FillChar(Rect, SizeOf(Rect), #0)
  - FillChar(WndClassEx, SizeOf(TWndClassEx), #0). Note that WndClassEx.cbSize := SizeOf(TWndClassEx);
- Windows API
  - API calls default to their WideString ("W") versions.
  - The PChar(<string>) cast has identical semantics.
GetModuleFileName example:
function ModuleFileName(Handle: HMODULE): string;
var
  Buffer: array[0..MAX_PATH] of Char;
begin
  SetString(Result, Buffer,
    GetModuleFileName(Handle, Buffer, Length(Buffer)));
end;
GetWindowText example:
function WindowCaption(Handle: HWND): string;
begin
  SetLength(Result, 1024);
  SetLength(Result,
    GetWindowText(Handle, PChar(Result), Length(Result)));
end;
String character indexing example:
function StripHotKeys(const S: string): string;
var
  I, J: Integer;
  LastChar: Char;
begin
  SetLength(Result, Length(S));
  J := 0;
  LastChar := #0;
  for I := 1 to Length(S) do
  begin
    if (S[I] <> '&') or (LastChar = '&') then
    begin
      Inc(J);
      Result[J] := S[I];
    end;
    LastChar := S[I];
  end;
  SetLength(Result, J);
end;
Code Constructs that Depend on Character Size
Some operations depend on character size. The functions and features in the following list also include a "portable" version, when possible. You can similarly rewrite your code to be portable, that is, the code works with both AnsiString and UnicodeString variables.
- SizeOf(<Char array>) -- use the portable Length(<Char array>).
- Move(<Char buffer>... CharCount) -- use the portable Move(<Char buffer> ... CharCount * SizeOf(Char)).
- Stream Read/Write -- use the portable AnsiString, SizeOf(Char), or the TEncoding class.
- FillChar(<Char array>, <size>, <AnsiChar>) -- use *SizeOf(Char) if filling with #0, or use the portable StringOfChar function.
- GetProcAddress(<module>, <PAnsiChar>) -- use the provided overload function taking a PWideChar.
- Casting or using PChar to do pointer arithmetic -- Place {IFDEF PByte = PChar} at the top of the file if you use PChar for pointer arithmetic. Or use the {$POINTERMATH <ON|OFF>} Delphi compiler directive to turn on pointer arithmetic for all typed pointers, so that increment/decrement is by element size.
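The byte-count versus element-count distinction in the list above can be sketched as follows. This is a minimal example under the assumption that Char is WideChar; the buffer size and variable names are illustrative.

```delphi
program PortableBufferDemo;

{$APPTYPE CONSOLE}

var
  S, T: string;
  Buffer: array[0..15] of Char;
begin
  S := 'Unicode';
  // FillChar counts bytes, so SizeOf(Buffer) is the correct size here.
  FillChar(Buffer, SizeOf(Buffer), 0);
  // Move also counts bytes: scale the element count by SizeOf(Char).
  Move(S[1], Buffer, Length(S) * SizeOf(Char));
  // Length gives the element count; SizeOf gives the byte count
  // (twice the element count when Char is WideChar).
  Writeln(Length(Buffer), ' elements, ', SizeOf(Buffer), ' bytes');
  // Copy the characters back out of the buffer into a string.
  SetString(T, Buffer, Length(S));
  Writeln(T);
end.
```

Written this way, the same source remains correct whether Char is one or two bytes wide.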
Set of Char Constructs
You may need to modify these constructs.
- <Char> in <set of AnsiChar> -- code generation is correct (>#255 characters are never in the set). The compiler warns WideChar reduced in set operations. Depending on your code, you can safely turn off the warning. Alternatively, use the CharInSet function.
- <Char> in LeadBytes -- the global LeadBytes set is for MBCS ANSI locales. UTF-16 still has the notion of a "lead char" (#$D800 - #$DBFF are high surrogate, #$DC00 - #$DFFF are low surrogate). To change this, use the overloaded function IsLeadChar. The ANSI version checks against LeadBytes. The WideChar version checks if it is a high/low surrogate.
- Character classification -- use the TCharacter static class. The Character unit offers functions to classify characters: IsDigit, IsLetter, IsLetterOrDigit, IsSymbol, IsWhiteSpace, IsSurrogatePair, and so on. These are based on table data directly from Unicode.org.
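As a minimal sketch of the CharInSet alternative mentioned above, the following program counts ASCII digits without using the in operator on a Char. CountDigits is a hypothetical helper written for this example; CharInSet is assumed to come from the SysUtils unit.

```delphi
program CharInSetDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts the ASCII digits in S.
function CountDigits(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    // CharInSet replaces "S[I] in ['0'..'9']", which raises the
    // "WideChar reduced" warning when Char is WideChar.
    if CharInSet(S[I], ['0'..'9']) then
      Inc(Result);
end;

begin
  Writeln(CountDigits('Room 101, floor 3'));  // 4
end.
```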
Beware of these Constructs
You should examine the following problematic code constructs:
- Casts that obscure the type:
  - AnsiString(Pointer(foo))
  - Review for correctness: what was intended?
- Suspicious casts -- generate a warning:
  - PChar(<AnsiString var>)
  - PAnsiChar(<UnicodeString var>)
- Directly constructing, manipulating, or accessing string internal structures. Some, such as AnsiString, have changed internally, so this is unsafe. Use the StringRefCount, StringCodePage, StringElementSize, and other functions to get string information.
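Rather than reading the string header directly, the information functions named above can be called on any string. The following is a minimal sketch; the code page values it prints depend on the system and are therefore not listed.

```delphi
program StringInfoDemo;

{$APPTYPE CONSOLE}

var
  U: string;
  A: AnsiString;
begin
  U := 'abc';
  A := 'abc';
  Writeln(StringElementSize(U));  // 2: UTF-16 elements
  Writeln(StringElementSize(A));  // 1: byte-sized elements
  Writeln(StringCodePage(U));     // code page recorded for the UnicodeString
  Writeln(StringCodePage(A));     // code page recorded for the AnsiString
end.
```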
Runtime Library
- Overloads. For functions that took PChar, there are now PAnsiChar and PWideChar versions so the appropriate function gets called.
- AnsiXXX functions are a consideration:
  - SysUtils.AnsiXXXX functions, such as AnsiCompareStr:
    - Remain declared with string and float to UnicodeString.
    - Offer better backward compatibility (no need to change code).
  - The AnsiStrings unit's AnsiXXXX functions offer the same capabilities as the SysUtils.AnsiXXXX functions, but work only for AnsiString. Also, the AnsiStrings.AnsiXXXX functions provide better performance for an AnsiString than SysUtils.AnsiXXXX functions, which work for both AnsiString and UnicodeString, because no implicit conversions are performed.
- Write/Writeln and Read/Readln.
- PByte -- declared with $POINTERMATH ON. This allows array indexing and pointer math like PAnsiChar.
- String information functions:
  - StringElementSize returns the actual data size.
  - StringCodePage returns the code page of string data.
  - System.StringRefCount returns the reference count.
- RTL provides helper functions that enable users to do explicit conversions between code pages and element size conversions. If developers are using the Move function on a character array, they cannot make assumptions about the element size. Much of this problem can be mitigated by making sure all RValue references generate the proper calls to RTL to ensure proper element sizes.
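One way to perform an explicit conversion between element sizes is to cast between string and UTF8String. The following is a minimal sketch of that round trip; it is only one of several possible approaches, and the sample text is illustrative.

```delphi
program Utf8RoundTripDemo;

{$APPTYPE CONSOLE}

var
  S, Back: string;
  U8: UTF8String;
begin
  S := 'caf' + #$00E9;   // "cafe" with an acute accent, written as a character code
  U8 := UTF8String(S);   // explicit conversion from UTF-16 to UTF-8
  Back := string(U8);    // explicit conversion back to UTF-16
  Writeln(Length(S));    // 4 (two-byte elements)
  Writeln(Length(U8));   // 5 (one-byte elements; the accented letter takes two)
  Writeln(Back = S);     // TRUE
end.
```

The differing lengths show why code that moves character data must not assume a fixed element size.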
Components and Classes
- TStrings: Store UnicodeString internally (remains declared as string).
- TWideStrings (may get deprecated) is unchanged. Uses WideString (BSTR) internally.
- TStringStream
  - Has been rewritten -- defaults to the default ANSI encoding for internal storage.
  - Encoding can be overridden.
  - Consider using TStringBuilder instead of TStringStream to construct a string from bits and pieces.
- TEncoding
  - Defaults to users’ active code page.
  - Supports UTF-8.
  - Supports UTF-16, big and little endian.
  - Byte Order Mark (BOM) support.
  - You can create descendent classes for user-specific encodings.
- Component streaming (Text DFM files):
  - Are fully backward-compatible.
  - Stream as UTF-8 only if component type, property, or name contains non-ASCII-7 characters.
  - String property values are still streamed in “#” escaped format.
  - May allow values as UTF-8 as well (open issue).
  - Only change in binary format is potential for UTF-8 data for component name, properties, and type name.
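The TEncoding support listed above can be sketched with the standard UTF-8 and UTF-16 instances. This minimal example assumes that TEncoding and TBytes are available from the SysUtils unit; the sample text is illustrative.

```delphi
program EncodingDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  Bytes: TBytes;
begin
  S := #$00FC + 'ber';  // u with diaeresis followed by "ber"
  // Encode to UTF-8: the first letter needs two bytes, the rest one each.
  Bytes := TEncoding.UTF8.GetBytes(S);
  Writeln(Length(Bytes));                        // 5
  Writeln(TEncoding.UTF8.GetString(Bytes) = S);  // TRUE
  // Encode to UTF-16 little endian: two bytes per element.
  Bytes := TEncoding.Unicode.GetBytes(S);
  Writeln(Length(Bytes));                        // 8
end.
```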
Byte Order Mark
The Byte Order Mark (BOM) should be added to files to indicate their encoding:
- UTF-8 uses EF BB BF.
- UTF-16 Little Endian uses FF FE.
- UTF-16 Big Endian uses FE FF.
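The byte sequences above can be checked against what TEncoding reports as its preamble. The following is a minimal sketch; ShowPreamble is a hypothetical helper written for this example.

```delphi
program PreambleDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the BOM bytes of an encoding in hexadecimal.
procedure ShowPreamble(const Name: string; Encoding: TEncoding);
var
  Bom: TBytes;
  I: Integer;
begin
  Bom := Encoding.GetPreamble;
  Write(Name, ':');
  for I := 0 to Length(Bom) - 1 do
    Write(' ', IntToHex(Bom[I], 2));
  Writeln;
end;

begin
  ShowPreamble('UTF-8', TEncoding.UTF8);                  // EF BB BF
  ShowPreamble('UTF-16 LE', TEncoding.Unicode);           // FF FE
  ShowPreamble('UTF-16 BE', TEncoding.BigEndianUnicode);  // FE FF
end.
```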
Steps to Unicode-enable your applications
You need to perform these steps:
- Review char- and string-related functions.
- Rebuild the application.
- Review surrogate pairs.
- Review string payloads.
For more details, see Enabling Your Applications for Unicode.
New Delphi compiler warnings
New warnings have been added to the Delphi compiler related to possible errors in casting types (such as from a UnicodeString or a WideString down to an AnsiString or AnsiChar). When you are converting an application to Unicode, you should enable warnings 1057 and 1058 to assist in finding problem areas in your code.
- W1050 WideChar reduced to byte char in set expressions (Delphi). "Set of char" in Win32 defines a set over the entire range of the Char type. Since Char is a byte-sized type in Win32, this defines a set of maximum size containing 256 elements. In .NET, Char is a word-sized type, and this range (0..65535) exceeds the capacity of the set type.
- W1057 Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST) Emitted when the compiler detects a case where it must implicitly convert an AnsiString (or AnsiChar) to some form of Unicode (a UnicodeString or a WideString). (NOTE: This warning will eventually be enabled by default.)
- W1058 Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where it must implicitly convert some form of Unicode (a UnicodeString or a WideString) down to an AnsiString (or AnsiChar). This is a potentially lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will eventually be enabled by default.)
- W1059 Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST) Emitted when the compiler detects a case where the programmer is explicitly casting an AnsiString (or AnsiChar) to some form of Unicode (UnicodeString or WideString). (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
- W1060 Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS) Emitted when the compiler detects a case where the programmer is explicitly casting some form of Unicode (UnicodeString or WideString) down to AnsiString (or AnsiChar). This is a potentially lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
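The implicit-cast warnings can be switched on per unit with the $WARN directive and the identifiers given above. The following minimal sketch marks the assignments that would be expected to trigger W1057 and W1058; the variable names are illustrative.

```delphi
program StringCastWarningsDemo;

{$APPTYPE CONSOLE}
{$WARN IMPLICIT_STRING_CAST ON}
{$WARN IMPLICIT_STRING_CAST_LOSS ON}

var
  U: string;
  A: AnsiString;
begin
  A := 'abc';
  U := A;              // W1057: implicit cast from AnsiString to string
  A := U;              // W1058: implicit cast with potential data loss
  A := AnsiString(U);  // explicit cast: reported only if W1060 is enabled
  Writeln(U, ' ', A);
end.
```

Making the cast explicit documents that the down-conversion is intended, but it does not remove the risk of data loss.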
Recommendations
- Keep source files in UTF-8 format:
  - Files can remain ANSI as long as the source is compiled with the correct code page. Select Project > Options > C++ Compiler > Advanced and use the "Code page" option under Other Options to set the correct code page.
  - Write a UTF-8 BOM to source file. Make sure your source control management system supports these files (most do).
- Perform IDE refactoring when code must be AnsiString or AnsiChar (code is still portable).
- Static code review:
  - Is code merely passing the data along?
  - Is code doing simple character indexing?
- Heed all warnings (elevate to errors):
  - Suspicious pointer casts.
  - Implicit/Explicit casts (coming).
- Determine code intent
  - Is code using a string (AnsiString) as a dynamic array of bytes? If so, use the portable TBytes type (array of Byte) instead.
  - Is a PChar cast used to enable pointer arithmetic? If so, cast to PByte instead and turn $POINTERMATH ON.
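The last two recommendations can be sketched together: a TBytes buffer holds raw bytes, and a PByte with $POINTERMATH ON indexes into it. This is a minimal example with illustrative data, assuming TBytes comes from the SysUtils unit.

```delphi
program ByteBufferDemo;

{$APPTYPE CONSOLE}
{$POINTERMATH ON}

uses
  SysUtils;

var
  Data: TBytes;
  P: PByte;
  I, Sum: Integer;
begin
  // Raw bytes live in TBytes, not in a string type.
  Data := TBytes.Create(1, 2, 3, 4);
  // PByte, not PChar, is used for byte-wise pointer access.
  P := @Data[0];
  Sum := 0;
  for I := 0 to Length(Data) - 1 do
    Inc(Sum, P[I]);
  Writeln(Sum);  // 10
end.
```

Because neither type depends on SizeOf(Char), the code behaves the same before and after a Unicode migration.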
See Also
- UTF-8 Conversion Routines
- Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, by Cary Jensen
- Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?
- Delphi in a Unicode World Part II: New RTL Features and Classes to Support Unicode
- RAD Studio 2010 Migration Center
- Enabling Applications for Unicode
- Enabling C++ Applications for Unicode
- _TCHAR Mapping (C++)
- Using Unicode in the Command Console
- Using TEncoding for Unicode Files
- How to Handle Delphi AnsiString Code Page Specification in C++
- System.UnicodeString
- System.AnsiString