Unicode in RAD Studio
RAD Studio uses Unicode-based strings: that is, the type string is a Unicode string (System.UnicodeString) instead of an ANSI string. This topic describes what you need to know to handle strings properly.
If you want to use ANSI strings or wide strings, use the AnsiString and WideString types.
RAD Studio is fully Unicode-compliant, and some changes might be required to those parts of your code that involve string handling. However, every effort has been made to keep these changes to a minimum. Although new data types are introduced, existing data types remain and function as they always have. Based on the in-house experience of Unicode conversion, existing developer applications should migrate fairly smoothly.
For additional resources:
- Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, by Cary Jensen
- Migration Center
Contents
- 1 Existing String Types
- 2 New String Type: UnicodeString
- 3 Compiler Conditionals
- 4 Summary of Changes
- 5 Summary of What Has Not Changed
- 6 Code Constructs Independent of Character Size
- 7 Code Constructs that Depend on Character Size
- 8 Set of Char Constructs
- 9 Beware of these Constructs
- 10 Runtime Library
- 11 Components and Classes
- 12 Byte Order Mark
- 13 Steps to Unicode-enable your applications
- 14 New Delphi compiler warnings
- 15 Recommendations
- 16 See Also
Existing String Types
The pre-existing data types AnsiString and System.WideString function the same way as before.
Short strings also function the same as before. Note that short strings are limited to 255 characters and contain only a character count and single-byte character data. They do not contain code page information. A short string could contain UTF-8 data for a particular application, but this is not generally true.
AnsiString
Previously, string was an alias for AnsiString. This table shows the location of the fields in AnsiString's previous format:
Previous format of the AnsiString data type (byte offsets from the string pointer)
Reference Count | Length | String Data (Byte sized) | Null Term |
---|---|---|---|
-8 | -4 | 0 | Length |
For RAD Studio, the format of AnsiString has changed. Two new fields (CodePage and ElemSize) have been added. This makes the format of AnsiString identical to that of the new UnicodeString type. (See Long String Types for more information about the new format.)
WideString
System.WideString was previously used for Unicode character data. Its format is essentially the same as the Windows BSTR. WideString is still appropriate for use in COM applications.
New String Type: UnicodeString
The type of string in RAD Studio is the UnicodeString type.
For Delphi, Char and PChar types are now WideChar and PWideChar, respectively.
Note: This differs from versions prior to 2009, in which string was an alias for AnsiString, and the Char and PChar types were AnsiChar and PAnsiChar, respectively.
For C++, the "_TCHAR maps to" option controls the floating definition of _TCHAR, which can be either wchar_t or char.
RAD Studio frameworks and libraries use the UnicodeString type; they do not represent string values as single-byte or MBCS strings.
Format of the UnicodeString data type (byte offsets from the string pointer)
CodePage | Element Size | Reference Count | Length | String Data (element sized) | Null Term |
---|---|---|---|---|---|
-12 | -10 | -8 | -4 | 0 | Length * elementsize |
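UnicodeString may be represented as the following Delphi structure:

    type
      StrRec = record
        CodePage: Word;    // code page of the string data
        ElemSize: Word;    // size of one element, in bytes
        refCount: Integer; // reference count
        Len: Integer;      // length, in elements
        case Integer of
          // The variant field names below are illustrative only.
          1: (AnsiData: array[0..0] of AnsiChar);
          2: (WideData: array[0..0] of WideChar);
      end;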
UnicodeString adds the CodePage (code page) and ElemSize (element size) fields that describe the string contents. UnicodeString is assignment-compatible with all other string types. However, assignments between AnsiString and UnicodeString still do the appropriate up or down conversions. Note that assigning a UnicodeString type to an AnsiString type is not recommended and can result in data loss.
Note that AnsiString also has CodePage and ElemSize fields.
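The up and down conversions described above can be observed with a short console program. This is a minimal sketch, assuming a Windows desktop compiler from XE2 or later (for the System.SysUtils unit scope name); the type alias TLatin1String is hypothetical and exists only for this illustration.

```delphi
program AnsiUnicodeAssign;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical alias: an AnsiString bound to code page 1252 (Western European).
  TLatin1String = type AnsiString(1252);

var
  U, RoundTrip: string;   // string = UnicodeString
  A: TLatin1String;
begin
  // "cafe" with an accented e (U+00E9), a space, and a CJK character (U+4E2D).
  U := 'caf' + #$00E9 + ' ' + #$4E2D;

  // Down-conversion: U+00E9 exists in code page 1252, U+4E2D does not,
  // so the CJK character is replaced by a substitute character.
  A := TLatin1String(U);

  // Up-conversion back to UTF-16 cannot restore the lost character.
  RoundTrip := string(A);

  Writeln('Elements in U:          ', Length(U));
  Writeln('Bytes in A:             ', Length(A));
  Writeln('Round trip is lossless: ', RoundTrip = U);
end.
```

The explicit casts keep the conversions visible in the source; writing the same assignments without casts compiles as well, but hides the point where data can be lost.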
UnicodeString data is in UTF-16 for the following reasons:
- UTF-16 matches the underlying operating system format.
- UTF-16 reduces extra explicit/implicit conversions.
- It offers better performance when calling the Windows API.
- There is no need to have the operating system do any conversions with UTF-16.
- The Basic Multilingual Plane (BMP) already contains the vast majority of the world's active language glyphs and fits in a single UTF-16 Char (16 bits).
- Unicode surrogate pairs are analogous to the multibyte character set (MBCS), but more predictable and standard.
- UnicodeString can provide lossless implicit conversions to and from WideString for marshaling COM interfaces.
Characters in UTF-16 may be 2 or 4 bytes, so the number of elements in a string is not necessarily equal to the number of characters. If the string has only BMP characters, the number of characters and elements are equal.
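The difference between elements and characters can be illustrated with a string that contains one surrogate pair. The sketch below assumes 1-based string indexing, as described in this topic; CodePointCount is a hypothetical helper written for this example, not an RTL routine.

```delphi
program CountCodePoints;
{$APPTYPE CONSOLE}

// Counts Unicode code points in a UTF-16 string by skipping the low
// (trailing) half of each surrogate pair. Hypothetical helper.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (S[I] < #$DC00) or (S[I] > #$DFFF) then
      Inc(Result);
end;

var
  S: string;
begin
  // 'A' followed by U+1D11E (musical symbol G clef), which UTF-16
  // encodes as the surrogate pair D834 DD1E.
  S := 'A' + #$D834 + #$DD1E;
  Writeln('Elements (Length): ', Length(S));          // 3
  Writeln('Code points:       ', CodePointCount(S));  // 2
end.
```

The helper assumes well-formed UTF-16; an unpaired low surrogate would be skipped rather than reported.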
UnicodeString offers the following benefits:
- It is reference-counted.
- It solves a legacy application problem in C++Builder.
- Allowing AnsiString to carry encoding information (code page) reduces the potential data loss problem with implicit casts.
- The compiler ensures the data is correct before mutating data.
WideString is not reference-counted, and so UnicodeString is more flexible and efficient in most types of applications (WideString is more appropriate for COM).
Indexing
Instances of UnicodeString can index characters. Indexing is 1-based, just as for AnsiString. Consider the following code:

    var
      C: Char;
      S: string;
    begin
      ...
      C := S[1];
      ...
    end;
In a case such as shown above, the compiler needs to ensure that data in S is in the proper format. The compiler generates code to ensure that assignments to string elements are the proper type and that the instance is unique (that is, has a reference count of one) via a call to a UniqueString function. For the above code, since the string could contain Unicode data, the compiler needs to also call the appropriate UniqueString function before indexing into the character array.
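The uniqueness guarantee is what makes writing through an index safe when two variables share the same string data. The following sketch shows the visible effect; the variable names are arbitrary, and the reference counts are printed rather than assumed, because they depend on how the string was allocated.

```delphi
program CopyOnWriteDemo;
{$APPTYPE CONSOLE}

var
  A, B: string;
begin
  A := 'Unicode';
  UniqueString(A);   // give A its own heap copy instead of the compiled-in constant
  B := A;            // B now shares A's character data
  Writeln('RefCount after assignment: ', StringRefCount(A));

  B[1] := 'u';       // the compiler makes B unique before the write
  Writeln('A = ', A);   // Unicode
  Writeln('B = ', B);   // unicode
  Writeln('RefCount after write:      ', StringRefCount(A));
end.
```

Only the element write triggers the copy; reading `B[1]` would leave both variables pointing at the same data.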
Compiler Conditionals
In both Delphi and C++Builder, you can use conditionals to allow both Unicode and non-Unicode code in the same source.
Delphi
{$IFDEF UNICODE}
C++Builder
#ifdef _DELPHI_STRING_UNICODE
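As a minimal sketch of how the Delphi conditional might be used, the program below selects one of two branches at compile time. The messages are illustrative only.

```delphi
program UnicodeConditional;
{$APPTYPE CONSOLE}

begin
{$IFDEF UNICODE}
  // Compiled when string maps to UnicodeString.
  Writeln('string is UnicodeString; SizeOf(Char) = ', SizeOf(Char));
{$ELSE}
  // Compiled by pre-Unicode compilers, where string maps to AnsiString.
  Writeln('string is AnsiString; SizeOf(Char) = ', SizeOf(Char));
{$ENDIF}
end.
```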
Summary of Changes
- string now maps to UnicodeString, not to AnsiString.
- Char now maps to WideChar (2 bytes, not 1 byte) and is a UTF-16 character.
- PChar now maps to PWideChar.
- In C++, System::String now maps to the UnicodeString class.
Summary of What Has Not Changed
- AnsiString.
- WideString.
- AnsiChar, PAnsiChar.
- WideChar, PWideChar.
- Implicit conversions still work.
- AnsiString uses the user's active code page.
Code Constructs Independent of Character Size
The following operations do not depend on character size:
- String concatenation:
  - <string var> + <string var>
  - <string var> + <literal>
  - <literal> + <literal>
  - Concat(<string>, <string>)
- Standard string functions:
  - Length(<string>) returns the number of Char elements, which might not be the same as the number of bytes. Note that the SizeOf function returns a size in bytes, so its return value might differ from Length.
  - Copy(<string>, <start>, <length>) returns a substring in Char elements.
  - Pos(<substr>, <string>) returns the index of the first Char element.
- Operators:
  - <string> <comparison_operator> <string>
  - CompareStr()
  - CompareText()
  - ...
- FillChar(<struct or memory>)
  - FillChar(Rect, SizeOf(Rect), #0)
  - FillChar(WndClassEx, SizeOf(TWndClassEx), #0). Note that WndClassEx.cbSize := SizeOf(TWndClassEx);
- Windows API
  - API calls default to their WideString ("W") versions.
  - The PChar(<string>) cast has identical semantics.
GetModuleFileName example:

    function ModuleFileName(Handle: HMODULE): string;
    var
      Buffer: array[0..MAX_PATH] of Char;
    begin
      // Length(Buffer) is the element count, which is what the API expects.
      SetString(Result, Buffer, GetModuleFileName(Handle, Buffer, Length(Buffer)));
    end;
GetWindowText example:

    function WindowCaption(Handle: HWND): string;
    begin
      SetLength(Result, 1024);
      // Trim the string to the number of characters actually returned.
      SetLength(Result, GetWindowText(Handle, PChar(Result), Length(Result)));
    end;
String character indexing example:

    function StripHotKeys(const S: string): string;
    var
      I, J: Integer;
      LastChar: Char;
    begin
      SetLength(Result, Length(S));
      J := 0;
      LastChar := #0;
      for I := 1 to Length(S) do
      begin
        if (S[I] <> '&') or (LastChar = '&') then
        begin
          Inc(J);
          Result[J] := S[I];
        end;
        LastChar := S[I];
      end;
      SetLength(Result, J);
    end;
Code Constructs that Depend on Character Size
Some operations depend on character size. The functions and features in the following list also include a "portable" version, when possible. You can similarly rewrite your code to be portable, that is, the code works with both AnsiString and UnicodeString variables.
- SizeOf(<Char array>) -- use the portable Length(<Char array>).
- Move(<Char buffer>... CharCount) -- use the portable Move(<Char buffer> ... CharCount * SizeOf(Char)).
- Stream Read/Write -- use the portable AnsiString, SizeOf(Char), or the TEncoding class.
- FillChar(<Char array>, <size>, <AnsiChar>) -- use *SizeOf(Char) if filling with #0, or use the portable StringOfChar function.
- GetProcAddress(<module>, <PAnsiChar>) -- use the provided overload function taking a PWideChar.
- Casting or using PChar to do pointer arithmetic -- Place {IFDEF PByte = PChar} at the top of the file if you use PChar for pointer arithmetic. Or use the {$POINTERMATH <ON|OFF>} Delphi compiler directive to turn on pointer arithmetic for all typed pointers, so that increment/decrement is by element size.
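The portable forms of Length, FillChar, and Move listed above might be combined as in the following sketch. The buffer size and the sample text are arbitrary choices for the illustration.

```delphi
program CharSizePortable;
{$APPTYPE CONSOLE}

var
  Source: string;
  Buffer: array[0..15] of Char;
  Count: Integer;
begin
  Source := 'Hello';
  Count := Length(Source);   // element count, not byte count

  // FillChar takes a byte count: SizeOf(Buffer) equals
  // Length(Buffer) * SizeOf(Char), whatever the size of Char is.
  FillChar(Buffer, SizeOf(Buffer), 0);

  // Move also takes a byte count, so scale the element count.
  Move(Source[1], Buffer[0], Count * SizeOf(Char));

  Writeln('Elements in buffer: ', Length(Buffer));
  Writeln('Bytes in buffer:    ', SizeOf(Buffer));
  Writeln('Copied text:        ', PChar(@Buffer[0]));
end.
```

Because every byte count is derived from SizeOf, the same source behaves correctly whether Char is one byte or two.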
Set of Char Constructs
You may need to modify these constructs.
- <Char> in <set of AnsiChar> -- code generation is correct (>#255 characters are never in the set). The compiler warns WideChar reduced in set operations. Depending on your code, you can safely turn off the warning. Alternatively, use the CharInSet function.
- <Char> in LeadBytes -- the global LeadBytes set is for MBCS ANSI locales. UTF-16 still has the notion of a "lead char" (#$D800 - #$DBFF are high surrogate, #$DC00 - #$DFFF are low surrogate). To change this, use the overloaded function IsLeadChar. The ANSI version checks against LeadBytes. The WideChar version checks if it is a high/low surrogate.
- Character classification -- use the TCharacter static class. The Character unit offers functions to classify characters: IsDigit, IsLetter, IsLetterOrDigit, IsSymbol, IsWhiteSpace, IsSurrogatePair, and so on. These are based on table data directly from Unicode.org.
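The replacements for set membership tests might look like the sketch below. It assumes XE4 or later, where the classification routines are also exposed as helper methods on Char (for example `C.IsLetter`); on older versions the equivalent call would go through TCharacter. The Vowels constant is a hypothetical example set.

```delphi
program CharSetDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

const
  // Hypothetical example set; TSysCharSet is a set of AnsiChar.
  Vowels: TSysCharSet = ['a', 'e', 'i', 'o', 'u'];

var
  C: Char;
begin
  C := 'e';

  // CharInSet replaces "C in Vowels" and avoids the set-reduction warning.
  Writeln('In set:     ', CharInSet(C, Vowels));

  // Classification driven by Unicode data (Char helper, an assumption
  // about the compiler version as noted above).
  Writeln('IsLetter:   ', C.IsLetter);
  Writeln('IsDigit:    ', C.IsDigit);

  // The WideChar overload of IsLeadChar reports surrogate code units.
  Writeln('IsLeadChar: ', IsLeadChar(Char($D834)));
end.
```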
Beware of these Constructs
You should examine the following problematic code constructs:
- Casts that obscure the type:
  - AnsiString(Pointer(foo))
  - Review for correctness: what was intended?
- Suspicious casts -- generate a warning:
  - PChar(<AnsiString var>)
  - PAnsiChar(<UnicodeString var>)
- Directly constructing, manipulating, or accessing string internal structures. Some, such as AnsiString, have changed internally, so this is unsafe. Use the StringRefCount, StringCodePage, StringElementSize, and other functions to get string information.
Runtime Library
- Overloads. For functions that took PChar, there are now PAnsiChar and PWideChar versions so the appropriate function gets called.
- AnsiXXX functions are a consideration:
  - SysUtils.AnsiXXXX functions, such as AnsiCompareStr:
    - Remain declared with string and float to UnicodeString.
    - Offer better backward compatibility (no need to change code).
  - The AnsiStrings unit's AnsiXXXX functions offer the same capabilities as the SysUtils.AnsiXXXX functions, but work only for AnsiString. Also, the AnsiStrings.AnsiXXXX functions provide better performance for an AnsiString than SysUtils.AnsiXXXX functions, which work for both AnsiString and UnicodeString, because no implicit conversions are performed.
- Write/Writeln and Read/Readln:
- PByte - declared with $POINTERMATH ON. This allows array indexing and pointer math like PAnsiChar.
- String information functions:
  - StringElementSize returns the actual data size.
  - StringCodePage returns the code page of string data.
  - System.StringRefCount returns the reference count.
- RTL provides helper functions that enable users to do explicit conversions between code pages and element size conversions. If developers are using the Move function on a character array, they cannot make assumptions about the element size. Much of this problem can be mitigated by making sure all RValue references generate the proper calls to RTL to ensure proper element sizes.
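The string information functions listed above can be queried directly instead of inspecting the string header. The sketch below prints the values for one UnicodeString and one AnsiString; the code page reported for the AnsiString depends on the system's active code page, so no particular value is assumed for it.

```delphi
program StringInfoDemo;
{$APPTYPE CONSOLE}

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'Delphi';
  A := AnsiString(U);   // explicit down-conversion; lossless for ASCII text

  // UnicodeString: 2-byte elements, code page 1200 (UTF-16).
  Writeln('UnicodeString: element size = ', StringElementSize(U),
          ', code page = ', StringCodePage(U));

  // AnsiString: 1-byte elements, code page of the active ANSI locale.
  Writeln('AnsiString:    element size = ', StringElementSize(A),
          ', code page = ', StringCodePage(A));
end.
```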
Components and Classes
- TStrings: Stores UnicodeString internally (remains declared as string).
- TWideStrings (may get deprecated) is unchanged. Uses WideString (BSTR) internally.
- TStringStream
  - Has been rewritten -- defaults to the default ANSI encoding for internal storage.
  - Encoding can be overridden.
- Consider using TStringBuilder instead of TStringStream to construct a string from bits and pieces.
- TEncoding
  - Defaults to the user's active code page.
  - Supports UTF-8.
  - Supports UTF-16, big and little endian.
  - Byte Order Mark (BOM) support.
  - You can create descendant classes for user-specific encodings.
- Component streaming (Text DFM files):
  - Are fully backward-compatible.
  - Stream as UTF-8 only if component type, property, or name contains non-ASCII-7 characters.
  - String property values are still streamed in "#" escaped format.
  - May allow values as UTF-8 as well (open issue).
  - Only change in binary format is potential for UTF-8 data for component name, properties, and type name.
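As a rough illustration of TStringBuilder and TEncoding working together, the sketch below assembles a short string and converts it to UTF-8 and UTF-16 byte arrays. It assumes XE2 or later for the System.SysUtils unit scope name; the sample text is arbitrary.

```delphi
program EncodingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Builder: TStringBuilder;
  Text, Decoded: string;
  Utf8Bytes, Utf16Bytes: TBytes;
begin
  // Assemble "naive" with a diaeresis on the i (U+00EF) from pieces.
  Builder := TStringBuilder.Create;
  try
    Builder.Append('na');
    Builder.Append(Char($00EF));
    Builder.Append('ve');
    Text := Builder.ToString;
  finally
    Builder.Free;
  end;

  Utf8Bytes := TEncoding.UTF8.GetBytes(Text);      // no BOM is added
  Utf16Bytes := TEncoding.Unicode.GetBytes(Text);  // UTF-16 little endian
  Decoded := TEncoding.UTF8.GetString(Utf8Bytes);

  Writeln('Char elements: ', Length(Text));        // 5
  Writeln('UTF-8 bytes:   ', Length(Utf8Bytes));   // 6
  Writeln('UTF-16 bytes:  ', Length(Utf16Bytes));  // 10
  Writeln('Round trip ok: ', Decoded = Text);
end.
```

The byte counts differ from the element count because U+00EF takes two bytes in UTF-8 and every element takes two bytes in UTF-16.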
Byte Order Mark
The Byte Order Mark (BOM) should be added to files to indicate their encoding:
- UTF-8 uses EF BB BF.
- UTF-16 Little Endian uses FF FE.
- UTF-16 Big Endian uses FE FF.
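The same byte sequences are available at run time through TEncoding.GetPreamble, which may be convenient when writing a BOM to a stream. In this sketch, HexBytes is a hypothetical helper used only to format the output.

```delphi
program BomDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Formats a byte array as space-separated hexadecimal. Hypothetical helper.
function HexBytes(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex(B, 2) + ' ';
  Result := TrimRight(Result);
end;

begin
  Writeln('UTF-8:     ', HexBytes(TEncoding.UTF8.GetPreamble));             // EF BB BF
  Writeln('UTF-16 LE: ', HexBytes(TEncoding.Unicode.GetPreamble));          // FF FE
  Writeln('UTF-16 BE: ', HexBytes(TEncoding.BigEndianUnicode.GetPreamble)); // FE FF
end.
```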
Steps to Unicode-enable your applications
You need to perform these steps:
- Review char- and string-related functions.
- Rebuild the application.
- Review surrogate pairs.
- Review string payloads.
For more details, see Enabling Your Applications for Unicode.
New Delphi compiler warnings
New warnings have been added to the Delphi compiler related to possible errors in casting types (such as from a UnicodeString or a WideString down to an AnsiString or AnsiChar). When you are converting an application to Unicode, you should enable warnings 1057 and 1058 to assist in finding problem areas in your code.
- W1057 Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST): Emitted when the compiler detects a case where it must implicitly convert an AnsiString (or AnsiChar) to some form of Unicode (a UnicodeString or a WideString). (NOTE: This warning will eventually be enabled by default.)
- W1058 Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS): Emitted when the compiler detects a case where it must implicitly convert some form of Unicode (a UnicodeString or a WideString) down to an AnsiString (or AnsiChar). This is a potentially lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will eventually be enabled by default.)
- W1059 Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST): Emitted when the compiler detects a case where the programmer is explicitly casting an AnsiString (or AnsiChar) to some form of Unicode (UnicodeString or WideString). (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
- W1060 Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS): Emitted when the compiler detects a case where the programmer is explicitly casting some form of Unicode (UnicodeString or WideString) down to AnsiString (or AnsiChar). This is a potentially lossy conversion, since there may be characters in the string that cannot be represented in the code page to which the string is converted. (NOTE: This warning will always be off by default and should only be used to locate potential problems.)
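The two implicit-cast warnings can be switched on per unit with the $WARN directive, using the identifiers given in parentheses above. The following sketch marks the statements that would be expected to trigger each warning; the variable names are arbitrary.

```delphi
program StringCastWarnings;
{$APPTYPE CONSOLE}

{$WARN IMPLICIT_STRING_CAST ON}        // W1057
{$WARN IMPLICIT_STRING_CAST_LOSS ON}   // W1058

var
  A: AnsiString;
  U: string;
begin
  A := 'abc';
  U := A;               // W1057: implicit AnsiString -> UnicodeString cast
  A := U;               // W1058: implicit UnicodeString -> AnsiString cast
  A := AnsiString(U);   // explicit cast: reported only if W1060 is enabled
  Writeln(U, ' ', string(A));
end.
```

Once the flagged assignments have been reviewed, replacing them with explicit casts documents the intent and silences the implicit-cast warnings.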
Recommendations
- Keep source files in UTF-8 format:
  - Files can remain ANSI as long as the source is compiled with the correct code page. Select Project > Options > C++ Compiler > Advanced and use the "Code page" option under Other Options to set the correct code page.
  - Write a UTF-8 BOM to source file. Make sure your source control management system supports these files (most do).
- Perform IDE refactoring when code must be AnsiString or AnsiChar (code is still portable).
- Static code review:
  - Is code merely passing the data along?
  - Is code doing simple character indexing?
- Heed all warnings (elevate to errors):
  - Suspicious pointer casts.
  - Implicit/Explicit casts (coming).
- Determine code intent
  - Is code using a string (AnsiString) as a dynamic array of bytes? If so, use the portable TBytes type (array of Byte) instead.
  - Is a PChar cast used to enable pointer arithmetic? If so, cast to PByte instead and turn $POINTERMATH ON.
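The last two recommendations might be applied as in the sketch below, which holds raw data in a TBytes array and walks it through a PByte pointer. The byte values are arbitrary sample data.

```delphi
program BytesAndPointerMath;
{$APPTYPE CONSOLE}
{$POINTERMATH ON}

uses
  System.SysUtils;

var
  Data: TBytes;
  P: PByte;
  I, Sum: Integer;
begin
  // Raw binary data belongs in TBytes, not in an AnsiString.
  Data := TBytes.Create($10, $20, $30, $40);

  P := @Data[0];
  Sum := 0;
  for I := 0 to Length(Data) - 1 do
    Inc(Sum, P[I]);                     // array-style indexing through PByte

  Writeln('Sum of bytes: ', Sum);       // 160
  Writeln('Third byte:   ', (P + 2)^);  // 48
end.
```

Because PByte always advances one byte at a time, this code does not change meaning when the size of Char changes.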
See Also
- UTF-8 Conversion Routines
- Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, by Cary Jensen
- Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?
- Delphi in a Unicode World Part II: New RTL Features and Classes to Support Unicode
- RAD Studio 2010 Migration Center
- Enabling Applications for Unicode
- Enabling C++ Applications for Unicode
- _TCHAR Mapping (C++)
- Using Unicode in the Command Console
- Using TEncoding for Unicode Files
- How to Handle Delphi AnsiString Code Page Specification in C++
- System.UnicodeString
- System.AnsiString