Enabling Applications for Unicode
This topic describes various semantic code constructs you should review in your existing code to ensure that your applications are compatible with the UnicodeString type. Because Char now equals WideChar, and string equals UnicodeString, previous assumptions about the size in bytes of a character array or string might now be incorrect.
For general information on Unicode, see Unicode in RAD Studio.
Contents
- 1 Setting Up Your Environment for Migrating to Unicode
- 2 Specific Areas to Examine in the Code
- 2.1 Calls to SizeOf
- 2.2 Calls to FillChar
- 2.3 Calls to Move
- 2.4 Calls to Read/ReadBuffer Methods of TStream
- 2.5 Calls to Write/WriteBuffer Methods of TStream
- 2.6 Calls to GetProcAddress
- 2.7 Calls to RegQueryValueEx
- 2.8 Calls to CreateProcessW
- 2.9 Calls to LeadBytes
- 2.10 Calls to TMemoryStream
- 2.11 Calls to MultiByteToWideChar
- 2.12 Calls to SysUtils.AppendStr
- 2.13 Use of Named Threads
- 2.14 Use of PChar Casts to Enable Pointer Arithmetic
- 2.15 Variant Open Array Parameters
- 2.16 Additional Code to Consider
- 3 See Also
Setting Up Your Environment for Migrating to Unicode
Look for any code that:
- Assumes that SizeOf(Char) is 1.
- Assumes that the Length of a string is equal to the number of bytes in the string.
- Directly manipulates strings or PChars.
- Writes and reads strings to or from persistent storage.
The first two assumptions listed here are not true for Unicode, because SizeOf(Char) is now greater than 1 byte, and the Length of a string is half the number of bytes it occupies. In addition, code that writes to or reads from persistent storage needs to ensure that the correct number of bytes is written or read, because a character might no longer be representable as a single byte.
Compiler Flags
Flags have been provided so that you can determine whether string is UnicodeString or AnsiString. You can use them to maintain code that supports older versions of Delphi and C++Builder in the same source. For most code that performs standard string operations, it should not be necessary to have separate UnicodeString and AnsiString code sections. However, if a procedure performs operations that depend on the internal structure of the string data or that interact with external libraries, it might be necessary to have separate code paths for UnicodeString and AnsiString.
Delphi
{$IFDEF UNICODE}
C++
#ifdef _DELPHI_STRING_UNICODE
Compiler Warnings
The Delphi compiler has warnings related to errors in casting types (such as from UnicodeString or WideString down to AnsiString or AnsiChar). When you convert an application to Unicode, you should enable warnings 1057 and 1058 to find problem areas in your code.
Warning # | Warning Text/Name |
---|---|
1057 | Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST) |
1058 | Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS) |
1059 | Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST) |
1060 | Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS) |
- To enable Delphi compiler warnings, go to Project > Options > Compiler Messages.
- To enable C++ compiler warnings, go to Project > Options > C++ Compiler > Warnings.
Specific Areas to Examine in the Code
Calls to SizeOf
Review calls to SizeOf on character arrays for correctness. Consider the following example:

    var
      Count: Integer;
      Buffer: array[0..MAX_PATH - 1] of Char;
    begin
      // Existing code - incorrect when string = UnicodeString
      Count := SizeOf(Buffer);
      GetWindowText(Handle, Buffer, Count);

      // Correct for Unicode
      Count := Length(Buffer); // <<-- Count should be chars not bytes
      GetWindowText(Handle, Buffer, Count);
    end;

SizeOf returns the size of the array in bytes, but GetWindowText expects Count to be in characters. In this case, Length should be used instead of SizeOf. Length functions similarly with arrays and strings. Length applied to an array returns the number of array elements allocated to the array; with string types, Length returns the number of elements in the string.
To find the number of characters contained in a null-terminated string (PAnsiChar or PWideChar), use the StrLen function.
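The difference between a size in bytes and a size in elements can also be sketched in standard C++. This is only an illustration under stated assumptions: `char16_t` stands in for the 2-byte Char, and the names `Char`, `kMaxPath`, `SizeInBytes`, `LengthInChars`, and `StrLenChars` are hypothetical helpers, not part of the RTL or of C++Builder.

```cpp
#include <cstddef>
#include <string>

// Assumption: char16_t stands in for the 2-byte Char of a Unicode build.
using Char = char16_t;

// Hypothetical constant with the same value as the Windows MAX_PATH.
constexpr std::size_t kMaxPath = 260;

// Size of a character array in bytes (the quantity SizeOf reports).
template <std::size_t N>
constexpr std::size_t SizeInBytes(const Char (&)[N]) {
    return N * sizeof(Char);
}

// Size of a character array in elements (the quantity Length reports).
template <std::size_t N>
constexpr std::size_t LengthInChars(const Char (&)[N]) {
    return N;
}

// Number of code units before the terminating null (comparable to StrLen).
inline std::size_t StrLenChars(const Char* s) {
    return std::char_traits<Char>::length(s);
}
```

An API that expects a character count should receive the value of `LengthInChars`; passing `SizeInBytes` would claim a buffer twice as large as the one that exists.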
Calls to FillChar
Review calls to FillChar when used in conjunction with strings or Char. Consider the following code:

    var
      Count: Integer;
      Buffer: array[0..255] of Char;
    begin
      // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
      Count := Length(Buffer);
      FillChar(Buffer, Count, 0);

      // Correct for Unicode
      Count := Length(Buffer) * SizeOf(Char); // <<-- Specify buffer size in bytes
      FillChar(Buffer, Count, 0);
    end;

Length returns the size in elements, but FillChar expects Count to be in bytes. In this example, Length multiplied by the size of Char should be used. In addition, because the default size of a Char is now 2, FillChar fills the string with bytes, not Char as it previously did. For example:

    var
      Buf: array[0..32] of Char;
    begin
      FillChar(Buf, Length(Buf), #9);
    end;

This code does not fill the array with code point $09 but with code point $0909. To get the expected result, the code needs to be changed to this:

    var
      Buf: array[0..32] of Char;
    begin
      // StrPCopy also writes a terminating #0, so leave room for it
      StrPCopy(Buf, StringOfChar(#9, Length(Buf) - 1));
      ...
    end;
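The $0909 effect is easy to reproduce outside Delphi. The following standard C++ sketch contrasts a byte-wise fill with an element-wise fill; `char16_t` stands in for the 2-byte Char, and `FillBytes` and `FillChars` are hypothetical names chosen for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Assumption: char16_t stands in for the 2-byte Char of a Unicode build.
using Char = char16_t;

// Byte-wise fill, comparable to FillChar: every byte of the buffer is set
// to the value, so each 2-byte element ends up as value repeated twice.
template <std::size_t N>
void FillBytes(Char (&buf)[N], unsigned char value) {
    std::memset(buf, value, N * sizeof(Char));
}

// Element-wise fill: every element is set to the requested code unit.
template <std::size_t N>
void FillChars(Char (&buf)[N], Char value) {
    std::fill(buf, buf + N, value);
}
```

Filling with zero is the one case where both approaches give the same result, which is why zeroing a buffer with a byte count is safe while filling with any other character is not.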
Calls to Move
Review calls to Move with strings or character arrays, as in the following example:

    var
      Count: Integer;
      Buf1, Buf2: array[0..255] of Char;
    begin
      // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
      Count := Length(Buf1);
      Move(Buf1, Buf2, Count);

      // Correct for Unicode
      Count := Length(Buf1) * SizeOf(Char); // <<-- Specify buffer size in bytes
      Move(Buf1, Buf2, Count);
    end;

Length returns the size in elements, but Move expects Count to be in bytes. In this case, Length multiplied by the size of Char should be used.
Calls to Read/ReadBuffer Methods of TStream
Review calls to TStream.Read/ReadBuffer when strings or character arrays are used. Consider the following example:

    var
      S: string;
      L: Integer;
      Stream: TStream;
      Temp: AnsiString;
    begin
      // Existing code - incorrect when string = UnicodeString
      Stream.Read(L, SizeOf(Integer));
      SetLength(S, L);
      Stream.Read(Pointer(S)^, L);

      // Correct for Unicode string data
      Stream.Read(L, SizeOf(Integer));
      SetLength(S, L);
      Stream.Read(Pointer(S)^, L * SizeOf(Char)); // <<-- Specify buffer size in bytes

      // Correct for Ansi string data
      Stream.Read(L, SizeOf(Integer));
      SetLength(Temp, L); // <<-- Use temporary AnsiString
      Stream.Read(Pointer(Temp)^, L * SizeOf(AnsiChar)); // <<-- Specify buffer size in bytes
      S := Temp; // <<-- Widen string to Unicode
    end;

The solution depends on the format of the data being read. Use the TEncoding class to assist you in properly encoding stream text.
Calls to Write/WriteBuffer Methods of TStream
Review calls to TStream.Write/WriteBuffer when strings or character arrays are used. Consider the following example:

    var
      S: string;
      Stream: TStream;
      Temp: AnsiString;
      L: Integer;
    begin
      L := Length(S);

      // Existing code - incorrect when string = UnicodeString
      Stream.Write(L, SizeOf(Integer)); // Write string length
      Stream.Write(Pointer(S)^, Length(S));

      // Correct for Unicode data
      Stream.Write(L, SizeOf(Integer));
      Stream.Write(Pointer(S)^, Length(S) * SizeOf(Char)); // <<-- Specify buffer size in bytes

      // Correct for Ansi data
      Temp := S; // <<-- Use temporary AnsiString
      L := Length(Temp); // <<-- Length after conversion; can differ from Length(S) on multibyte code pages
      Stream.Write(L, SizeOf(Integer));
      Stream.Write(Pointer(Temp)^, Length(Temp) * SizeOf(AnsiChar)); // <<-- Specify buffer size in bytes
    end;

The proper code depends on the format of the data being written. Use the TEncoding class to assist you in properly encoding stream text.
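A round trip through a stream shows why the byte count passed to the read and write calls must be the character count scaled by the character size. The sketch below uses standard C++ streams rather than TStream; `WriteString` and `ReadString` are hypothetical helpers, and the on-disk layout (a 32-bit character count followed by raw UTF-16 code units in native byte order) is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Writes a 32-bit character count followed by the raw UTF-16 code units.
inline void WriteString(std::ostream& out, const std::u16string& s) {
    const std::int32_t len = static_cast<std::int32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    // The size argument is in bytes, so scale the character count.
    out.write(reinterpret_cast<const char*>(s.data()),
              static_cast<std::streamsize>(s.size() * sizeof(char16_t)));
}

// Reads back a string stored by WriteString.
inline std::u16string ReadString(std::istream& in) {
    std::int32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::u16string s(static_cast<std::size_t>(len), u'\0');
    // Again, request len characters worth of bytes, not len bytes.
    in.read(reinterpret_cast<char*>(&s[0]),
            static_cast<std::streamsize>(s.size() * sizeof(char16_t)));
    return s;
}
```

If either side used the character count as the byte count, only the first half of the text would be transferred, and the reader would go on to interpret the leftover bytes as the next record.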
Calls to GetProcAddress
Calls to the Windows API function GetProcAddress should always use PAnsiChar, since there is no analogous wide function in the Windows API. This example shows the correct usage:

    procedure CallLibraryProc(const LibraryName, ProcName: string);
    var
      Handle: THandle;
      RegisterProc: function: HResult stdcall;
    begin
      Handle := LoadOleControlLibrary(LibraryName, True);
      @RegisterProc := GetProcAddress(Handle, PAnsiChar(AnsiString(ProcName)));
    end;
Calls to RegQueryValueEx
In RegQueryValueEx, the Len parameter receives and returns the number of bytes, not characters. The Unicode version thus requires a value twice as large for the Len parameter.
Here is a sample RegQueryValueEx call:

    Len := MAX_PATH;
    if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS then
      SetString(Result, Data, Len - 1) // Len includes #0
    else
      RaiseLastOSError;

This must be changed to this:

    Len := MAX_PATH * SizeOf(Char);
    if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS then
      SetString(Result, Data, Len div SizeOf(Char) - 1) // Len includes #0, Len contains the number of bytes
    else
      RaiseLastOSError;
Calls to CreateProcessW
The Unicode version of the Windows API function CreateProcess, CreateProcessW, behaves slightly differently than the ANSI version. To quote MSDN in reference to the lpCommandLine parameter:
"The Unicode version of this function, CreateProcessW, can modify the contents of this string. Therefore, this parameter cannot be a pointer to read-only memory (such as a const variable or a literal string). If this parameter is a constant string, the function might cause an access violation."
Because of this problem, existing code that calls CreateProcess might cause access violations.
Here are examples of such problematic code:

    // Passing in a string constant
    CreateProcess(nil, 'foo.exe', nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);

    // Passing in a constant expression
    const
      cMyExe = 'foo.exe';
    CreateProcess(nil, cMyExe, nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);

    // Passing in a string whose refcount is -1:
    const
      cMyExe = 'foo.exe';
    var
      sMyExe: string;
    sMyExe := cMyExe;
    CreateProcess(nil, PChar(sMyExe), nil, nil, False, 0, nil, nil, StartupInfo, ProcessInfo);
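What these cases have in common is that the command line ends up in memory the callee must not write to. One common way around this is to hand the API a private, writable copy. The following is a minimal standard C++ sketch of that idea only; `MakeWritableCommandLine` is a hypothetical helper, and the actual CreateProcessW call is deliberately left out so that the example stays portable.

```cpp
#include <string>
#include <vector>

// Returns a writable, null-terminated copy of the command line. The copy
// lives in the vector, so a callee may modify it without touching the
// original (possibly read-only) text.
inline std::vector<wchar_t> MakeWritableCommandLine(const std::wstring& command_line) {
    std::vector<wchar_t> buffer(command_line.begin(), command_line.end());
    buffer.push_back(L'\0');
    return buffer;
}
```

On Windows, `buffer.data()` would be the value passed as lpCommandLine. In Delphi code, calling UniqueString on the string variable before casting it to PChar is a way to obtain a similarly private copy.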
Calls to LeadBytes
Previously, LeadBytes listed all values that could be the first byte of a double byte character on the local system. Replace code like this:

    if Str[I] in LeadBytes then

with a call to the IsLeadChar function:

    if IsLeadChar(Str[I]) then
Calls to TMemoryStream
In cases where a TMemoryStream is used to write a text file, it is useful to write a Byte Order Mark (BOM) before writing anything else to the file. Here is an example of writing the BOM to a file:

    var
      Bom: TBytes;
      tms: TMemoryStream;
    begin
      ...
      Bom := TEncoding.UTF8.GetPreamble;
      tms.Write(Bom[0], Length(Bom));

Any code that writes to a file needs to be changed to UTF-8 encode the Unicode string:

    var
      Temp: Utf8String;
      tms: TMemoryStream;
    begin
      ...
      Temp := Utf8Encode(Str); // Str is string being written to file
      tms.Write(Pointer(Temp)^, Length(Temp));
      // Write(Pointer(Str)^, Length(Str)); original call to write string to file
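To make the byte layout concrete, here is a standard C++ sketch of what ends up in the file: the three-byte UTF-8 preamble followed by the UTF-8 encoding of the UTF-16 text. It is not the RTL implementation; `Utf8Bom`, `EncodeUtf8`, and `EncodeFileContents` are hypothetical names, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// UTF-8 byte order mark: EF BB BF.
inline std::string Utf8Bom() { return "\xEF\xBB\xBF"; }

// Encodes UTF-16 text as UTF-8. Unpaired surrogates become U+FFFD.
inline std::string EncodeUtf8(const std::u16string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = text[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size() &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Bytes to write to the file: preamble first, then the encoded text.
inline std::string EncodeFileContents(const std::u16string& text) {
    return Utf8Bom() + EncodeUtf8(text);
}
```

Note that the number of bytes written is the length of the encoded result, not the length of the original string: a single UTF-16 code unit can occupy one, two, or three bytes in UTF-8, and a surrogate pair occupies four.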
Calls to MultiByteToWideChar
Calls to the Windows API function MultiByteToWideChar can simply be replaced with an assignment. An example using MultiByteToWideChar:

    procedure TWideCharStrList.AddString(const S: string);
    var
      Size, D: Integer;
    begin
      Size := Length(S);
      D := (Size + 1) * SizeOf(WideChar);
      FList[FUsed] := AllocMem(D);
      MultiByteToWideChar(0, 0, PChar(S), Size, FList[FUsed], D);
      Inc(FUsed);
    end;

After the change to Unicode, this call was changed to support compiling under both ANSI and Unicode:

    procedure TWideCharStrList.AddString(const S: string);
    {$IFNDEF UNICODE}
    var
      L, D: Integer;
    {$ENDIF}
    begin
    {$IFDEF UNICODE}
      FList[FUsed] := StrNew(PWideChar(S));
    {$ELSE}
      L := Length(S);
      D := (L + 1) * SizeOf(WideChar);
      FList[FUsed] := AllocMem(D);
      MultiByteToWideChar(0, 0, PAnsiChar(S), L, FList[FUsed], D);
    {$ENDIF}
      Inc(FUsed);
    end;
Calls to SysUtils.AppendStr
AppendStr is deprecated and is hard-coded to use AnsiString, and no UnicodeString overload is available. Replace calls like this:

    AppendStr(String1, String2);

with code like this:

    String1 := String1 + String2;

You can also use the new TStringBuilder class.
Use of Named Threads
Existing Delphi code that uses named threads must change. In previous versions, when you used the new Thread Object item in the gallery to create a new thread, it created the following type declaration in the new thread's unit:

    type
      TThreadNameInfo = record
        FType: LongWord;     // must be 0x1000
        FName: PChar;        // pointer to name (in user address space)
        FThreadID: LongWord; // thread ID (-1 indicates caller thread)
        FFlags: LongWord;    // reserved for future use, must be zero
      end;

The debugger's named thread handler expects the FName member to be ANSI data, not Unicode, so the above declaration needs to be changed to the following:

    type
      TThreadNameInfo = record
        FType: LongWord;     // must be 0x1000
        FName: PAnsiChar;    // pointer to name (in user address space)
        FThreadID: LongWord; // thread ID (-1 indicates caller thread)
        FFlags: LongWord;    // reserved for future use, must be zero
      end;

New named threads are created with the updated type declaration. Only code that was created in a previous Delphi version needs to be manually updated.
If you want to use Unicode characters or strings in a thread name, you must encode the string in UTF-8 for the debugger to handle it properly. For instance:

    ThreadNameInfo.FName := UTF8String('UnicodeThread_фис');

Note: C++Builder thread objects have always used the correct type, so this is not an issue in C++Builder code.
Use of PChar Casts to Enable Pointer Arithmetic
In versions prior to 2009, not all Pointer types supported pointer arithmetic. Because of this, the practice of casting various non-char pointers to PChar was used to enable pointer arithmetic. Now, enable pointer arithmetic by using the new $POINTERMATH compiler directive, which is specifically enabled for the PByte type.
Here is an example of code that casts pointer data to PChar for the purpose of performing pointer arithmetic on it:

    function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
    begin
      if (Node = FRoot) or (Node = nil) then
        Result := nil
      else
        Result := PChar(Node) + FInternalDataOffset;
    end;

You should change this to use PByte rather than PChar:

    function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
    begin
      if (Node = FRoot) or (Node = nil) then
        Result := nil
      else
        Result := PByte(Node) + FInternalDataOffset;
    end;

In the above sample, Node is not actually character data. It is cast to a PChar to use pointer arithmetic to access data that is a certain number of bytes after Node. This worked previously, because SizeOf(Char) equaled SizeOf(Byte). This is no longer true, so such code needs to be changed to use PByte rather than PChar. Without this change, Result points to incorrect data.
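The same scaling rule applies to pointer arithmetic in C++: adding n to a pointer advances it by n elements, not n bytes. The sketch below is a standard C++ illustration of a byte-based offset; `OffsetBytes` is a hypothetical helper and is not related to the virtual tree code shown above.

```cpp
#include <cstddef>

// Advances a pointer by a number of BYTES, regardless of what it points at.
// Going through unsigned char* keeps the step size at exactly one byte.
inline void* OffsetBytes(void* base, std::ptrdiff_t byte_offset) {
    return static_cast<unsigned char*>(base) + byte_offset;
}
```

With a 2-byte character type, `ptr + 4` on a character pointer lands 8 bytes past the start, which is exactly the kind of silent doubling that the PChar cast introduces once Char is no longer one byte wide.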
Variant Open Array Parameters
If you have code that uses TVarRec to handle variant open array parameters, you might need to augment it to handle UnicodeString. A new type code, vtUnicodeString, is defined for UnicodeString, and the string data is carried in the vUnicodeString field. The following sample shows a case where new code has been added to handle the UnicodeString type.

    procedure RegisterPropertiesInCategory(const CategoryName: string;
      const Filters: array of const); overload;
    var
      I: Integer;
    begin
      if Assigned(RegisterPropertyInCategoryProc) then
        for I := Low(Filters) to High(Filters) do
          with Filters[I] do
            case vType of
              vtPointer:
                RegisterPropertyInCategoryProc(CategoryName, nil, PTypeInfo(vPointer), '');
              vtClass:
                RegisterPropertyInCategoryProc(CategoryName, vClass, nil, '');
              vtAnsiString:
                RegisterPropertyInCategoryProc(CategoryName, nil, nil, string(vAnsiString));
              vtUnicodeString:
                RegisterPropertyInCategoryProc(CategoryName, nil, nil, string(vUnicodeString));
            else
              raise Exception.CreateResFmt(@sInvalidFilter, [I, vType]);
            end;
    end;
Additional Code to Consider
Search for the following additional code constructs to locate Unicode enabling problems:
- AllocMem
- AnsiChar
- of AnsiChar
- AnsiString
- of Char
- Copy
- GetMem
- Length
- PAnsiChar
- Pointer
- Seek
- ShortString
- string
Code containing such constructs might need to be changed to properly support the UnicodeString type.