Enabling Applications for Unicode

This topic describes various semantic code constructs you should review in your existing code to ensure that your applications are compatible with the UnicodeString type. Because Char now equals WideChar, and string equals UnicodeString, previous assumptions about the size in bytes of a character array or string might now be incorrect.

For general information on Unicode, see Unicode in RAD Studio.

Setting Up Your Environment for Migrating to Unicode

Look for any code that:

  • Assumes that SizeOf(Char) is 1.
  • Assumes that the Length of a string is equal to the number of bytes in the string.
  • Directly manipulates strings or PChars.
  • Writes and reads strings to or from persistent storage.

The first two assumptions do not hold for Unicode: SizeOf(Char) is now 2 bytes, and the Length of a string is half the number of bytes it occupies. In addition, code that writes to or reads from persistent storage must ensure that the correct number of bytes is written or read, because a character might no longer be representable as a single byte.
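
The difference can be checked with a small console program. This is only an illustrative sketch; the program name and the sample string are made up for the example, and the values in the comments assume a compiler where string = UnicodeString.

program CharSizeDemo;
{$APPTYPE CONSOLE}
var
  S: string;
begin
  S := 'Unicode';
  Writeln(SizeOf(Char));             // 2 when Char = WideChar
  Writeln(Length(S));                // 7 elements
  Writeln(Length(S) * SizeOf(Char)); // 14 bytes of character data
end.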

Compiler Flags

Flags have been provided so that you can determine whether string is UnicodeString or AnsiString. You can use them to maintain code that supports older versions of Delphi and C++Builder in the same source. For most code that performs standard string operations, separate UnicodeString and AnsiString code sections should not be necessary. However, if a procedure depends on the internal structure of the string data or interacts with external libraries, it might need separate code paths for UnicodeString and AnsiString.

Delphi

{$IFDEF UNICODE}

C++

 #ifdef _DELPHI_STRING_UNICODE

Compiler Warnings

The Delphi compiler has warnings related to errors in casting types (such as from UnicodeString or WideString down to AnsiString or AnsiChar). When you convert an application to Unicode, you should enable warnings 1057 and 1058 to find problem areas in your code.

Warning #   Warning Text/Name
W1057       Implicit string cast from '%s' to '%s' (IMPLICIT_STRING_CAST)
W1058       Implicit string cast with potential data loss from '%s' to '%s' (IMPLICIT_STRING_CAST_LOSS)
W1059       Explicit string cast from '%s' to '%s' (EXPLICIT_STRING_CAST)
W1060       Explicit string cast with potential data loss from '%s' to '%s' (EXPLICIT_STRING_CAST_LOSS)


  • To enable Delphi compiler warnings, go to Project > Options > Compiler Messages.
  • To enable C++ compiler warnings, go to Project > Options > C++ Compiler > Warnings.
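
In Delphi, the warnings can also be switched on in source with the $WARN directive, using the names shown in parentheses above. The following is a minimal sketch; the procedure name CastDemo is hypothetical, and the comments describe the diagnostics you would expect to see, which may vary with your project settings.

{$WARN IMPLICIT_STRING_CAST ON}       // W1057
{$WARN IMPLICIT_STRING_CAST_LOSS ON}  // W1058

procedure CastDemo;
var
  U: string;      // UnicodeString
  A: AnsiString;
begin
  U := 'some text';
  A := U;             // implicit narrowing: expect W1058
  U := A;             // implicit widening: expect W1057
  A := AnsiString(U); // explicit cast states the intent in the source
end;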

Specific Areas to Examine in the Code

Calls to SizeOf

Review calls to SizeOf on character arrays for correctness. Consider the following example:


var
  Count: Integer;
  Buffer: array[0..MAX_PATH - 1] of Char;
begin
  // Existing code - incorrect when string = UnicodeString
  Count := SizeOf(Buffer);
  GetWindowText(Handle, Buffer, Count);

  // Correct for Unicode
  Count := Length(Buffer); // <<-- Count should be chars not bytes
  GetWindowText(Handle, Buffer, Count);
end;

SizeOf returns the size of the array in bytes, but GetWindowText expects Count to be in characters. In this case, Length should be used instead of SizeOf. Length functions similarly with arrays and strings. Length applied to an array returns the number of array elements allocated to the array; with string types, Length returns the number of elements in the string.

To find the number of characters contained in a null-terminated string (PAnsiChar or PWideChar), use the StrLen function.
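
The following sketch contrasts the three measurements on one buffer. The buffer size and the sample text are arbitrary, and the values in the comments assume Char = WideChar.

var
  Buffer: array[0..259] of Char;
begin
  StrPCopy(Buffer, 'notepad.exe');
  Writeln(Length(Buffer)); // 260: elements allocated to the array
  Writeln(StrLen(Buffer)); // 11: characters before the terminating #0
  Writeln(SizeOf(Buffer)); // 520: size of the array in bytes
end;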

Calls to FillChar

Review calls to FillChar when used in conjunction with strings or Char. Consider the following code:

var
  Count: Integer;
  Buffer: array[0..255] of Char;
begin
   // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
   Count := Length(Buffer);
   FillChar(Buffer, Count, 0);

   // Correct for Unicode
   Count := Length(Buffer) * SizeOf(Char); // <<-- Specify buffer size in bytes
   FillChar(Buffer, Count, 0);
end;

Length returns the size in elements, but FillChar expects Count to be in bytes. In this example, Length multiplied by the size of Char should be used. In addition, because the default size of a Char is now 2, FillChar fills the string with bytes, not Char as it previously did. For example:


var
  Buf: array[0..32] of Char;
begin
  FillChar(Buf, Length(Buf), #9);
end;

However, this code does not fill the array with code point $09. Because FillChar writes bytes, each Char it touches becomes code point $0909. To get the expected result, change the code to this:

var
  Buf: array[0..32] of Char;
begin
  // Leave room for the #0 terminator that StrPCopy appends
  StrPCopy(Buf, StringOfChar(#9, Length(Buf) - 1));
...
end;
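
If every element of the array must hold the character and no null terminator is wanted, a plain loop is a simple alternative. This is a sketch rather than a recommended replacement for every FillChar call:

var
  Buf: array[0..32] of Char;
  I: Integer;
begin
  for I := Low(Buf) to High(Buf) do
    Buf[I] := #9; // assigns a whole Char, whatever SizeOf(Char) is
end;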

Calls to Move

Review calls to Move with strings or character arrays, as in the following example:


var
   Count: Integer;
   Buf1, Buf2: array[0..255] of Char;
begin
  // Existing code - incorrect when string = UnicodeString (when char = 2 bytes)
  Count := Length(Buf1);
  Move(Buf1, Buf2, Count);

  // Correct for Unicode
  Count := Length(Buf1) * SizeOf(Char); // <<-- Specify buffer size in bytes
  Move(Buf1, Buf2, Count);
end;

Length returns the size in elements, but Move expects Count to be in bytes. In this case, Length multiplied by the size of Char should be used.
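
For static arrays such as these, SizeOf already returns a byte count, so it can serve as a shorter alternative. The sketch below assumes both buffers have the same declared size. It does not carry over to strings or dynamic arrays, where SizeOf returns the size of a pointer.

var
  Buf1, Buf2: array[0..255] of Char;
begin
  // SizeOf on a static array is the array size in bytes
  Move(Buf1, Buf2, SizeOf(Buf1));
end;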

Calls to Read/ReadBuffer Methods of TStream

Review calls to TStream.Read/ReadBuffer when strings or character arrays are used. Consider the following example:


var
  S: string;
  L: Integer;
  Stream: TStream;
  Temp: AnsiString;
begin
  // Existing code - incorrect when string = UnicodeString
  Stream.Read(L, SizeOf(Integer));
  SetLength(S, L);
  Stream.Read(Pointer(S)^, L);

  // Correct for Unicode string data
  Stream.Read(L, SizeOf(Integer));
  SetLength(S, L);
  Stream.Read(Pointer(S)^, L * SizeOf(Char));  // <<-- Specify buffer size in bytes

  // Correct for Ansi string data
  Stream.Read(L, SizeOf(Integer));
  SetLength(Temp, L);              // <<-- Use temporary AnsiString
  Stream.Read(Pointer(Temp)^, L * SizeOf(AnsiChar));  // <<-- Specify buffer size in bytes
  S := Temp;                       // <<-- Widen string to Unicode
end;

The solution depends on the format of the data being read. Use the TEncoding class to assist you in properly encoding stream text.
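
As an illustration of the TEncoding approach, the sketch below reads a string that is assumed to be stored as an Integer byte count followed by that many UTF-8 bytes. The function name ReadUtf8String and the storage layout are assumptions made for this example, not part of any existing file format.

function ReadUtf8String(Stream: TStream): string;
var
  ByteCount: Integer;
  Bytes: TBytes;
begin
  // The prefix is taken to be a count of bytes, not characters
  Stream.ReadBuffer(ByteCount, SizeOf(ByteCount));
  SetLength(Bytes, ByteCount);
  if ByteCount > 0 then
    Stream.ReadBuffer(Bytes[0], ByteCount);
  // Decode the raw bytes into a UnicodeString
  Result := TEncoding.UTF8.GetString(Bytes);
end;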

Calls to Write/WriteBuffer Methods of TStream

Review calls to TStream.Write/WriteBuffer when strings or character arrays are used. Consider the following example:


var
  S: string;
  Stream: TStream;
  Temp: AnsiString;
  L: Integer;
begin
  L := Length(S);
  
  // Existing code
  // Incorrect when string = UnicodeString
  Stream.Write(L, SizeOf(Integer)); // Write string length
  Stream.Write(Pointer(S)^, Length(S));
  
  // Correct for Unicode data
  Stream.Write(L, SizeOf(Integer));
  Stream.Write(Pointer(S)^, Length(S) * SizeOf(Char)); // <<-- Specify buffer size in bytes
  
  // Correct for Ansi data
  Stream.Write(L, SizeOf(Integer));
  Temp := S;          // <<-- Use temporary AnsiString
  Stream.Write(Pointer(Temp)^, Length(Temp) * SizeOf(AnsiChar));// <<-- Specify buffer size in bytes
end;

The proper code depends on the format of the data being written. Use the TEncoding class to assist you in properly encoding stream text.
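
As an illustration of the TEncoding approach, the sketch below writes a string as an Integer byte count followed by its UTF-8 bytes. The procedure name WriteUtf8String and the storage layout are assumptions made for this example.

procedure WriteUtf8String(Stream: TStream; const S: string);
var
  Bytes: TBytes;
  ByteCount: Integer;
begin
  // Encode first, so that the prefix reflects the encoded byte count
  Bytes := TEncoding.UTF8.GetBytes(S);
  ByteCount := Length(Bytes);
  Stream.WriteBuffer(ByteCount, SizeOf(ByteCount));
  if ByteCount > 0 then
    Stream.WriteBuffer(Bytes[0], ByteCount);
end;

Writing the byte count rather than Length(S) keeps the prefix correct even when a character encodes to more than one byte.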

Calls to GetProcAddress

Calls to the Windows API function GetProcAddress should always use PAnsiChar, since there is no analogous wide function in the Windows API. This example shows the correct usage:


procedure CallLibraryProc(const LibraryName, ProcName: string);
var
  Handle: THandle;
  RegisterProc: function: HResult stdcall;
begin
  Handle := LoadOleControlLibrary(LibraryName, True);
  @RegisterProc := GetProcAddress(Handle, PAnsiChar(AnsiString(ProcName)));
end;

Calls to RegQueryValueEx

In RegQueryValueEx, the Len parameter receives and returns a number of bytes, not characters. The Unicode version therefore requires a value twice as large for the Len parameter.

Here is a sample RegQueryValueEx call:

Len := MAX_PATH;
if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS
then
  SetString(Result, Data, Len - 1) // Len includes #0
else
  RaiseLastOSError;


This must be changed to this:


Len := MAX_PATH * SizeOf(Char);
if RegQueryValueEx(reg, PChar(Name), nil, nil, PByte(@Data[0]), @Len) = ERROR_SUCCESS
then
  SetString(Result, Data, Len div SizeOf(Char) - 1) // Len includes #0, Len contains the number of bytes
else
  RaiseLastOSError;

Calls to CreateProcessW

The Unicode version of the Windows API function CreateProcess, CreateProcessW, behaves slightly differently than the ANSI version. To quote MSDN in reference to the lpCommandLine parameter:

"The Unicode version of this function, CreateProcessW, can modify the contents of this string. Therefore, this parameter cannot be a pointer to read-only memory (such as a const variable or a literal string). If this parameter is a constant string, the function might cause an access violation."

Because of this problem, existing code that calls CreateProcess might cause access violations.

Here are examples of such problematic code:


// Passing in a string constant
CreateProcess(nil, 'foo.exe', nil, nil, False, 0,
  nil, nil, StartupInfo, ProcessInfo);

// Passing in a constant expression
const
  cMyExe = 'foo.exe';
begin
  CreateProcess(nil, cMyExe, nil, nil, False, 0,
    nil, nil, StartupInfo, ProcessInfo);

// Passing in a string whose refcount is -1:
const
  cMyExe = 'foo.exe';
var
  sMyExe: string;
begin
  sMyExe := cMyExe;
  CreateProcess(nil, PChar(sMyExe), nil, nil, False, 0,
    nil, nil, StartupInfo, ProcessInfo);
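
One way to avoid the problem is to pass a string that is guaranteed to live in writable memory. The sketch below copies the command line into a local variable and calls UniqueString on it before the call. The function name LaunchProcess is hypothetical, and the handling of the process handles is reduced to the minimum.

function LaunchProcess(const CommandLine: string): Boolean;
var
  WritableCmd: string;
  StartupInfo: TStartupInfo;
  ProcessInfo: TProcessInformation;
begin
  WritableCmd := CommandLine;
  UniqueString(WritableCmd); // force a private, writable copy
  FillChar(StartupInfo, SizeOf(StartupInfo), 0);
  StartupInfo.cb := SizeOf(StartupInfo);
  Result := CreateProcess(nil, PChar(WritableCmd), nil, nil, False, 0,
    nil, nil, StartupInfo, ProcessInfo);
  if Result then
  begin
    // This sketch does not wait for the process, so release the handles
    CloseHandle(ProcessInfo.hThread);
    CloseHandle(ProcessInfo.hProcess);
  end;
end;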

Calls to LeadBytes

Previously, LeadBytes listed all values that could be the first byte of a double byte character on the local system. Replace code like this:

if Str[I] in LeadBytes then

with a call to the IsLeadChar function:

if IsLeadChar(Str[I]) then

Calls to TMemoryStream

In cases where a TMemoryStream is used to write a text file, it is useful to write a Byte Order Mark (BOM) before writing anything else to the file. Here is an example of writing the BOM to a file:


var
  Bom: TBytes;
  tms: TMemoryStream;
begin
  ...
  Bom := TEncoding.UTF8.GetPreamble;
  tms.Write(Bom[0], Length(Bom));

Any code that writes to a file needs to be changed to UTF-8 encode the Unicode string:


var
  Temp: Utf8String;
  tms: TMemoryStream;
begin
  ...
  Temp := Utf8Encode(Str); // Str is the string being written to the file
  tms.Write(Pointer(Temp)^, Length(Temp));
  // Original call: Write(Pointer(Str)^, Length(Str));
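
Put together, the BOM and the UTF-8 encoded text might be written as in the following sketch. The procedure name SaveUtf8Text and its FileName parameter are assumptions made for this example.

procedure SaveUtf8Text(const Str, FileName: string);
var
  tms: TMemoryStream;
  Bom: TBytes;
  Temp: UTF8String;
begin
  tms := TMemoryStream.Create;
  try
    // Write the BOM first
    Bom := TEncoding.UTF8.GetPreamble;
    if Length(Bom) > 0 then
      tms.WriteBuffer(Bom[0], Length(Bom));
    // Then the text, encoded as UTF-8; Length(Temp) is a byte count
    Temp := UTF8Encode(Str);
    if Length(Temp) > 0 then
      tms.WriteBuffer(Pointer(Temp)^, Length(Temp));
    tms.SaveToFile(FileName);
  finally
    tms.Free;
  end;
end;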

Calls to MultiByteToWideChar

Calls to the Windows API function MultiByteToWideChar can simply be replaced with an assignment. An example using MultiByteToWideChar:


procedure TWideCharStrList.AddString(const S: string);
var
  Size, D: Integer;
begin
  Size := Length(S);
  D := (Size + 1) * SizeOf(WideChar);
  FList[FUsed] := AllocMem(D);
  MultiByteToWideChar(0, 0, PChar(S), Size, FList[FUsed], D);
  Inc(FUsed);
end;

After the change to Unicode, this call was changed to support compiling under both ANSI and Unicode:


procedure TWideCharStrList.AddString(const S: string);
{$IFNDEF UNICODE}
var
  L, D: Integer;
{$ENDIF}
begin
{$IFDEF UNICODE}
  FList[FUsed] := StrNew(PWideChar(S));
{$ELSE}
  L := Length(S);
  D := (L + 1) * SizeOf(WideChar);
  FList[FUsed] := AllocMem(D);
  MultiByteToWideChar(0, 0, PAnsiChar(S), L, FList[FUsed], D);
{$ENDIF}
  Inc(FUsed);
end;
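
Where the goal is simply to obtain a Unicode string from ANSI data, the conversion reduces to an assignment, as in this minimal sketch. The variable names are illustrative only.

var
  A: AnsiString;
  W: UnicodeString;
begin
  A := 'plain ANSI text';
  W := UnicodeString(A); // the conversion uses the code page of A
end;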

Calls to SysUtils.AppendStr

AppendStr is deprecated and is hard-coded to use AnsiString, and no UnicodeString overload is available. Replace calls like this:

AppendStr(String1, String2);

with code like this:

 String1 := String1 + String2;

You can also use the new TStringBuilder class.
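
A minimal sketch of the TStringBuilder variant follows, assuming String1 and String2 are declared as in the call above. For a single concatenation the + operator is simpler. A builder is mainly useful when many pieces are appended.

var
  Builder: TStringBuilder;
begin
  Builder := TStringBuilder.Create;
  try
    Builder.Append(String1);
    Builder.Append(String2);
    String1 := Builder.ToString;
  finally
    Builder.Free;
  end;
end;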

Use of Named Threads

Existing Delphi code that uses named threads must change. In previous versions, when you used the new Thread Object item in the gallery to create a new thread, it created the following type declaration in the new thread's unit:

type
TThreadNameInfo = record
  FType: LongWord; // must be 0x1000
  FName: PChar; // pointer to name (in user address space)
  FThreadID: LongWord; // thread ID (-1 indicates caller thread)
  FFlags: LongWord; // reserved for future use, must be zero
end;


The debugger's named thread handler expects the FName member to be ANSI data, not Unicode, so the above declaration needs to be changed to the following:


type
TThreadNameInfo = record
  FType: LongWord; // must be 0x1000
  FName: PAnsiChar; // pointer to name (in user address space)
  FThreadID: LongWord; // thread ID (-1 indicates caller thread)
  FFlags: LongWord; // reserved for future use, must be zero
end;

New named threads are created with the updated type declaration. Only code that was created in a previous Delphi version needs to be manually updated.

If you want to use Unicode characters or strings in a thread name, you must encode the string in UTF-8 for the debugger to handle it properly. For instance:

ThreadNameInfo.FName := UTF8String('UnicodeThread_фис');

Note: C++Builder thread objects have always used the correct type, so this is not an issue in C++Builder code.

Use of PChar Casts to Enable Pointer Arithmetic

In versions prior to 2009, not all Pointer types supported pointer arithmetic. Because of this, the practice of casting various non-char pointers to PChar was used to enable pointer arithmetic. Now, enable pointer arithmetic by using the new $POINTERMATH compiler directive, which is specifically enabled for the PByte type.

Here is an example of code that casts pointer data to PChar for the purpose of performing pointer arithmetic on it:


function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
begin
  if (Node = FRoot) or (Node = nil) then
    Result := nil
  else
    Result := PChar(Node) + FInternalDataOffset;
end;

You should change this to use PByte rather than PChar:


function TCustomVirtualStringTree.InternalData(Node: PVirtualNode): Pointer;
begin
  if (Node = FRoot) or (Node = nil) then
    Result := nil
  else
    Result := PByte(Node) + FInternalDataOffset;
end;

In the above sample, Node is not actually character data. It is cast to PChar only so that pointer arithmetic can reach data located a certain number of bytes after Node. This worked previously because SizeOf(Char) equaled SizeOf(Byte). That is no longer true, so such code must use PByte rather than PChar. Without this change, Result points to incorrect data.
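
For pointer types other than PByte, the $POINTERMATH directive can be turned on around the type declaration. The sketch below uses a hypothetical pointer type PIntElem; the offsets are scaled by the size of the element type, not counted in bytes.

{$POINTERMATH ON}
type
  PIntElem = ^Integer; // hypothetical type, declared with pointer math enabled
{$POINTERMATH OFF}

procedure PointerMathDemo;
var
  Values: array[0..3] of Integer;
  P, Q: PIntElem;
begin
  P := @Values[0];
  Q := P + 2;  // advances by 2 * SizeOf(Integer) bytes
  Q^ := 42;    // same element as Values[2]
  P[3] := 7;   // array-style indexing, same element as Values[3]
end;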

Variant Open Array Parameters

If you have code that uses TVarRec to handle variant open array parameters, you might need to extend it to handle UnicodeString. A new type code, vtUnicodeString, is defined for UnicodeString, and the string data is held in the VUnicodeString field. The following sample shows a case where new code has been added to handle the UnicodeString type.


procedure RegisterPropertiesInCategory(const CategoryName: string;
  const Filters: array of const); overload;
var
  I: Integer;
begin
  if Assigned(RegisterPropertyInCategoryProc) then
    for I := Low(Filters) to High(Filters) do
      with Filters[I] do
        case vType of
          vtPointer:
            RegisterPropertyInCategoryProc(CategoryName, nil,
              PTypeInfo(vPointer), '');
          vtClass:
            RegisterPropertyInCategoryProc(CategoryName, vClass, nil, '');
          vtAnsiString:
            RegisterPropertyInCategoryProc(CategoryName, nil, nil,
              string(vAnsiString));
          vtUnicodeString:
            RegisterPropertyInCategoryProc(CategoryName, nil, nil,
              string(vUnicodeString));
        else
          raise Exception.CreateResFmt(@sInvalidFilter, [I, vType]);
        end;
end;

Additional Code to Consider

Search for the following additional code constructs to locate Unicode enabling problems:

  • AllocMem
  • AnsiChar
  • of AnsiChar
  • AnsiString
  • of Char
  • Copy
  • GetMem
  • Length
  • PAnsiChar
  • Pointer
  • Seek
  • ShortString
  • string

Code containing such constructs might need to be changed to properly support the UnicodeString type.

See Also