Using TEncoding for Unicode Files

Beginning with the 2009 product version, Delphi and C++Builder use Unicode as the default string type.

Reading and Writing in the Old Format

Many Delphi applications need to continue to interact with other applications or data sources, many of which can only handle data in ANSI or ASCII. For this reason, the methods of the TStrings class by default write ANSI-encoded files (based on the active code page) and read files according to whether or not they contain a Byte Order Mark (BOM).

If a BOM is found, a method of TStrings reads the data encoded as the BOM indicates. If no BOM is found, the method reads the data as ANSI and up-converts it based on the current active code page.

Files written with versions of Delphi prior to RAD Studio 2009 can still be read, provided that the same active code page is used for both reading and writing. Likewise, any file written with RAD Studio 2009 that has an ASCII encoding should be readable with any version prior to RAD Studio 2009.

Any file written with RAD Studio 2009 that uses any other encoding is written with a BOM and is not readable with a version prior to RAD Studio 2009. Currently, only the most common BOM formats are detected (UTF-16 Little-Endian, UTF-16 Big-Endian, and UTF-8).
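
The BOM check described above can be tried out directly with TEncoding.GetBufferEncoding. The following is a minimal sketch, not a description of how TStrings is implemented internally; the program name and the DescribeBom helper are hypothetical, and the scoped unit name System.SysUtils assumes XE2 or later (use SysUtils on earlier versions).

```delphi
program DetectBom;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: reports which BOM, if any, starts the buffer.
function DescribeBom(const Buffer: TBytes): string;
var
  Enc: TEncoding;
  BomLen: Integer;
begin
  Enc := nil; // nil asks GetBufferEncoding to detect the encoding
  BomLen := TEncoding.GetBufferEncoding(Buffer, Enc);
  if BomLen = 0 then
    Result := 'no BOM, treated as ANSI'
  else if Enc = TEncoding.UTF8 then
    Result := 'UTF-8'
  else if Enc = TEncoding.Unicode then
    Result := 'UTF-16 little-endian'
  else if Enc = TEncoding.BigEndianUnicode then
    Result := 'UTF-16 big-endian'
  else
    Result := 'other';
end;

begin
  Writeln(DescribeBom(TBytes.Create($EF, $BB, $BF, $41))); // starts with a UTF-8 BOM
  Writeln(DescribeBom(TBytes.Create($FF, $FE, $41, $00))); // starts with a UTF-16 LE BOM
  Writeln(DescribeBom(TBytes.Create($FE, $FF, $00, $41))); // starts with a UTF-16 BE BOM
  Writeln(DescribeBom(TBytes.Create($41, $42)));           // no BOM at all
end.
```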

Using the New Encodings

You may want to read or write text data using the TStrings class in a Unicode format, such as Little-Endian UTF-16, Big-Endian UTF-16, UTF-8, or UTF-7. To do so, pass a TEncoding instance to the method that reads or writes the data. The TEncoding class is very similar in methods and functionality to the System.Text.Encoding class in the .NET Framework.

  var
    S: TStrings;
  begin
    S := TStringList.Create();
    { ... }
    S.SaveToFile('config.txt', TEncoding.UTF8);
    S.Free;
  end;

Without the extra TEncoding.UTF8 parameter, 'config.txt' would simply be converted and written out as ANSI-encoded based on the current active code page. You do not need to change the read code, because TStrings automatically detects the encoding based on the BOM and acts accordingly.
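
A small round trip illustrates that the read side needs no encoding argument. This is a sketch under stated assumptions: the program name and the file name roundtrip.txt are made up for illustration, and the scoped unit names assume XE2 or later (use SysUtils and Classes on earlier versions).

```delphi
program Utf8RoundTrip;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

var
  Writer, Reader: TStrings;
begin
  Writer := TStringList.Create;
  try
    Reader := TStringList.Create;
    try
      // 'Gr' + u-umlaut + sharp s + 'e', written with character codes so the
      // example does not depend on the encoding of the source file itself.
      Writer.Add('Gr'#$00FC#$00DF'e');

      // Write as UTF-8; a BOM is written at the start of the file.
      Writer.SaveToFile('roundtrip.txt', TEncoding.UTF8);

      // No encoding argument: the BOM tells TStrings how to decode the file.
      Reader.LoadFromFile('roundtrip.txt');

      // Compare what was read with what was written.
      if Reader[0] = Writer[0] then
        Writeln('Text survived the round trip')
      else
        Writeln('Text was altered');
    finally
      Reader.Free;
    end;
  finally
    Writer.Free;
  end;
end.
```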

If you want to force a file to be read and written using a specific code page, which might not match the user's active code page, create an instance of TMBCSEncoding and pass the code page you want to use into the constructor. Then use that instance to read and write the file.
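
As an illustration of the TMBCSEncoding approach, the sketch below writes and re-reads a file using code page 1253 (Windows Greek) regardless of the active code page. The program name, the file name greek.txt, and the choice of code page are assumptions made for the example; it also assumes a Windows target and XE2 or later unit names.

```delphi
program ForceCodePage;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

var
  Lines: TStrings;
  Greek: TEncoding;
begin
  // An encoding created by the caller is also freed by the caller.
  Greek := TMBCSEncoding.Create(1253); // 1253 = Windows Greek
  try
    Lines := TStringList.Create;
    try
      Lines.Add(#$03B1#$03B2#$03B3); // alpha, beta, gamma

      // Write with the explicit code page instead of the active one.
      Lines.SaveToFile('greek.txt', Greek);

      // Read back with the same code page; no BOM is involved here,
      // so the encoding has to be supplied again.
      Lines.Clear;
      Lines.LoadFromFile('greek.txt', Greek);
      Writeln('Lines read: ', Lines.Count);
    finally
      Lines.Free;
    end;
  finally
    Greek.Free;
  end;
end.
```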

The same holds true for the classes that work with INI files: the data is read and written as ANSI data. Because INI files have traditionally been ANSI (ASCII) encoded, it might not make sense to convert them, depending on the needs of your application. If you do want to use a Unicode format, you can use the TEncoding class to accomplish that as well.
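
One possible way to store an INI file in a Unicode format is the TMemIniFile constructor overload that takes a TEncoding. This is a sketch that assumes your version provides that overload; the program name, the file name settings.ini, and the section and key names are hypothetical.

```delphi
program IniUtf8;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IniFiles;

var
  Ini: TMemIniFile;
begin
  // The encoding passed here is used when the file is loaded and saved.
  Ini := TMemIniFile.Create('settings.ini', TEncoding.UTF8);
  try
    // 'Zo' + e-diaeresis: a value that plain ASCII cannot represent.
    Ini.WriteString('User', 'Name', 'Zo'#$00EB);

    // TMemIniFile keeps changes in memory until UpdateFile is called.
    Ini.UpdateFile;
  finally
    Ini.Free;
  end;
end.
```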

In all the above cases, the internal storage is Unicode, and any data manipulation you do with strings should continue to function as expected. Conversions automatically happen when reading and writing the data.

For the code page values you can use, see the list of code page identifiers on MSDN.

The following list shows the overloaded methods that accept a TEncoding parameter:

See Also