How to fix these two warnings about implicit string cast during charset conversion?


I wrote a tool that merges several small text files. The files can be ANSI (Latin1) or UTF-8, with or without a BOM. For files with a BOM, Delphi detects the charset correctly, but for files without a BOM I have to resort to some hackery to detect it (see GetFileCharset).

In the following Delphi code, I get two warnings (see the comments at the end of the affected lines):

uses
    WideStrUtils;
    

function GetFileCharset(const Filename: String): TEncoding;
var
    StreamReader: TStreamReader;
    FallbackEncoding: TEncoding;
    CurrLine: String;

begin
    FallbackEncoding := TEncoding.ANSI;

    try
        StreamReader := TStreamReader.Create(Filename, FallbackEncoding, True);
        try
            Result := StreamReader.CurrentEncoding;

            if StreamReader.CurrentEncoding = FallbackEncoding then
            begin
                while not StreamReader.EndOfStream do
                begin
                    CurrLine := StreamReader.ReadLine;
                    if IsUTF8String(CurrLine) then //[dcc32 Warning]: W1058 Implicit string cast with potential data loss from 'string' to 'RawByteString'
                    begin
                        Result := TEncoding.UTF8;
                        break;
                    end;
                end;
            end;
        finally
            StreamReader.Close;
            StreamReader.Free;
        end;
    except on E : Exception do
        Result := FallbackEncoding;
    end;
end;


StreamWriter := TStreamWriter.Create(OutputFile, False, TEncoding.UTF8);
try
    StreamReader := TStreamReader.Create(InputFile, GetFileCharset(CurrFile), True);
    try
        while not StreamReader.EndOfStream do
            StreamWriter.WriteLine(UTF8Encode(StreamReader.ReadLine)); //[dcc32 Warning]: W1057 Implicit string cast from 'RawByteString' to 'string'
    finally
        StreamReader.Close;
        StreamReader.Free;
    end;
finally
    StreamWriter.Close;
    StreamWriter.Free;
end;

#1 For the Implicit string cast warning I can easily do:

StreamWriter.WriteLine(String(UTF8Encode(StreamReader.ReadLine)));

but I'm wondering if there is a better way or if there is potential danger here?

#2 For the Implicit string cast with potential data loss, I'm not sure how to safely fix this.

#3 Is there a better way to detect the file charset than what I did?

There is 1 answer

Answer by Remy Lebeau

For the 1st warning (W1058, on the IsUTF8String() call):

StreamReader.ReadLine() returns a UTF-16 UnicodeString that has already been charset-decoded using the reader's assigned Encoding (which in your case will always be TEncoding.ANSI). So any encoding details about a line have already been lost before you ever see that line.

IsUTF8String() takes in a RawByteString and returns whether its 8-bit characters are encoded in UTF-8 or not. It is useless for a 16-bit UnicodeString.

You are getting an implicit conversion when calling IsUTF8String() with a 16-bit string instead of an 8-bit string. IsUTF8String() will not return True in your situation, as a UnicodeString-to-RawByteString conversion will not produce a UTF-8 string (unless you manually set System.DefaultSystemCodePage to 65001 aka CP_UTF8 beforehand). So, you can simply get rid of this test altogether from your code.

For what you are attempting, you need to analyze the raw bytes of the file, not the decoded characters from StreamReader.ReadLine(). Unfortunately, the RTL does not provide a class that can read a file line-by-line as 8-bit strings, so you are going to have to read and parse the file bytes yourself.

Also, ASCII is a subset of both ANSI and UTF-8, and ASCII characters are encoded exactly the same in both, so you would need to analyze the whole file (or at least read until you encounter a non-ASCII byte) in order to determine the actual charset. Even then, TEncoding.ANSI only matches the OS user's default locale, so you may still fail to detect the file's real encoding if it uses a different non-UTF locale.
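If the non-UTF-8 files are known to come from one specific code page, the dependency on the user's locale can be avoided by requesting that code page explicitly instead of using TEncoding.ANSI. The following is only a sketch: it assumes the files are Windows-1252, and ReadAllAsCp1252 is a hypothetical helper name that does not appear in the question.

```pascal
uses
    SysUtils, Classes;

// Hypothetical helper: reads a whole file using an explicit fallback
// code page (assumed here to be Windows-1252) rather than the OS default.
function ReadAllAsCp1252(const FileName: string): string;
var
    Cp1252: TEncoding;
    Reader: TStreamReader;
begin
    // Encodings obtained from GetEncoding() are owned by the caller
    // and must be freed once they are no longer needed.
    Cp1252 := TEncoding.GetEncoding(1252);
    try
        // DetectBOM = True still lets a BOM override the fallback.
        Reader := TStreamReader.Create(FileName, Cp1252, True);
        try
            Result := Reader.ReadToEnd;
        finally
            Reader.Free;
        end;
    finally
        Cp1252.Free;
    end;
end;
```

This does not detect anything by itself; it only makes the fallback deterministic across machines with different locales.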

You are best off using a pre-existing 3rd party library that handles this kind of detection for you.
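If a full detection library is more than the tool needs, the byte-level check described above might look roughly like this. It is a minimal sketch, not a robust detector: GuessFileEncoding is a hypothetical name, the whole file is loaded into memory (acceptable for small files), and anything that is not valid UTF-8 is simply assumed to be ANSI.

```pascal
uses
    SysUtils, Classes, IOUtils, WideStrUtils;

// Hypothetical helper: guesses a file's encoding from its raw bytes.
function GuessFileEncoding(const FileName: string): TEncoding;
var
    Bytes: TBytes;
    Raw: RawByteString;
    Detected: TEncoding;
begin
    Bytes := TFile.ReadAllBytes(FileName);
    if Length(Bytes) = 0 then
        Exit(TEncoding.ANSI);

    // Let the RTL recognise a BOM first. With Detected = nil the call
    // returns the BOM length, or 0 when no known BOM is present.
    Detected := nil;
    if TEncoding.GetBufferEncoding(Bytes, Detected) > 0 then
        Exit(Detected);

    // Copy the undecoded bytes into an 8-bit string so that no
    // UnicodeString conversion happens before the check.
    SetLength(Raw, Length(Bytes));
    Move(Bytes[0], Raw[1], Length(Bytes));

    // etUSASCII: only 7-bit bytes, so ANSI and UTF-8 decode identically.
    // etUTF8:    at least one valid multi-byte UTF-8 sequence was found.
    // etANSI:    high bytes that do not form valid UTF-8.
    if DetectUTF8Encoding(Raw) = etUTF8 then
        Result := TEncoding.UTF8
    else
        Result := TEncoding.ANSI; // assumption: everything else is ANSI
end;
```

Because the argument is already a RawByteString, this call compiles without the W1058 warning. The result can be passed to TStreamReader.Create() in place of GetFileCharset().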


For the 2nd warning (W1057, on the WriteLine() call):

StreamWriter.WriteLine() takes in a UTF-16 UnicodeString as input, but you are passing it a UTF-8 RawByteString instead. Your workaround just makes that conversion explicit, but doesn't change the outcome. While this is a lossless conversion, you don't actually need the UTF8Encode() at all. Just give StreamWriter the UnicodeString as-is and let it handle the conversion to UTF-8 for you. After all, that is what you asked it to do by passing TEncoding.UTF8 to its constructor.
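As a sketch of that change, the copy loop from the question could be written without UTF8Encode() as shown here. CopyLines and its parameters are hypothetical names introduced for illustration; the writer is assumed to have been created with TEncoding.UTF8, as in the question.

```pascal
uses
    SysUtils, Classes;

// Hypothetical helper: copies InputFile line by line into Writer.
// Writer is assumed to have been created with TEncoding.UTF8.
procedure CopyLines(const InputFile: string; InputEncoding: TEncoding;
    Writer: TStreamWriter);
var
    Reader: TStreamReader;
begin
    Reader := TStreamReader.Create(InputFile, InputEncoding, True);
    try
        while not Reader.EndOfStream do
            // ReadLine returns a UnicodeString; the writer encodes it
            // to UTF-8 itself, so no cast and no W1057 warning.
            Writer.WriteLine(Reader.ReadLine);
    finally
        Reader.Free; // the destructor also closes the underlying file
    end;
end;
```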