With Delphi 6/7, how can I convert an AnsiString in a different CharSet, to hex String UTF-8?

I need to draw a barcode (QR) with Delphi 6/7. The program can run in various Windows locales, and the data comes from an input box.

In this input box the user can choose a charset and enter text in their own language. This works fine. The input data is only ever from a single codepage. Example configurations could be:

  • Windows is on Western Europe, Codepage 1252 for ANSI text
  • Input is done in Shift-JIS ANSI charset

I need to get the Shift-JIS across to the barcode. The most robust way is to use hex encoding.

So my question is: how do I go from Shift-JIS to a hex String in UTF-8 encoding, if the codepage is not the same as the Windows locale?

As an example: I have the string 能ラ. This needs to be converted to E883BDE383A9 as per UTF-8. I have tried this, but the result is different and meaningless:

String2Hex(UTF8Encode(ftext))

Unfortunately I can't just have an inputbox for WideStrings. But if I can find a way to convert the ANSI text to a WideString, the barcode module can work with Unicode Strings as well.

If it's relevant: I am using the TEC-IT TBarcode DLL.

Answer by AmigoJack (best answer)

Creating and accessing a Unicode text control

This is easier than you may think. I did it back when Windows 2000 was brand new and convenient components like the Tnt Delphi Unicode Controls were not yet available. Background knowledge of how to create a Windows GUI program without Delphi's VCL, creating everything manually, helps. Otherwise, treat this as an introduction to it.

  1. First add a private field to your form, so we can access the new control easily later:

    type
      TForm1= class(TForm)
    ...
      private
        hEdit: THandle;  // Our new Unicode control
      end;
    
  2. Now just create it at your favorite event - I chose FormCreate:

      // Creating a child control, type "edit"
      self.hEdit:= CreateWindowW( PWideChar(WideString('edit')), PWideChar(WideString('myinput')), WS_CHILD or WS_VISIBLE, 10, 10, 200, 25, Handle, 0, HINSTANCE, nil );
      if self.hEdit= 0 then begin  // Failed. Get error code so we know why it failed.
        //GetLastError();
        exit;
      end;
    
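      // Add a sunken 3D edge (well, historically speaking). SetWindowLong returns the
      // previous value, which can legitimately be 0, so also check the error code.
      SetLastError( 0 );
      if (SetWindowLong( self.hEdit, GWL_EXSTYLE, WS_EX_CLIENTEDGE )= 0) and (GetLastError()<> 0) then begin
        //GetLastError();
        exit;
      end;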
    
      // Applying new extended style: the control's frame has changed
      if not SetWindowPos( self.hEdit, 0, 0, 0, 0, 0, SWP_FRAMECHANGED or SWP_NOMOVE or SWP_NOZORDER or SWP_NOSIZE ) then begin
        //GetLastError();
        exit;
      end;
    
      // The system's default font is no help, let's use this form's font (hopefully Tahoma)
      SendMessage( self.hEdit, WM_SETFONT, self.Font.Handle, 1 );
    
  3. At some point you want to get the edit's content. Again: how is this done without Delphi's VCL but instead directly with the WinAPI? This time I used a button's Click event:

    var
      sText: WideString;
      iLen, iError: Integer;
    begin
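      // How many CHARACTERS to copy?
      SetLastError( 0 );  // A result of 0 is ambiguous, so clear any stale error code first
      iLen:= GetWindowTextLengthW( self.hEdit );
      if iLen= 0 then iError:= GetLastError() else iError:= 0;  // Could be empty, could be an error
      if iError<> 0 then begin
        exit;
      end;
    
      Inc( iLen );  // For the trailing #0
      SetLength( sText, iLen );  // Reserve space
      iLen:= GetWindowTextW( self.hEdit, @sText[1], iLen );  // Copy text; returns the characters copied, without #0
      if iLen= 0 then begin  // Empty or failed
        //GetLastError();
        exit;
      end;
      SetLength( sText, iLen );  // Trim the reserved trailing #0 so Length( sText ) is correct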
    
      // Demonstrate that non-ANSI text was copied out of a non-ANSI control
      MessageBoxW( Handle, PWideChar(sText), nil, 0 );
    end;
    

There are detail issues, like not being able to reach this new control via Tab, but we're already basically re-inventing Delphi's VCL, so those are details to take care of another time.

Converting codepages

The WinAPI deals either in codepages (Strings) or in UTF-16 LE (WideStrings). UTF-16 LE (historically UCS-2) can represent every character, so it is always the implied target when coming from codepages:

// Converting an ANSI charset (String) to UTF-16 LE (Widestring)
function StringToWideString( s: AnsiString; iSrcCodePage: DWord ): WideString;
var
  iLenDest, iLenSrc: Integer;
begin
  iLenSrc:= Length( s );
  iLenDest:= MultiByteToWideChar( iSrcCodePage, 0, PChar(s), iLenSrc, nil, 0 );  // How many CHARACTERS are needed?
  SetLength( result, iLenDest );
  if iLenDest> 0 then begin  // Otherwise we get the error ERROR_INVALID_PARAMETER
    if MultiByteToWideChar( iSrcCodePage, 0, PChar(s), iLenSrc, PWideChar(result), iLenDest )= 0 then begin
      //GetLastError();
      result:= '';
    end;
  end;
end;

The source codepage is up to you, for example:

  • 1252 for "Windows-1252" = ANSI Latin 1 Multilingual (Western Europe)
  • 932 for "Shift-JIS X-0208" = IBM-PC Japan MIX (DOS/V) (DBCS) (897 + 301)
  • 28595 for "ISO 8859-5" = Cyrillic
  • 65001 for "UTF-8"

However, if you want to convert from one codepage to another, and neither source nor target is UTF-16 LE, then you must take a detour via UTF-16 LE:

  1. Convert from ANSI to WIDE
  2. Convert from WIDE to a different ANSI

// Converting UTF-16 LE (Widestring) to an ANSI charset (String, hopefully you want 65001=UTF-8)
function WideStringToString( s: WideString; iDestCodePage: DWord= CP_UTF8 ): AnsiString;
var
  iLenDest, iLenSrc: Integer;
begin
  iLenSrc:= Length( s );
  iLenDest:= WideCharToMultiByte( iDestCodePage, 0, PWideChar(s), iLenSrc, nil, 0, nil, nil );
  SetLength( result, iLenDest );
  if iLenDest> 0 then begin  // Otherwise we get the error ERROR_INVALID_PARAMETER
    if WideCharToMultiByte( iDestCodePage, 0, PWideChar(s), iLenSrc, PChar(result), iLenDest, nil, nil )= 0 then begin
      //GetLastError();
      result:= '';
    end;
  end;
end;

Not every Windows installation supports every codepage, and different installations support different sets, so conversion attempts may fail. It would be more robust to aim for a Unicode program right away, as that is what every Windows installation definitely supports (unless you still deal with Windows 95, Windows 98 or Windows ME).
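
Whether a given codepage is usable on the current machine can be checked before converting. A minimal sketch, assuming the four codepages listed earlier are the ones of interest: `IsValidCodePage` is a regular WinAPI function declared in Delphi's Windows unit, while the program name and the `CODEPAGES` array are made up for illustration.

```pascal
program CheckCodePages;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  // Illustrative selection only: the codepages mentioned above
  CODEPAGES: array[0..3] of UINT= ( 1252, 932, 28595, 65001 );
var
  i: Integer;
begin
  for i:= Low( CODEPAGES ) to High( CODEPAGES ) do begin
    if IsValidCodePage( CODEPAGES[i] ) then begin
      WriteLn( CODEPAGES[i], ': available' );
    end else begin
      WriteLn( CODEPAGES[i], ': not available, conversions will fail' );
    end;
  end;
end.
```

The output depends on the Windows installation, which is exactly the point: a conversion that works on the developer's machine may return an empty string elsewhere.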

Combining everything

Now you have everything you need to put it together:

  • you can have a Unicode text control to directly get it in UTF-16 LE
  • you can use an ANSI text control to then convert the input to UTF-16 LE
  • you can convert from UTF-16 LE (WIDE) to UTF-8 (ANSI)
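
Put together for the question's example, the chain could look roughly like this. It is only a sketch: `CodePageToUtf8` and `BytesToHex` are hypothetical helpers written for this illustration (the question's own `String2Hex` is not shown, so it is not reused here), and the Shift-JIS bytes are hard-coded instead of coming from an input box.

```pascal
program SjisToUtf8Hex;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: codepage bytes -> UTF-16 LE -> UTF-8 bytes.
// Returns an empty string if either conversion step fails.
function CodePageToUtf8( const s: AnsiString; iSrcCodePage: DWord ): AnsiString;
var
  sWide: WideString;
  iLen: Integer;
begin
  result:= '';
  if s= '' then exit;

  // Step 1: source codepage -> UTF-16 LE
  iLen:= MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(s), Length( s ), nil, 0 );
  if iLen= 0 then exit;
  SetLength( sWide, iLen );
  if MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(s), Length( s ), PWideChar(sWide), iLen )= 0 then exit;

  // Step 2: UTF-16 LE -> UTF-8
  iLen:= WideCharToMultiByte( CP_UTF8, 0, PWideChar(sWide), Length( sWide ), nil, 0, nil, nil );
  if iLen= 0 then exit;
  SetLength( result, iLen );
  if WideCharToMultiByte( CP_UTF8, 0, PWideChar(sWide), Length( sWide ), PAnsiChar(result), iLen, nil, nil )= 0 then begin
    result:= '';
  end;
end;

// Hypothetical helper: every byte becomes two uppercase hex digits
function BytesToHex( const s: AnsiString ): AnsiString;
var
  i: Integer;
begin
  result:= '';
  for i:= 1 to Length( s ) do begin
    result:= result+ IntToHex( Ord( s[i] ), 2 );
  end;
end;

var
  sSjis: AnsiString;
begin
  sSjis:= #$94#$5C#$83#$89;  // The question's two characters as Shift-JIS bytes
  // Should print E883BDE383A9, provided codepage 932 is available
  WriteLn( BytesToHex( CodePageToUtf8( sSjis, 932 ) ) );
end.
```

The important detail is that the source codepage is passed explicitly (932 here) instead of relying on the Windows locale, which is what an implicit String to WideString conversion would do.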

Size

UTF-8 is mostly the best choice, but size-wise UTF-16 may need fewer bytes in total when your target audience is Asian: in UTF-8 both 能 and ラ need 3 bytes each, but in UTF-16 both only need 2 bytes each. For your QR barcode, size is an important factor, I guess.
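
The difference is easy to measure. A small sketch using the two characters from the question, built from their code points so the source file's own encoding does not matter. `UTF8Encode` is the RTL function the question already uses; the program and variable names are illustrative.

```pascal
program Utf8VsUtf16Size;
{$APPTYPE CONSOLE}

var
  sWide: WideString;
  sUtf8: AnsiString;
begin
  // U+80FD and U+30E9, the two characters from the question
  sWide:= WideChar( $80FD );
  sWide:= sWide+ WideChar( $30E9 );

  sUtf8:= UTF8Encode( sWide );  // Safe here because the input really is UTF-16

  WriteLn( 'UTF-8:  ', Length( sUtf8 ), ' bytes' );                       // 6 bytes
  WriteLn( 'UTF-16: ', Length( sWide )* SizeOf( WideChar ), ' bytes' );  // 4 bytes
end.
```

For text that is mostly ASCII the picture flips: UTF-8 then needs 1 byte per character where UTF-16 still needs 2.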

Likewise, don't waste space by turning binary data (8 bits per byte) into hex text (each character carries only 4 bits, but itself occupies 1 byte = 8 bits again). Have a look at Base64, which encodes 6 bits into every byte. You have encountered this concept countless times already, because it's used for email attachments.
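
To make the saving concrete, here is a sketch comparing both encodings for the six UTF-8 bytes of the question's example. `BytesToHex` and `BytesToBase64` are hypothetical helpers written out by hand for this illustration. Whether the barcode's consumer can decode Base64 is an assumption you would have to verify.

```pascal
program HexVsBase64;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: 4 bits per output character
function BytesToHex( const s: AnsiString ): AnsiString;
var
  i: Integer;
begin
  result:= '';
  for i:= 1 to Length( s ) do begin
    result:= result+ IntToHex( Ord( s[i] ), 2 );
  end;
end;

// Hypothetical helper: 6 bits per output character, standard alphabet, '=' padding
function BytesToBase64( const s: AnsiString ): AnsiString;
const
  ALPHABET: AnsiString= 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
var
  i, iLen: Integer;
  b0, b1, b2: Byte;
begin
  result:= '';
  iLen:= Length( s );
  i:= 1;
  while i<= iLen do begin
    // Take up to 3 input bytes; missing ones count as 0
    b0:= Ord( s[i] );
    if i+ 1<= iLen then b1:= Ord( s[i+ 1] ) else b1:= 0;
    if i+ 2<= iLen then b2:= Ord( s[i+ 2] ) else b2:= 0;

    // Split the 24 bits into four 6 bit groups (the string is 1-based, hence +1)
    result:= result+ ALPHABET[(b0 shr 2)+ 1];
    result:= result+ ALPHABET[(((b0 and 3) shl 4) or (b1 shr 4))+ 1];
    if i+ 1<= iLen then result:= result+ ALPHABET[(((b1 and 15) shl 2) or (b2 shr 6))+ 1] else result:= result+ '=';
    if i+ 2<= iLen then result:= result+ ALPHABET[(b2 and 63)+ 1] else result:= result+ '=';

    Inc( i, 3 );
  end;
end;

var
  sUtf8: AnsiString;
begin
  sUtf8:= #$E8#$83#$BD#$E3#$83#$A9;  // The question's example as UTF-8 bytes
  WriteLn( BytesToHex( sUtf8 ), ' (', Length( BytesToHex( sUtf8 ) ), ' characters)' );        // 12 characters
  WriteLn( BytesToBase64( sUtf8 ), ' (', Length( BytesToBase64( sUtf8 ) ), ' characters)' );  // 8 characters
end.
```

Hex always doubles the size, while Base64 grows it by about a third, which matters when every byte enlarges the barcode.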