Getting char value in Delphi 7


I am making a program in Delphi 7 that is supposed to encode a Unicode string into an HTML entity string. For example, "ABCģķī" would result in "ABC&#291;&#311;&#299;".

Now 2 basic things:

  1. Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
  2. Codepages consist of 256 entries (0-255), each holding a character specific to that codepage, except the first 128 (0-127), which are the same for all the codepages.

So, how do I get the value of a char that is in the 1-255 range?

I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin Unicode.

Is there any other method for returning char value? Any help appreciated

There are 3 answers

2
Free Consulting On

In case I understood the OP correctly, I'll just leave this here.

function Entities(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    // Code units above 255 become decimal numeric character references
    if Word(S[I]) > Word(High(AnsiChar)) then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'
    else
      // Code units 0..255 are appended without a reference
      Result := Result + S[I];
  end;
end;
0
Remy Lebeau On

In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" instead. Most other charsets do not support those particular Unicode characters.
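The UTF-8 byte values quoted above can be checked with a small console program. This is only a sketch: `Utf8ByteList` is a hypothetical helper name, and the code assumes the `UTF8Encode()` function from the System unit.

```pascal
program Utf8BytesSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the UTF-8 bytes of W as space-separated decimals.
function Utf8ByteList(const W: WideString): string;
var
  U: UTF8String;
  I: Integer;
begin
  U := UTF8Encode(W);
  Result := '';
  for I := 1 to Length(U) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToStr(Ord(U[I]));
  end;
end;

begin
  // U+0123, U+0137, U+012B written as character constants,
  // because Delphi 7 source files are not Unicode.
  Writeln(Utf8ByteList(#$0123#$0137#$012B)); // prints 196 163 196 183 196 171
end.
```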

In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x0123;&#x0137;&#x012B;".

So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:

  1. If you need HTML 4:

    A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

    B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

  2. If you need HTML 5:

    A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

    B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
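The HTML 5 branch of the steps above could be sketched roughly as follows. `EncodeHtml5Entities` is a hypothetical name, and two things here are assumptions rather than part of the answer: the "reserved" set is taken to be everything above 127 plus `&`, `<`, `>` and `"`, and surrogate pairs are combined by hand so the sketch does not depend on how a particular RTL version implements `WideStringToUCS4String()`.

```pascal
program Html5EntitiesSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: encodes a UTF-16 WideString as ASCII text with
// decimal numeric character references (HTML 5 semantics).
function EncodeHtml5Entities(const S: WideString): AnsiString;
var
  I, CP, Lo: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    CP := Ord(S[I]);
    // Combine a high surrogate with the low surrogate that follows it.
    if (CP >= $D800) and (CP <= $DBFF) and (I < Length(S)) then
    begin
      Lo := Ord(S[I + 1]);
      if (Lo >= $DC00) and (Lo <= $DFFF) then
      begin
        CP := $10000 + ((CP - $D800) shl 10) + (Lo - $DC00);
        Inc(I); // the low surrogate has been consumed
      end;
    end;
    // Assumed reserved set: non-ASCII codepoints and basic markup characters.
    if (CP > 127) or (CP = Ord('&')) or (CP = Ord('<')) or
       (CP = Ord('>')) or (CP = Ord('"')) then
      Result := Result + '&#' + IntToStr(CP) + ';'
    else
      Result := Result + AnsiChar(CP);
    Inc(I);
  end;
end;

begin
  Writeln(EncodeHtml5Entities('ABC'#$0123#$0137#$012B)); // prints ABC&#291;&#311;&#299;
end.
```

For hex notation, `IntToHex(CP, 4)` with an `&#x` prefix could replace the `IntToStr()` call.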

4
Roddy On

I suggest you avoid codepages like the plague.

There are two approaches for Unicode that I'd consider: WideString, and UTF-8.

WideStrings have the advantage of being 'native' to Windows, which helps if you need to use Windows API calls. The disadvantages are storage space and that, like UTF-8, they can require multiple code units (WideChars) to encode the full Unicode space.

UTF-8 is generally preferable. Like WideString, it is a multi-unit encoding, so a particular Unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.

@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.
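To get a feel for the size trade-off, the two encodings can be compared directly. This is a rough sketch with hypothetical helper names (`Utf16Bytes`, `Utf8Bytes`) and an arbitrary two-character CJK sample (U+4E2D, U+6587); it assumes `UTF8Encode()` from the System unit.

```pascal
program EncodingSizesSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: storage needed for the characters of a WideString.
function Utf16Bytes(const W: WideString): Integer;
begin
  Result := Length(W) * SizeOf(WideChar);
end;

// Hypothetical helper: storage needed for the same text encoded as UTF-8.
function Utf8Bytes(const W: WideString): Integer;
begin
  Result := Length(UTF8Encode(W));
end;

procedure Report(const W: WideString);
begin
  Writeln('UTF-16: ', Utf16Bytes(W), ' bytes, UTF-8: ', Utf8Bytes(W), ' bytes');
end;

begin
  Report(#$4E2D#$6587);            // bare CJK text: UTF-16 is smaller
  Report('<p>'#$4E2D#$6587'</p>'); // same text with ASCII markup: UTF-8 is smaller
end.
```

With this sample, the bare text takes 4 bytes in UTF-16 against 6 in UTF-8, but wrapping it in a `<p>` element tips the balance to 18 against 13.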