Encoding 'utf-16' is not consistent when converting a Lisp string to/from a C string

I find that when I use 'utf-16' as the encoding to convert a Lisp string to a C string with cffi, the encoding actually used is 'utf-16le'. But when I convert the C string back to a Lisp string, the encoding actually used is 'utf-16be'. Since I'm not yet familiar with 'babel' (which provides the encoding facility for 'cffi'), I'm not sure whether that's a bug.

(defun convtest (str to-c from-c)
  (multiple-value-bind (ptr size)
      (cffi:foreign-string-alloc str :encoding to-c)
    (declare (ignore size))
    (prog1
        (cffi:foreign-string-to-lisp ptr :encoding from-c)
      (cffi:foreign-string-free ptr))))

(convtest "hello" :utf-16   :utf-16)     ;=> garbage string
(convtest "hello" :utf-16   :utf-16le)   ;=> "hello"
(convtest "hello" :utf-16   :utf-16be)   ;=> garbage string
(convtest "hello" :utf-16le :utf-16be)   ;=> garbage string
(convtest "hello" :utf-16le :utf-16le)   ;=> "hello"

`convtest' converts a Lisp string to a C string and then back to a Lisp string, using `to-c' and `from-c' as the respective encodings. All the garbage strings in the output are identical. The test shows that when 'utf-16' is used for both `to-c' and `from-c', the round trip fails.

Answer by Rainer Joswig

Here, the conversion to C (to-c) defaults to little-endian (le), while the conversion from C (from-c) defaults to big-endian (be).

The platform itself (x86) is little-endian. UTF-16, on the other hand, prefers big-endian unless a byte-order mark says otherwise.
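To see why a mismatch in byte order produces the same garbage every time, here is a minimal sketch in plain Common Lisp. The helpers `utf16-octets` and `utf16-code-points` are made up for illustration and are not part of CFFI or Babel; they handle only characters below U+10000 (no surrogate pairs), which is enough for "hello".

```lisp
;; Illustrative helpers only; not CFFI or Babel functions.
;; Limited to characters that fit in one 16-bit code unit.

(defun utf16-octets (string endianness)
  "Encode STRING as a list of octets. ENDIANNESS is :le or :be."
  (loop for ch across string
        for code = (char-code ch)
        for hi = (ldb (byte 8 8) code)   ; high byte of the code unit
        for lo = (ldb (byte 8 0) code)   ; low byte of the code unit
        append (ecase endianness
                 (:le (list lo hi))
                 (:be (list hi lo)))))

(defun utf16-code-points (octets endianness)
  "Decode a list of OCTETS (even length) into a list of 16-bit code units."
  (loop for (a b) on octets by #'cddr
        collect (ecase endianness
                  (:le (+ a (ash b 8)))
                  (:be (+ (ash a 8) b)))))

(utf16-octets "hello" :le)
;; => (104 0 101 0 108 0 108 0 111 0)

;; Written as little-endian, read back as big-endian:
(utf16-code-points (utf16-octets "hello" :le) :be)
;; => (26624 25856 27648 27648 28416)
;;    that is, U+6800 U+6500 U+6C00 U+6C00 U+6F00
```

Every byte pair gets swapped, so "hello" always turns into the same five unrelated characters, which would match the identical garbage strings seen in the question.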

This probably depends on the platform you are running on; different platforms seem to have different defaults.

It is best to look into the source code to see why those encodings are chosen. You could also ask on the CFFI mailing list about the encoding choices and whether they depend on the platform at all.
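Alongside reading the source, it may help to dump the bytes that CFFI actually writes for each encoding name. The following is a rough sketch, not a definitive diagnostic: `foreign-string-octets` is a hypothetical helper, it assumes Quicklisp is available for loading CFFI, and it assumes the second value returned by `cffi:foreign-string-alloc` is the size in bytes including the terminator.

```lisp
;; Assumes Quicklisp is installed; any other way of loading CFFI works too.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cffi))

(defun foreign-string-octets (str encoding)
  "Hypothetical helper: list the octets CFFI writes for STR in ENCODING."
  (multiple-value-bind (ptr size)
      (cffi:foreign-string-alloc str :encoding encoding)
    (unwind-protect
         ;; SIZE is assumed to be the byte count, terminator included.
         (loop for i below size
               collect (cffi:mem-aref ptr :uint8 i))
      ;; Free the foreign memory even if reading it signals an error.
      (cffi:foreign-string-free ptr))))

;; Compare the three results on your own platform:
(foreign-string-octets "hi" :utf-16le)
(foreign-string-octets "hi" :utf-16be)
(foreign-string-octets "hi" :utf-16)
```

If the bytes for :utf-16 match those for :utf-16le and no byte-order mark is present, a decoder that assumes big-endian in the absence of a mark would produce exactly the mismatch described in the question. Until the defaults are clarified, naming :utf-16le or :utf-16be explicitly on both sides avoids the ambiguity.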