I want to write a file with UTF-8 encoding containing the character 10001100, which is Œ, the Latin capital ligature OE in the extended ASCII table:
zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)
When I open the file with Office (encoding = UTF-8), I can see Œ.
What I cannot do is read it back with readBin:
zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)
There are multiple difficulties here.

Firstly, "extended ASCII" is not a single table but a family of them. On Windows you are most likely using CP1252, also called Windows-1252 or ANSI, the Win default "latin" encoding. However, the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100 or "\x8c", as you wrote. However, it does not exist in ISO-8859-1. And in UTF-8 it is encoded as the two bytes "\xc5\x92" (the code point U+0152, written "\u0152" in R), as rlegendi indicated.

So, to write UTF-8 from CP1252-as-binary-as-string, you have to convert your string into a "raw" number (the R class for bytes) and then into a character, change its encoding from CP1252 to UTF-8 (in fact, convert its byte value to the corresponding one for the same character in UTF-8), then convert it back to raw, and finally write it to the file.

Secondly, when you call readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here, test.file being the path of the file you wrote); otherwise it reads only the first byte (see below):

zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)
Thirdly, if in the end you want to turn it back into a character, correctly understood and displayed by R, you first have to convert it into a string with rawToChar(). Now, the way it will be displayed depends on your default encoding; see Sys.getlocale() to find out what it is (probably something ending with 1252 on Windows). The best is probably to specify that your character should be read as UTF-8; otherwise it will be understood with your default encoding.

xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"
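As a quick check of this last step, the sketch below starts from the two bytes expected in the file (hard-coded here as an assumption, so the snippet runs on its own) and verifies that, once marked as UTF-8, they decode to the code point U+0152.

```r
# The two bytes expected from the file, hard-coded for this sketch
x <- as.raw(c(0xc5, 0x92))

# Bytes -> string, then declare how those bytes should be interpreted
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"

Encoding(xx)
# [1] "UTF-8"

# Decode to the Unicode code point: 338 is 0x152, i.e. "Œ"
utf8ToInt(xx)
# [1] 338
```

Note that Encoding<- only sets a mark on the string; it does not change the bytes. Converting between encodings is the job of iconv().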
This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.

PS: I am not exactly sure why in your code x returned c5, and I guess it would have returned c5 92 if you had set n=2 (or more) as a parameter to readBin(). On my machines (Mac OS X 10.7 with R 3.0.2, and Win XP with R 2.15) it returns 31, the hex ASCII representation of '1' (the first char in '10001100', which makes sense), with your code. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there, before coming back to R?