Control encoding when parsing SPSS file using package memisc


I have been given an SPSS system file that I would like to analyse using R. I am using the following code to parse the file into R.

library(memisc)
foo <- spss.system.file("foobar.sav")
bar <- subset(foo, select=c(var1,var2,var3))

Looking at the parsed data, I get the following:

> bar
Data set with 379 observations and 3 variables

var1       var2        var3
1      gut    weiblich      Herbst
2      gut mnlich      Sommer
3      gut mnlich      Sommer
4      gut mnlich      Winter
5      gut mnlich Fr�hling
6      gut mnlich Fr�hling
7      gut    weiblich Fr�hling
.
.
.
25      gut    weiblich Fr�hling
.. ........ ........... ...........
(27 of 379 observations shown)

I guess you get the idea: the umlauts are garbled. I am relatively sure that the .sav file was saved using the latin1 encoding. How can I tell spss.system.file() to use this encoding when parsing the SPSS file?

3

There are 3 answers

0
Thomas Möbius On BEST ANSWER

Thank you, everyone, for your help. I am answering my own question. spss.system.file() reads the strings contained in SPSS files as-is, without any translation, so the resulting strings carry no encoding information. However, the memisc package provides a function, Iconv, that does what the Unix tool iconv would do: it converts the strings from one encoding to another.

> library(memisc)
> foo <- spss.system.file("foobar.sav")
> foo <- Iconv(foo,from="Latin1",to="UTF-8")
> foo <- as.data.frame(as.data.set(foo))
> head(foo$Geschlecht)
[1] weiblich männlich männlich männlich männlich männlich
Levels: männlich weiblich
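
If the data has already been turned into a plain data frame, a similar conversion can be sketched with base R's iconv(). This is only an illustration, not part of memisc: the helper name reencode_df is made up here, and the demo column is built by hand from latin1 bytes rather than read from the actual file.

```r
# Hypothetical helper (not from memisc): re-encode every character column
# and every set of factor levels in a plain data.frame using base iconv().
reencode_df <- function(df, from = "latin1", to = "UTF-8") {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.factor(col)) {
      levels(df[[nm]]) <- iconv(levels(col), from = from, to = to)
    } else if (is.character(col)) {
      df[[nm]] <- iconv(col, from = from, to = to)
    }
  }
  df
}

# Demo data: "männlich" spelled out as latin1 bytes (0xe4 is a-umlaut).
raw_male <- rawToChar(as.raw(c(0x6d, 0xe4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68)))
demo <- data.frame(
  Geschlecht = factor(c(raw_male, "weiblich"), levels = c(raw_male, "weiblich"))
)

fixed <- reencode_df(demo)
levels(fixed$Geschlecht)
```

The factor levels are converted rather than the factor itself, so the underlying integer codes stay untouched.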

All the best.

1
Zoltan Fabian On

This problem could be specific to the memisc package. As a quick solution, if you do not need to stick with memisc, try the read.spss function from the foreign package. Also consider adding the memisc tag to your question.
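
A minimal sketch of that route, assuming the file really is latin1 as the question suspects. read.spss() takes a reencode argument that can name the encoding to assume for the file; the wrapper name read_sav_latin1 is made up for this example, and "foobar.sav" is the file name from the question.

```r
# Hypothetical wrapper around foreign::read.spss(); the encoding passed to
# `reencode` is an assumption about how the file was written.
read_sav_latin1 <- function(path) {
  foreign::read.spss(path, to.data.frame = TRUE, reencode = "latin1")
}

# Only attempt the import if the file from the question is actually present.
if (file.exists("foobar.sav")) {
  bar <- read_sav_latin1("foobar.sav")
  str(bar[, c("var1", "var2", "var3")])
}
```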

0
JKP On

That output indicates clearly that the function is not taking into account the character encoding in the file or that the encoding is not correctly declared. Those ? characters indicate a misinterpreted or incorrectly written character. I expected them to be u-umlauts, but in code page 1252 e4 is actually a-umlaut.
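
The byte values can be checked quickly in R. This is just a sketch using base iconv(); it assumes the platform's iconv knows the name "CP1252", which is common but not guaranteed.

```r
# Decode single bytes as code page 1252 and return their Unicode code points.
cp1252_codepoint <- function(byte) {
  utf8ToInt(iconv(rawToChar(as.raw(byte)), from = "CP1252", to = "UTF-8"))
}

a_umlaut <- cp1252_codepoint(0xe4)  # U+00E4, a-umlaut
u_umlaut <- cp1252_codepoint(0xfc)  # U+00FC, u-umlaut

sprintf("0xe4 -> U+%04X, 0xfc -> U+%04X", a_umlaut, u_umlaut)
```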

Sav files have their encoding marked, so it should be respected. If the file was created by SPSS, the marking will be correct. However, we have seen cases where sav files written by third-party code do not mark the file correctly.

I'm pretty sure that this file is actually written in code page 1252, but the encoding is probably declared, incorrectly, as utf-8, assuming that the display above would actually represent extended characters properly.

The SPSS SYSFILE INFO command will show the declared encoding, if any, but you can also look at a hex dump of the first part of the file and see it.
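
For the hex dump, a rough sketch in base R could look like the following. The helper hex_dump and its defaults are made up for this example; it simply prints the first bytes of a file as hex next to their printable ASCII form, so any encoding name stored as text in that part of the file would be visible.

```r
# Hypothetical helper: dump the first n bytes of a file, `width` bytes per line,
# as offset, hex values, and printable ASCII (other bytes shown as ".").
hex_dump <- function(path, n = 256L, width = 16L) {
  bytes <- readBin(path, what = "raw", n = n)
  if (length(bytes) == 0L) return(character(0))
  starts <- seq(1L, length(bytes), by = width)
  vapply(starts, function(s) {
    ints <- as.integer(bytes[s:min(s + width - 1L, length(bytes))])
    hex <- paste(sprintf("%02x", ints), collapse = " ")
    ints[ints < 32L | ints > 126L] <- 46L  # replace non-printable bytes with "."
    ascii <- rawToChar(as.raw(ints))
    paste0(sprintf("%08x", as.integer(s - 1L)), "  ",
           formatC(hex, width = 3L * width - 1L, flag = "-"), "  ", ascii)
  }, character(1))
}

# Only dump the file from the question if it is actually present.
if (file.exists("foobar.sav")) {
  writeLines(hex_dump("foobar.sav"))
}
```

Increase n to look further into the file if nothing recognisable shows up in the first lines.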