Control encoding when parsing SPSS file using package memisc

Question

592 views Asked by Thomas Möbius At 09 June 2015 at 12:00

I have been given an SPSS system file that I would like to analyse in R. I am using the following code to read the file into R:

library(memisc)
foo <- spss.system.file("foobar.sav")
bar <- subset(foo, select=c(var1,var2,var3))

Printing the parsed data gives the following:

> bar
Data set with 379 observations and 3 variables

var1       var2        var3
1      gut    weiblich      Herbst
2      gut mnlich      Sommer
3      gut mnlich      Sommer
4      gut mnlich      Winter
5      gut mnlich Fr�hling
6      gut mnlich Fr�hling
7      gut    weiblich Fr�hling
.
.
.
25      gut    weiblich Fr�hling
.. ........ ........... ...........
(27 of 379 observations shown)

I guess you get the idea: the umlauts are garbled. I am fairly sure that the .sav file was saved in Latin-1 encoding. How can I tell spss.system.file() to use this encoding when parsing the SPSS file?

There are 3 answers

Zoltan Fabian On 09 June 2015 at 13:04

This problem could be specific to the memisc package. As a quick solution, if you do not need to stick with memisc, try the read.spss function from the foreign package. Also consider adding the memisc tag to your question.
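A minimal sketch of that alternative, assuming the file really is Latin-1 as the question suspects. read.spss() in foreign has a reencode argument that can be a character string naming the encoding to assume for the file. The file name "foobar.sav" and the variable names come from the question; the helper name read_sav_latin1 is made up for illustration.

```r
library(foreign)

# Hypothetical helper: read a .sav file, assuming its strings are Latin-1.
read_sav_latin1 <- function(path, encoding = "latin1") {
  read.spss(path,
            to.data.frame = TRUE,  # return a data frame instead of a list
            reencode = encoding)   # encoding to assume for the file
}

# Only runs if the file from the question is present in the working directory.
if (file.exists("foobar.sav")) {
  bar <- read_sav_latin1("foobar.sav")[, c("var1", "var2", "var3")]
  print(head(bar))
}
```

If the output still shows garbled characters, the encoding assumption is probably wrong and another value for the encoding argument is worth trying.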

JKP On 10 June 2015 at 17:41

That output clearly indicates that either the function is not taking the file's character encoding into account or the encoding is not correctly declared. Those ? characters indicate a misinterpreted or incorrectly written character. I expected them to be u-umlauts, but in code page 1252, e4 is actually a-umlaut.
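To illustrate the byte-level point, here is a small base R sketch. It builds the two German words from the question byte by byte as they would be stored in Latin-1 / code page 1252 (0xe4 for a-umlaut, 0xfc for u-umlaut), shows that these bytes are not valid UTF-8, and converts them with iconv(). The variable names are chosen for the example only.

```r
# Bytes of the German words for "male" and "spring" in Latin-1 / cp1252:
# 0xe4 is a-umlaut, 0xfc is u-umlaut.
raw_male   <- as.raw(c(0x6d, 0xe4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68))
raw_spring <- as.raw(c(0x46, 0x72, 0xfc, 0x68, 0x6c, 0x69, 0x6e, 0x67))

x <- c(rawToChar(raw_male), rawToChar(raw_spring))

# Interpreted as UTF-8, these byte sequences are invalid, which is why a
# UTF-8 console shows placeholder characters for them.
print(validUTF8(x))

# Naming the source encoding lets iconv() translate the bytes correctly.
y <- iconv(x, from = "latin1", to = "UTF-8")
print(y)
```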

Sav files have their encoding marked, so it should be respected. If the file was created by SPSS, the marking will be correct; however, we have seen cases where sav files written by third-party code do not mark the file correctly.

I'm pretty sure that this file is actually written in code page 1252, but the encoding is probably declared, incorrectly, as utf-8, assuming that the display above would actually represent extended characters properly.

The SPSS SYSFILE INFO command will show the declared encoding, if any, but you can also look at a hex dump of the first part of the file and see it.
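For the hex dump route, a rough base R sketch follows. The helper hex_dump is hypothetical, and how far into the file the encoding declaration sits is not stated here, so the default of 256 bytes is only a guess and may need to be increased.

```r
# Hypothetical helper: print the first n bytes of a file as hex plus the
# printable ASCII characters, 16 bytes per row.
hex_dump <- function(path, n = 256L) {
  bytes <- readBin(path, what = "raw", n = n)
  if (length(bytes) == 0L) return(invisible(character(0)))
  starts <- seq(1L, length(bytes), by = 16L)
  rows <- vapply(starts, function(i) {
    ints <- as.integer(bytes[i:min(i + 15L, length(bytes))])
    hex <- paste(sprintf("%02x", ints), collapse = " ")
    # Replace non-printable bytes with "." (ASCII 46) in the text column.
    printable <- ints >= 32L & ints <= 126L
    ascii <- rawToChar(as.raw(ifelse(printable, ints, 46L)))
    sprintf("%08x  %-47s  %s", as.integer(i - 1L), hex, ascii)
  }, character(1))
  cat(rows, sep = "\n")
  invisible(rows)
}

# Only runs if the file from the question is present in the working directory.
if (file.exists("foobar.sav")) hex_dump("foobar.sav")
```

An encoding name such as "UTF-8" or "windows-1252" appearing in the text column would be the declaration to look for.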

Thomas Möbius On 15 June 2015 at 07:17 (Accepted Answer)

Thank you everyone for your help. I am answering my own question. spss.system.file() reads the strings contained in SPSS files as-is, without any translation, so the resulting strings carry no encoding information. However, the memisc package contains a function Iconv that does exactly what the Unix function iconv would do.

> library(memisc)
> foo <- spss.system.file("foobar.sav")
> foo <- Iconv(foo,from="Latin1",to="UTF-8")
> foo <- as.data.frame(as.data.set(foo))
> head(foo$Geschlecht)
[1] weiblich männlich männlich männlich männlich männlich
Levels: männlich weiblich
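If the data have already ended up in a plain data frame with unconverted strings, the same idea can be sketched with base iconv(). This is not part of the memisc approach above: the helper reencode_df and the demo data are made up for illustration, and Latin-1 is assumed as the source encoding.

```r
# Hypothetical helper: re-encode character columns and factor levels of a
# data frame from one encoding to another using base iconv().
reencode_df <- function(df, from = "latin1", to = "UTF-8") {
  for (name in names(df)) {
    col <- df[[name]]
    if (is.character(col)) {
      df[[name]] <- iconv(col, from = from, to = to)
    } else if (is.factor(col)) {
      levels(df[[name]]) <- iconv(levels(col), from = from, to = to)
    }
  }
  df
}

# Made-up demo data: the second value contains the Latin-1 byte 0xe4.
x <- c("weiblich",
       rawToChar(as.raw(c(0x6d, 0xe4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68))))
demo <- data.frame(Geschlecht = factor(x, levels = x))
demo <- reencode_df(demo)
print(levels(demo$Geschlecht))
```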

All the best.
