Munging text strings with okinas and other Hawaiian diacritical marks


I am using R to clean street addresses from Hawaii. The addresses were entered with Hawaiian diacritical marks. On OS X I can easily use gsub() to remove the diacritics; however, on PCs running 64-bit Windows, R shows strange characters such as "â€" in place of the okina (‘). I suspect it could be an encoding issue, so I included the encoding parameter like the following:

address_file <- read.csv("file.csv", encoding="UTF-8")
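(As an aside, in read.csv() the encoding argument only declares how the strings are already encoded; if the bytes themselves need to be converted while reading, which often matters on Windows, the fileEncoding argument is the one to use. A minimal sketch, assuming the file really is UTF-8:)

```r
# fileEncoding re-encodes the file as it is read, whereas
# encoding= merely marks the strings as being in that encoding
address_file <- read.csv("file.csv", fileEncoding = "UTF-8")
```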

Although this fixed most of the strange characters, R could no longer recognize certain diacritics such as the okina. For example, I would use the following syntax, but the okina is not removed:

gsub("‘", "", hiplaces$name) 

Can someone please help me solve this on a PC running 64-bit Windows? I suspect it is either 1) an encoding issue where I am choosing the incorrect encoding, or 2) something a gsub pattern can fix by removing/replacing the diacritics. The data I am trying to clean looks something like this:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian    Congregational Church", "Nā‘ālehu Community Center")

gsub("‘", "", hiplaces$name) 
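(For reference, the okina here is U+2018, LEFT SINGLE QUOTATION MARK. Writing the pattern with a Unicode escape keeps it independent of how the script file itself is encoded, which is one common source of this Windows-only failure:)

```r
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("\u2018Imiola Congregational Church",
                   "\u014cla\u2018a First Hawaiian Congregational Church",
                   "N\u0101\u2018\u0101lehu Community Center")

# \u2018 is the okina's codepoint; the escape avoids any mangling of a
# literal okina when the script is saved or sourced in another encoding
gsub("\u2018", "", hiplaces$name)
```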

TIA.


1 Answer

Answer by Tim Biegeleisen (best answer)

Since your end result is a set of street addresses, you should be OK with simply retaining only alphanumeric characters. Under this assumption, the following should work:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church",
                   "‘Ōla‘a First Hawaiian    Congregational Church",
                   "Nā‘ālehu Community Center")

# drop everything except alphanumerics, slashes, apostrophes, and spaces
# (the repeated slashes in the original class were redundant)
hiplaces$name <- gsub("[^[:alnum:]/' ]", "", hiplaces$name)

> hiplaces$name
[1] "Imiola Congregational Church"
[2] "Olaa First Hawaiian    Congregational Church"
[3] "Naalehu Community Center"
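Note that whether [:alnum:] matches accented letters such as ā depends on the locale, so results can differ between machines. If you want the macron vowels transliterated to their base letters rather than risk deleting them, one base-R sketch is iconv() with //TRANSLIT; its behavior is platform-dependent (characters it cannot map may come back as NA or "?"), so treat this as an option to test rather than a guaranteed fix:

```r
# Remove okinas outright, then ask iconv to transliterate remaining
# accented characters (e.g. macron vowels) to plain ASCII.
# NOTE: //TRANSLIT support varies by platform (glibc vs. Windows).
clean_name <- function(x) {
  x <- gsub("\u2018", "", x)                             # drop okinas
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # a-with-macron -> a, etc.
  gsub("[^[:alnum:]/' ]", "", x)                         # final tidy-up
}

clean_name("N\u0101\u2018\u0101lehu Community Center")
```

The stringi package's stri_trans_general(x, "Latin-ASCII") does the same transliteration portably, if adding a dependency is acceptable.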