Munging text strings with okinas and other Hawaiian diacritical marks


I am using R to clean street addresses from Hawaii. The addresses were entered with Hawaiian diacritical marks. On OS X I can easily use gsub() to remove the diacritics; however, on PCs running 64-bit Windows, R shows strange characters such as "â€" in place of the okina (‘). I suspect it could be an encoding issue, so I included the encoding parameter like the following:

address_file <- read.csv("file.csv", encoding="UTF-8")
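(As an aside, in read.csv() the encoding argument only declares how the strings are already encoded; if the bytes themselves need to be converted while reading, which often matters on Windows, the fileEncoding argument is the one to use. A minimal sketch, assuming the file really is UTF-8:)

```r
# fileEncoding re-encodes the file as it is read, whereas
# encoding= merely marks the strings as being in that encoding
address_file <- read.csv("file.csv", fileEncoding = "UTF-8")
```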

Although this fixed most of the strange characters, R could no longer recognize certain diacritics such as the okina. For example, I would use the following syntax, but the okina is not removed:

gsub("‘", "", hiplaces$name) 

Can someone please help me solve this on a PC running 64-bit Windows? I suspect it is either 1) an encoding issue where I am choosing the incorrect encoding, or 2) something a gsub pattern can fix by removing/replacing the diacritics. The data I am trying to clean looks something like this:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian    Congregational Church", "Nā‘ālehu Community Center")

gsub("‘", "", hiplaces$name) 
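(For reference, the okina here is U+2018, LEFT SINGLE QUOTATION MARK. Writing the pattern with a Unicode escape keeps it independent of how the script file itself is encoded, which is one common source of this Windows-only failure:)

```r
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("\u2018Imiola Congregational Church",
                   "\u014cla\u2018a First Hawaiian Congregational Church",
                   "N\u0101\u2018\u0101lehu Community Center")

# \u2018 is the okina's codepoint; the escape avoids any mangling of a
# literal okina when the script is saved or sourced in another encoding
gsub("\u2018", "", hiplaces$name)
```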

TIA.


1 Answer

Answer by Tim Biegeleisen (best answer)

Since your end result is a set of street addresses, you should be OK with simply retaining only alphanumeric characters. Under this assumption, the following should work:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church",
                   "‘Ōla‘a First Hawaiian    Congregational Church",
                   "Nā‘ālehu Community Center")

# drop everything except alphanumerics, slashes, apostrophes, and spaces
# (the repeated slashes in the original class were redundant)
hiplaces$name <- gsub("[^[:alnum:]/' ]", "", hiplaces$name)

> hiplaces$name
[1] "Imiola Congregational Church"
[2] "Olaa First Hawaiian    Congregational Church"
[3] "Naalehu Community Center"
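Note that whether [:alnum:] matches accented letters such as ā depends on the locale, so results can differ between machines. If you want the macron vowels transliterated to their base letters rather than risk deleting them, one base-R sketch is iconv() with //TRANSLIT; its behavior is platform-dependent (characters it cannot map may come back as NA or "?"), so treat this as an option to test rather than a guaranteed fix:

```r
# Remove okinas outright, then ask iconv to transliterate remaining
# accented characters (e.g. macron vowels) to plain ASCII.
# NOTE: //TRANSLIT support varies by platform (glibc vs. Windows).
clean_name <- function(x) {
  x <- gsub("\u2018", "", x)                             # drop okinas
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # a-with-macron -> a, etc.
  gsub("[^[:alnum:]/' ]", "", x)                         # final tidy-up
}

clean_name("N\u0101\u2018\u0101lehu Community Center")
```

The stringi package's stri_trans_general(x, "Latin-ASCII") does the same transliteration portably, if adding a dependency is acceptable.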