I am using R to clean street addresses from Hawaii. The addresses were entered with Hawaiian diacritical marks. When using R on OS X, I can easily use gsub() to remove the diacritics; however, on PCs running 64-bit Windows, R shows strange characters such as "â€" in place of the okina (‘). Suspecting an encoding issue, I included the encoding parameter like the following:
address_file <- read.csv("file.csv", encoding="UTF-8")
Although this solved most of the strange characters, R could no longer match certain diacritics such as the okina. For example, with the following syntax, the okina is not removed:
gsub("‘", "", hiplaces$name)
Can someone please help me solve this issue on a PC running 64-bit Windows? I suspect it is either 1) an encoding issue where I am choosing the incorrect encoding, or 2) something a gsub() solution could fix by removing/replacing the diacritics. The data I am trying to clean looks something like this:
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian Congregational Church", "Nā‘ālehu Community Center")
gsub("‘", "", hiplaces$name)
TIA.
Since your end result is a set of street addresses, you should be OK with retaining only alphanumeric characters. Under this assumption, the following should work:
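# Keep only letters, digits, and spaces. The okina in your sample is
# the curly quote ‘ (U+2018), which counts as punctuation and is
# dropped; [:alnum:] is locale-aware, so letters with macrons (Ō, ā)
# survive.
gsub("[^[:alnum:] ]", "", hiplaces$name)

In a UTF-8 locale this should give something like:

[1] "Imiola Congregational Church"
[2] "Ōlaa First Hawaiian Congregational Church"
[3] "Nāālehu Community Center"

Two caveats. If your file actually contains the Unicode ʻokina (U+02BB) rather than the curly quote, that character is classified as a letter, so [:alnum:] will keep it and you would need to remove it explicitly, e.g. gsub("\u02bb", "", hiplaces$name). And if you also want to flatten the macron vowels to plain ASCII, iconv() transliteration may work, although the result depends on the iconv implementation your platform ships with (on Windows, characters with no close ASCII equivalent may come back as "'" or "?"):

# Transliterate to ASCII where possible; how unmappable characters
# are substituted is platform-dependent, so check the output.
iconv(hiplaces$name, from = "UTF-8", to = "ASCII//TRANSLIT")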