How to split in R


I have a multicentric study whose site locations were imported as a single string, for example "Netra Jyothi Charitable Trust, C/o Prasad Netralaya Super specialty Eye Hospital,KARNATAKAYenepoya Medical College Hospital,KARNATAKA". I want to split "KARNATAKAYenepoya" between "KARNATAKA" and "Yenepoya". I have not been able to write code that automatically detects a state name written in capitals, such as "KARNATAKA", and splits the string right after it.

I have tried using the code:

library(stringr)

text <- "KARNATAKAYenepoya"
state_name <- str_extract(text, "[A-Z]+")
print(state_name)

But I am getting the value "KARNATAKAY" instead of "KARNATAKA".


There are 3 answers

Artëm Sozontov On

For your case and a few similar ones, try this:

library(tidyverse)

text <- c(
    "Netra Jyothi Charitable Trust, C/o Prasad Netralaya Super specialty Eye Hospital,KARNATAKAYenepoya Medical College Hospital,KARNATAKA", 
    "KARNATAKAYenepoya"
    )

tibble(text) %>% 
    mutate(
           # first run of 3+ capitals in each row, e.g. "KARNATAKAY"
           text1 = str_extract(text, "[A-Z]{3,}"), 
           # drop the last capital, which belongs to the next word
           text1 = substr(text1, 1, nchar(text1)-1), 
           # end position of the state name within each string
           pos = str_locate(text, text1)[,2],
           # text after the state name, cut at the first whitespace
           text2 = substr(text, pos+1, nchar(text)),
           text2 = str_extract(text2, "^[^\\s]+")
           ) %>% 
    select(text1, text2, text)

The output will be:

# A tibble: 2 × 3
  text1     text2    text                                 
  <chr>     <chr>    <chr>                                
1 KARNATAKA Yenepoya Netra Jyothi Charitable Trust, C/o P…
2 KARNATAKA Yenepoya KARNATAKAYenepoya   
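
The extra "Y" in the question appears because `[A-Z]+` is greedy and the "Y" of "Yenepoya" is itself a capital letter. As an alternative sketch (not part of the code above), a lookahead can stop the match just before a capital that is followed by a lowercase letter. It uses base R only and assumes the word after the state name is written in title case; it will not find the boundary when the following word is fully capitalised too.

```r
# Base R only, so no packages are required.
text <- c(
  "Netra Jyothi Charitable Trust, C/o Prasad Netralaya Super specialty Eye Hospital,KARNATAKAYenepoya Medical College Hospital,KARNATAKA",
  "KARNATAKAYenepoya"
)

# Two or more capitals, ending just before a capital that is followed by a
# lowercase letter (that capital starts the next word).
pattern <- "[A-Z]{2,}(?=[A-Z][a-z])"

# regexpr() finds the first match per element; regmatches() extracts it.
# Note: regmatches() silently drops elements that have no match.
state_name <- regmatches(text, regexpr(pattern, text, perl = TRUE))
print(state_name)
#> [1] "KARNATAKA" "KARNATAKA"

# Insert a space at the boundary instead of extracting the name.
split_text <- sub("([A-Z]{2,})(?=[A-Z][a-z])", "\\1 ", text, perl = TRUE)
print(split_text[2])
#> [1] "KARNATAKA Yenepoya"
```

The same pattern should also work with `stringr::str_extract()`, which supports lookaheads.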
AnilGoyal On

From the OP's question, I understand that she wants a whitespace inserted wherever an Indian state name is encountered.

One possible strategy is shown below (note that the names of the Indian states and union territories have to be stored first):

library(stringr)

indian_states <- c(
  "Andhra Pradesh",
  "Arunachal Pradesh",
  "Assam",
  "Bihar",
  "Chhattisgarh",
  "Goa",
  "Gujarat",
  "Haryana",
  "Himachal Pradesh",
  "Jharkhand",
  "Karnataka",
  "Kerala",
  "Madhya Pradesh",
  "Maharashtra",
  "Manipur",
  "Meghalaya",
  "Mizoram",
  "Nagaland",
  "Odisha",
  "Punjab",
  "Rajasthan",
  "Sikkim",
  "Tamil Nadu",
  "Telangana",
  "Tripura",
  "Uttar Pradesh",
  "Uttarakhand",
  "West Bengal",
  "Andaman and Nicobar Islands",
  "Chandigarh",
  "Dadra and Nagar Haveli",
  "Daman",
  "Diu",
  "Lakshadweep",
  "Delhi",
  "Puducherry",
  "Jammu and Kashmir",
  "Ladakh"
)

indian_states <- toupper(indian_states)

text <- "KARNATAKAYenepoya"

str_replace(text, 
            indian_states[str_detect(text, indian_states)], 
            paste0(indian_states[str_detect(text, indian_states)], " "))
#> [1] "KARNATAKA Yenepoya"

Created on 2024-03-06 with reprex v2.0.2
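
The `str_replace()` call above is written for a single string in which exactly one state name is detected. For a whole column of locations, one possible extension is to join the names into a single alternation pattern and insert a space only where a state name is immediately followed by a letter. This is a minimal base R sketch: the short `states` vector and the third input string are hypothetical stand-ins, and the full `indian_states` vector would be substituted in practice.

```r
# Illustrative subset only (assumption); use the full upper-cased vector in practice.
states <- toupper(c("Karnataka", "Kerala", "Maharashtra", "Tamil Nadu"))

text <- c(
  "Netra Jyothi Charitable Trust, C/o Prasad Netralaya Super specialty Eye Hospital,KARNATAKAYenepoya Medical College Hospital,KARNATAKA",
  "KARNATAKAYenepoya",
  "No state name here"  # hypothetical input with nothing to split
)

# Try longer names first so that a name which is a prefix of another
# cannot win the alternation.
states <- states[order(nchar(states), decreasing = TRUE)]

# A state name followed directly by a letter, i.e. glued to the next word.
pattern <- paste0("(", paste(states, collapse = "|"), ")(?=[A-Za-z])")

# gsub() is vectorised over `text` and handles several states per string.
result <- gsub(pattern, "\\1 ", text, perl = TRUE)
print(result[2:3])
#> [1] "KARNATAKA Yenepoya" "No state name here"
```

Because of the lookahead, the trailing ",KARNATAKA" in the first string is left untouched, and strings without a state name pass through unchanged.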

Oindrila Roy Chowdhury On

How can I split "MAHARASHTRATATA MEMORIAL HOSPITAL" into "MAHARASHTRA" and "TATA MEMORIAL HOSPITAL"?
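
When everything is in capitals, a rule based on letter case cannot find the boundary, so the split has to rely on a list of known state names. The following is a rough base R sketch under two assumptions: the state name sits at the start of the string, and `states` (a hypothetical two-element subset here) contains every name that can occur.

```r
# Illustrative subset only (assumption); extend with all state/UT names.
states <- toupper(c("Karnataka", "Maharashtra"))

text <- c("MAHARASHTRATATA MEMORIAL HOSPITAL", "KARNATAKAYenepoya")

# Group 1: a known state name at the start; group 2: whatever follows.
pattern <- paste0("^(", paste(states, collapse = "|"), ")[[:space:]]*(.*)$")

# regexec() keeps the capture groups; each list element is
# c(full match, group 1, group 2), or empty when nothing matched.
m <- regmatches(text, regexec(pattern, text))

# One row per input string: state name, then the remainder (NA if no match).
parts <- do.call(rbind, lapply(m, function(x) x[2:3]))
print(parts)
#>      [,1]          [,2]
#> [1,] "MAHARASHTRA" "TATA MEMORIAL HOSPITAL"
#> [2,] "KARNATAKA"   "Yenepoya"
```

If one state name is a prefix of another, sorting `states` by decreasing length before building the pattern is a sensible precaution.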