Extract value between specific string and colon in R


I have a table like this:

No, Memo
  1, Date: 2020/10/22 City: UA Note: True mastery of any skill takes a lifetime.
  2, Date: 2022/11/01 City: CH Note: Sweat is the lubricant of success.
  3, Date: 2022y11m1d City: UA Note: Every noble work is at first impossible.
  4, Date: 2022y2m15d City: AA Note: Live beautifully, dream passionately, love completely.

I want to extract the strings after Date:, City: and Note:. For example, for No. 1 I need to extract "2020/10/22", which is between Date: and City:, "UA", which is between City: and Note:, and "True mastery of any skill takes a lifetime.", which is after Note:.

Desired output:

 No Date       City Note
  1 2020/10/22 UA   True mastery of any skill takes a lifetime.
  2 2022/11/01 CH   Sweat is the lubricant of success.
  3 2022y11m1d UA   Every noble work is at first impossible.
  4 2022y2m15d AA   Live beautifully, dream passionately, love completely.

Does anyone know how to do this? Any help would be great. Thank you.

There are 2 answers

f.lechleitner On BEST ANSWER

My solution uses regex with stringr and dplyr:

library(stringr)
library(dplyr)

df <- read.table(
  text = "No; Memo
  1; Date: 2020/10/22 City: UA Note: True mastery of any skill takes a lifetime.
  2; Date: 2022/11/01 City: CH Note: Sweat is the lubricant of success.
  3; Date: 2022y11m1d City: UA Note: Every noble work is at first impossible.
  4; Date: 2022y2m15d City: AA Note: Live beautifully, dream passionately, love completely.",
  sep = ";",
  header = TRUE
)

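# Lookbehind (?<=...) and lookahead (?=...) bound each match without including the labels
df_test <- df %>%
  mutate(date = str_extract(Memo, "(?<=Date: )(.*)(?= City)"),
         city = str_extract(Memo, "(?<=City: )(.*)(?= Note)"),
         note = str_extract(Memo, "(?<=Note: ).*")) %>%
  select(-Memo)  # drop the raw column once the fields are extracted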


> df_test
  No       date city                                                   note
1  1 2020/10/22   UA            True mastery of any skill takes a lifetime.
2  2 2022/11/01   CH                     Sweat is the lubricant of success.
3  3 2022y11m1d   UA               Every noble work is at first impossible.
4  4 2022y2m15d   AA Live beautifully, dream passionately, love completely.

The regex matches everything between the two labels, which are specified with a positive lookbehind (?<=...) and a positive lookahead (?=...).
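For anyone who wants to see the lookaround idea in isolation, here is a minimal sketch on a single string using only base R. It assumes perl = TRUE so that lookbehind is available, and first_match is a hypothetical helper name, not something from stringr or the answer above. The lazy .*? is my own choice to stop at the first following label.

```r
memo <- "Date: 2020/10/22 City: UA Note: True mastery of any skill takes a lifetime."

# Hypothetical helper: return the first match of a PCRE pattern in x
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# (?<=Date: ) requires "Date: " immediately before the match,
# (?= City:) requires " City:" immediately after it; neither is returned.
date <- first_match("(?<=Date: ).*?(?= City:)", memo)
city <- first_match("(?<=City: ).*?(?= Note:)", memo)
note <- first_match("(?<=Note: ).*", memo)

print(c(date = date, city = city, note = note))
```

Note that regmatches drops elements with no match, so on a whole column this sketch is only safe if every row contains all three labels; str_extract returns NA for those rows instead.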

G. Grothendieck On

Insert a newline before each keyword in Memo. The text is then in DCF format, so it can be read with read.dcf. This approach is general: it does not depend on the particular keywords in Memo and needs no packages.

DF |>
  transform(Memo = gsub("(\\w+: )", "\n\\1", Memo)) |>
  with(data.frame(No, read.dcf(textConnection(Memo))))

giving

  No       Date City                                                   Note
1  1 2020/10/22   UA            True mastery of any skill takes a lifetime.
2  2 2022/11/01   CH                     Sweat is the lubricant of success.
3  3 2022y11m1d   UA               Every noble work is at first impossible.
4  4 2022y2m15d   AA Live beautifully, dream passionately, love completely.
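To make the intermediate step easier to follow, this is a small sketch of what the gsub call produces for one Memo value and how read.dcf sees it. The variable names memo, dcf_text and fields are only illustrative.

```r
memo <- "Date: 2020/10/22 City: UA Note: True mastery of any skill takes a lifetime."

# Put a newline before every "word: " token so each field starts on its own line
dcf_text <- gsub("(\\w+: )", "\n\\1", memo)
cat(dcf_text, "\n")

# read.dcf parses "Field: value" lines into a character matrix,
# with one column per field name and one row per record
con <- textConnection(dcf_text)
fields <- read.dcf(con)
close(con)

print(colnames(fields))
print(fields[1, "City"])
```

One caveat, as far as I can tell: the pattern \\w+: matches any word followed by a colon and a space, so a note that itself contains text like "Tip: " would be split into an extra field.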

Note: the input data used above is

DF <- data.frame(
  No = 1:4,
  Memo = c(
    "Date: 2020/10/22 City: UA Note: True mastery of any skill takes a lifetime.",
    "Date: 2022/11/01 City: CH Note: Sweat is the lubricant of success.",
    "Date: 2022y11m1d City: UA Note: Every noble work is at first impossible.",
    "Date: 2022y2m15d City: AA Note: Live beautifully, dream passionately, love completely."
  )
)