Importing irregular, unseparated text file data in R


I'm having trouble importing this data into R, since it is not in a standard CSV format. My raw data looks like this:

BC000068032198109TMAX  232  Q  220  Q  220  Q  244  Q  239  Q  246  Q  270  Q  300  Q  327  Q  279  Q  256  Q  260  Q  289  Q  342  Q  357  Q  359  Q  370  Q  373  Q  367  Q  370  Q  372  Q  357  Q  366  Q  365  Q  355  Q  355  Q  364  Q  343  Q  364  Q  362  Q-9999  
BC000068032198110TMIN  180  Q  170  I  150  I  130  I  150  I  130  I  160  I  190  I  190  I  185  Q-9999     130  I  130  I  160  I  170  I  140  I  160  I  160  I  160  I  160  I  160  I-9999     190  I  180  I  160  I  165  Q  210  I  180  I-9999     190  I  170  I

So, basically, the first 11 characters of each line are the location's ID, the next 4 are the year, the following 2 are the month, and the 4 after that are the variable's name. Finally, there are 31 numbers, which are the daily measurements for that specific month and variable. -9999 means Not Available. Normally each measurement is followed by a flag (e.g. "Q" or "I"), but NAs are not. I'm not interested in importing these flags. Ultimately I'd like to have a long dataset in which each daily measurement is paired with its date, location and variable name. Thank you for your support.

There are 2 answers

hrbrmstr

In the event it's not a fixed-width format, something like:

library(purrr)
library(dplyr)
library(stringi)

lines <- "BC000068032198109TMAX  232  Q  220  Q  220  Q  244  Q  239  Q  246  Q  270  Q  300  Q  327  Q  279  Q  256  Q  260  Q  289  Q  342  Q  357  Q  359  Q  370  Q  373  Q  367  Q  370  Q  372  Q  357  Q  366  Q  365  Q  355  Q  355  Q  364  Q  343  Q  364  Q  362  Q-9999  \nBC000068032198110TMIN  180  Q  170  I  150  I  130  I  150  I  130  I  160  I  190  I  190  I  185  Q-9999     130  I  130  I  160  I  170  I  140  I  160  I  160  I  160  I  160  I  160  I-9999     190  I  180  I  160  I  165  Q  210  I  180  I-9999     190  I  170  I"

readLines(textConnection(lines)) %>%
  map_df(function(x) {

    substr(x, 22, nchar(x)) %>%                   # readings start at column 22, after the 21-character header
      stri_match_all_regex("([-[:digit:]]+)") %>% # pull out all the readings by extracting the numbers
      map(~.[,2]) %>%
      flatten_chr() %>%
      as.numeric() %>%                            # make them numbers
      na_if(-9999) -> value                       # make -9999 into NA

    tibble(                                       # tibble() replaces the deprecated data_frame()
      location_id = substr(x, 1, 11),
      date = as.Date(sprintf("%s-%s-%02d", substr(x, 12, 15), substr(x, 16, 17), seq_along(value))),
      variable = substr(x, 18, 21),
      value = value
    ) %>% filter(!is.na(date)) # don't include invalid dates (e.g. September 31)

  })
## # A tibble: 61 × 4
##    location_id       date variable value
##          <chr>     <date>    <chr> <dbl>
## 1  BC000068032 1981-09-01     TMAX   232
## 2  BC000068032 1981-09-02     TMAX   220
## 3  BC000068032 1981-09-03     TMAX   220
## 4  BC000068032 1981-09-04     TMAX   244
## 5  BC000068032 1981-09-05     TMAX   239
## 6  BC000068032 1981-09-06     TMAX   246
## 7  BC000068032 1981-09-07     TMAX   270
## 8  BC000068032 1981-09-08     TMAX   300
## 9  BC000068032 1981-09-09     TMAX   327
## 10 BC000068032 1981-09-10     TMAX   279
## # ... with 51 more rows

should work.

alistaire

You have a fixed-width file. The base R function for reading such files is read.fwf, though the readr package's read_fwf adds a couple of alternative ways to specify column widths that can be handy, depending on what you know about your file. In this case, what you know (11 + 4 + 2 + 4 header characters, then 31 pairs of a 5-character value and a 3-character flag) works pretty well with the base version:

x <- 'BC000068032198109TMAX  232  Q  220  Q  220  Q  244  Q  239  Q  246  Q  270  Q  300  Q  327  Q  279  Q  256  Q  260  Q  289  Q  342  Q  357  Q  359  Q  370  Q  373  Q  367  Q  370  Q  372  Q  357  Q  366  Q  365  Q  355  Q  355  Q  364  Q  343  Q  364  Q  362  Q-9999  
BC000068032198110TMIN  180  Q  170  I  150  I  130  I  150  I  130  I  160  I  190  I  190  I  185  Q-9999     130  I  130  I  160  I  170  I  140  I  160  I  160  I  160  I  160  I  160  I-9999     190  I  180  I  160  I  165  Q  210  I  180  I-9999     190  I  170  I'

df <- read.fwf(textConnection(x),    # put the path to your file here
               widths = c(11, 4, 2, 4, rep(c(5, 3), 31)),    # a vector of widths
               na.strings = c('-9999', '   ', '  '))    # set your NA values

df
#>            V1   V2 V3   V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16
#> 1 BC000068032 1981  9 TMAX 232   Q 220   Q 220   Q 244   Q 239   Q 246   Q
#> 2 BC000068032 1981 10 TMIN 180   Q 170   I 150   I 130   I 150   I 130   I
#>   V17 V18 V19 V20 V21 V22 V23 V24 V25  V26 V27 V28 V29 V30 V31 V32 V33 V34
#> 1 270   Q 300   Q 327   Q 279   Q 256    Q 260   Q 289   Q 342   Q 357   Q
#> 2 160   I 190   I 190   I 185   Q  NA <NA> 130   I 130   I 160   I 170   I
#>   V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47  V48 V49 V50 V51 V52
#> 1 359   Q 370   Q 373   Q 367   Q 370   Q 372   Q 357    Q 366   Q 365   Q
#> 2 140   I 160   I 160   I 160   I 160   I 160   I  NA <NA> 190   I 180   I
#>   V53 V54 V55 V56 V57 V58 V59 V60 V61  V62 V63 V64 V65  V66
#> 1 355   Q 355   Q 364   Q 343   Q 364    Q 362   Q  NA <NA>
#> 2 160   I 165   Q 210   I 180   I  NA <NA> 190   I 170    I

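From here, rename the first four columns, drop the flag columns (the even-numbered ones from V6 onward), and reshape the 31 value columns to long form by the usual means.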
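For what it's worth, here is a minimal base R sketch of that rename-and-reshape step. It is only one way to do it, not necessarily what the answer had in mind. The toy input is generated by a hypothetical helper, make_line, with made-up values in the question's layout, and the column names (location_id, year, month, variable) and the result name long are my own choices. It assumes every record has exactly 31 value/flag pairs.

```r
# Toy input in the question's layout (illustrative values, not the original data).
# make_line is a hypothetical helper used only to build the example records.
make_line <- function(id, year, month, var, values) {
  flags <- ifelse(values == -9999L, "", "Q")        # missing readings carry no flag
  paste0(id, year, sprintf("%02d", month), var,
         paste0(sprintf("%5d%3s", values, flags), collapse = ""))
}
sept <- c(201:230, -9999L)            # 30 real days; day 31 does not exist
oct  <- 101:131
oct[11] <- -9999L                     # one missing reading mid-month
x <- c(make_line("BC000068032", 1981L, 9L,  "TMAX", sept),
       make_line("BC000068032", 1981L, 10L, "TMIN", oct))

# Read as fixed width, same widths as in the answer above.
df <- read.fwf(textConnection(x),
               widths = c(11, 4, 2, 4, rep(c(5, 3), 31)),
               na.strings = "-9999",
               stringsAsFactors = FALSE)

# Rename the header columns (names are assumptions, pick your own).
names(df)[1:4] <- c("location_id", "year", "month", "variable")

# Odd-numbered columns from 5 on hold the readings; the even ones are flags.
value_cols <- seq(5, ncol(df), by = 2)
n_days <- length(value_cols)          # 31

# One output row per record and day; t() lays each record's days out consecutively.
long <- data.frame(
  location_id = rep(df$location_id, each = n_days),
  variable    = rep(df$variable, each = n_days),
  date        = as.Date(sprintf("%d-%02d-%02d",
                                rep(df$year, each = n_days),
                                rep(df$month, each = n_days),
                                rep(seq_len(n_days), times = nrow(df))),
                        format = "%Y-%m-%d"),
  value       = as.vector(t(as.matrix(df[value_cols]))),
  stringsAsFactors = FALSE
)

# Impossible dates (e.g. September 31) parse to NA; drop them.
long <- long[!is.na(long$date), ]
head(long, 3)
```

Real missing readings stay in the result as NA values with a valid date; only days that do not exist in the calendar are dropped. With tidyr installed, the reshape step could instead be done with pivot_longer, at the cost of an extra dependency.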