How to tidy a fixed width file with headers every n (varies) rows?

342 views Asked by At

I have time series data in a fixed-width-file where observation rows (n varies depending on the sample size) occur under a "header" row which contains the important meta data (ie. sample number, date, etc.). Both types of rows contain alphanumeric characters. It looks something like this (strings shortened for ease of reading:

4  64001416230519844TP blahblah  
5416001130  1 F   492273
5416001140  3 F   492274
5416001145  1 F   492275
5416001150 19 F   492276
5416001155 21 F   492277
5416001160 21 F   492278
5416001165 13 F   492279
5416001170  3 F   492280
5416001180  1 F   492281
4  64001544250619844RA blahblah
5544001125  1 F   492291
5544001130  3 F   492292
5544001135  4 F   492293
5544001140 11 F   492294
5544001145 13 F   492295
4  64002544250619844RA blahblah
etc.

Header rows are distinguished by the first character in the string == 4 and have 89 characters. Observation rows == 5 and have 24 characters.

What I would like is to paste the header row to each subsequent observation row (the subset of data) so that I can later parse the string with read_fwf and be assured that I can sort each observation by the information contained in the header row. I don't care if the original header row is dropped. Like this:

5416001130  1 F   492273 4  64001416230519844TP blahblah  
5416001140  3 F   492274 4  64001416230519844TP blahblah  
5416001145  1 F   492275 4  64001416230519844TP blahblah  
5416001150 19 F   492276 4  64001416230519844TP blahblah  
5416001155 21 F   492277 4  64001416230519844TP blahblah  
5416001160 21 F   492278 4  64001416230519844TP blahblah  
5416001165 13 F   492279 4  64001416230519844TP blahblah  
5416001170  3 F   492280 4  64001416230519844TP blahblah  
5416001180  1 F   492281 4  64001416230519844TP blahblah  
5544001125  1 F   492291 4  64001544250619844RA blahblah
5544001130  3 F   492292 4  64001544250619844RA blahblah
5544001135  4 F   492293 4  64001544250619844RA blahblah
5544001140 11 F   492294 4  64001544250619844RA blahblah
5544001145 13 F   492295 4  64001544250619844RA blahblah
etc...

The closest solution I found was here fwf file with headers every 5th row, headers were characters and observations numeric

The solution provided was a loop that iteratively scrolled through rows and tested if they were characters or numeric and pasted them together accordingly.

    text <- readLines('/path/to/file')                   # read in the file
split_text <- strsplit(text, "\\s+")                 # split each line on whitespace

for (line in split_text) {                           # iterate through lines
  numeric_line <- suppressWarnings(as.numeric(line)) # try to convert the current line into a vector of numbers
  if (is.na(numeric_line[[1]])) {                    # if it fails, we know we're on a header line
    header <- line
  } else {
    for (i in seq(1, length(line), 2)) {             # otherwise, we're on a data line, so take two numbers at once
      print(c(header, line[[i]], line[[i+1]]))       # and output the latest header with each pair of values
    }
  }
}

I tried to adapt this to my data by first reading the fwf with read.fwf() or read_fwf() and defining the first character as a column to distinguish between headers and observations:

    packages = c('tidyverse','rgdal','car','audio','beepr','xlsx','magrittr','lubridate','RColorBrewer','haven')
invisible(lapply(packages, function(x) {if (!require(x, character.only = T)) {install.packages(x);require(x)}}))
DF <- read.fwf("directory/.dat",  widths = c(1, 88), header = FALSE)

My adaptation:

newdf <- for (i in DF) {                           # iterate through lines

  if (DF$V1 == 4) {          # if true, we know we're on a header row
    header <- i
  } else {
    for (i in seq(1, length(DF$V2), 1)) { # otherwise = observation row 
      print(c(header, DF$V2[[i]], DF$V2[[i+1]]))  # and output the latest header with each observation until you hit another header
    }
  }
}
#this is very slow and/or does not work
# I get the following error message
#Warning messages:
1: In if (DF$V1 == 4) { :
  the condition has length > 1 and only the first element will be used

I have also tried specifying headers vs observation rows by nchar() hearders =89 and observations = 24. I realise the loop solution here might be to use ifelse but another issue arises. The dataset is ~39700 rows long and I get new data all the time. Loops would take a long time...

I would like to do this with the data.table or dplyr syntax.

I've tried playing with dplyr::lag as per these posts: dplyr example 1 and dplyr example 2 and get close to what I want:

newdf<-DF %>% 
  mutate(new = replace(lag(V2), V1 != '5', NA))

enter image description here

But as you see the new column only pastes the information from the previous row... as lag() is supposed to do.

Any help would be greatly appreciated, thank you in advance.

As a side note. This data was previously processed in SAS but since I don't do SAS there you go. I do have the SAS code if that helps:

DATA A1;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
       TYPE   1   @ ;
        IF TYPE EQ 4 THEN LINK LIGNE4;
        IF TYPE EQ 5 THEN DELETE;
        RETURN;


LIGNE4:
INPUT             var1     $   6 -  8
                  var2     $   9 - 11
                  var3     12 - 13
                  var4     14 - 15
                  var5     18 - 19
                  var6     $  20 - 22
                  var7     $  44 - 46
                  var8     $    78;


DATA A2;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
       TYPE   1   @ ;
        IF TYPE EQ 4 THEN DELETE;
        IF TYPE EQ 5 THEN LINK LIGNE5;
        RETURN;


LIGNE5:
INPUT             var1     $    5 - 7
                  var2     $    2 - 4
                  varz     8 - 10
                  vara     11 - 13
                  varb     $     15;


DATA A3;
SET A1;
PROC SORT;
     BY var1 var2;
     RUN;

DATA A4;
SET A2;
PROC SORT;
     BY var1 var2;
     RUN;

DATA A5;
MERGE A4 A3;
      BY var1 var2;
      RUN;

As you can see it splits the file, sorts the variables, merges them. However this was done year by year and I want to work with one file with all the years.

4

There are 4 answers

0
You-leee On BEST ANSWER

One other possible solution (no tidyverse) is to read in the file per line, look for header rows and paste those rows at the end of rows without header. After, these lines are splitted and put into a data.frame.

lines <- readLines("asd.dat")

# last index + 1 for iteration
headers <- c(which(grepl("^4 ", lines)), length(lines) + 1) 

pastedLines <- c()
for(i in 1:(length(headers) - 1)) {
  pastedLines <- c(pastedLines, 
                   paste(lines[(headers[i] + 1) : (headers[i + 1] - 1)], lines[headers[i]]))
}

DF <- as.data.frame(matrix(unlist(strsplit(pastedLines, "\\s+")), nrow =  length(pastedLines), byrow=T))

Output:

           V1 V2 V3     V4 V5                  V6       V7
1  5416001130  1  F 492273  4 64001416230519844TP blahblah
2  5416001140  3  F 492274  4 64001416230519844TP blahblah
3  5416001145  1  F 492275  4 64001416230519844TP blahblah
4  5416001150 19  F 492276  4 64001416230519844TP blahblah
5  5416001155 21  F 492277  4 64001416230519844TP blahblah
6  5416001160 21  F 492278  4 64001416230519844TP blahblah
7  5416001165 13  F 492279  4 64001416230519844TP blahblah
8  5416001170  3  F 492280  4 64001416230519844TP blahblah
9  5416001180  1  F 492281  4 64001416230519844TP blahblah
10 5544001125  1  F 492291  4 64001544250619844RA blahblah
11 5544001130  3  F 492292  4 64001544250619844RA blahblah
12 5544001135  4  F 492293  4 64001544250619844RA blahblah
13 5544001140 11  F 492294  4 64001544250619844RA blahblah
14 5544001145 13  F 492295  4 64001544250619844RA blahblah
1
Richard Telford On

Here is a solution using tidyverse.

It makes a new column with just the header rows, and then fills the rows without a header with the header from above. At the end you can paste the columns together if you want.

 x <- read.table(text = "4  64001416230519844TP blahblah  
       5416001130  1 F   492273
       5416001140  3 F   492274
       5416001145  1 F   492275
       5416001150 19 F   492276
       5416001155 21 F   492277
       5416001160 21 F   492278
       5416001165 13 F   492279
       5416001170  3 F   492280
       5416001180  1 F   492281
       4  64001544250619844RA blahblah
       5544001125  1 F   492291
       5544001130  3 F   492292
       5544001135  4 F   492293
       5544001140 11 F   492294
       5544001145 13 F   492295", header = FALSE, sep = "\t")

library("tidyverse")
 x %>% 
   rename(body = V1) %>% 
   mutate(
     body = trimws(body),
     head = if_else(grepl("^4", body), body, NA_character_),
     body = if_else(is.na(head), body, NA_character_)
   ) %>% 
   fill(head, .direction  = "down") %>% 
   filter(!is.na(body))

Output

                       body                            head
1  5416001130  1 F   492273 4  64001416230519844TP blahblah
2  5416001140  3 F   492274 4  64001416230519844TP blahblah
3  5416001145  1 F   492275 4  64001416230519844TP blahblah
4  5416001150 19 F   492276 4  64001416230519844TP blahblah
5  5416001155 21 F   492277 4  64001416230519844TP blahblah
6  5416001160 21 F   492278 4  64001416230519844TP blahblah
7  5416001165 13 F   492279 4  64001416230519844TP blahblah
8  5416001170  3 F   492280 4  64001416230519844TP blahblah
9  5416001180  1 F   492281 4  64001416230519844TP blahblah
10 5544001125  1 F   492291 4  64001544250619844RA blahblah
11 5544001130  3 F   492292 4  64001544250619844RA blahblah
12 5544001135  4 F   492293 4  64001544250619844RA blahblah
13 5544001140 11 F   492294 4  64001544250619844RA blahblah
14 5544001145 13 F   492295 4  64001544250619844RA blahblah
3
Jaap On

Two options with base R. Both use readLines to read the raw text data (see the end of this answer).

Option 1:

i <- grepl(pattern = '^4 ', x)
x1 <- strsplit(x[!i], '\\s+')
x2 <- strsplit(x[i], '\\s+')

d1 <- do.call(rbind.data.frame, x1)
d2 <- do.call(rbind.data.frame, x2)

d <- cbind(d1, d2[cumsum(i)[-which(i)],])
names(d) <- paste0('V',1:ncol(d))

which gives:

> d
            V1 V2 V3     V4 V5                  V6       V7
1   5416001130  1  F 492273  4 64001416230519844TP blahblah
1.1 5416001140  3  F 492274  4 64001416230519844TP blahblah
1.2 5416001145  1  F 492275  4 64001416230519844TP blahblah
1.3 5416001150 19  F 492276  4 64001416230519844TP blahblah
1.4 5416001155 21  F 492277  4 64001416230519844TP blahblah
1.5 5416001160 21  F 492278  4 64001416230519844TP blahblah
1.6 5416001165 13  F 492279  4 64001416230519844TP blahblah
1.7 5416001170  3  F 492280  4 64001416230519844TP blahblah
1.8 5416001180  1  F 492281  4 64001416230519844TP blahblah
2   5544001125  1  F 492291  4 64001544250619844RA blahblah
2.1 5544001130  3  F 492292  4 64001544250619844RA blahblah
2.2 5544001135  4  F 492293  4 64001544250619844RA blahblah
2.3 5544001140 11  F 492294  4 64001544250619844RA blahblah
2.4 5544001145 13  F 492295  4 64001544250619844RA blahblah

Option 2:

rawlist <- split(x, cumsum(grepl(pattern = '^4 ', x)))

l1 <- lapply(rawlist, function(x) read.table(text = x, skip = 1, header = FALSE))
l2 <- lapply(rawlist, function(x) read.table(text = x, nrows = 1, header = FALSE))
reps <- sapply(l1, nrow)

d1 <- do.call(rbind, l1)
d2 <- do.call(rbind, l2)[rep(1:length(l2), reps),]

d <- cbind(d1, d2)
names(d) <- paste0('V',1:ncol(d))

which gives:

> d
            V1 V2    V3     V4 V5                  V6       V7
1.1 5416001130  1 FALSE 492273  4 64001416230519844TP blahblah
1.2 5416001140  3 FALSE 492274  4 64001416230519844TP blahblah
1.3 5416001145  1 FALSE 492275  4 64001416230519844TP blahblah
1.4 5416001150 19 FALSE 492276  4 64001416230519844TP blahblah
1.5 5416001155 21 FALSE 492277  4 64001416230519844TP blahblah
1.6 5416001160 21 FALSE 492278  4 64001416230519844TP blahblah
1.7 5416001165 13 FALSE 492279  4 64001416230519844TP blahblah
1.8 5416001170  3 FALSE 492280  4 64001416230519844TP blahblah
1.9 5416001180  1 FALSE 492281  4 64001416230519844TP blahblah
2.1 5544001125  1 FALSE 492291  4 64001544250619844RA blahblah
2.2 5544001130  3 FALSE 492292  4 64001544250619844RA blahblah
2.3 5544001135  4 FALSE 492293  4 64001544250619844RA blahblah
2.4 5544001140 11 FALSE 492294  4 64001544250619844RA blahblah
2.5 5544001145 13 FALSE 492295  4 64001544250619844RA blahblah

Used data:

x <- readLines(textConnection('4  64001416230519844TP blahblah  
5416001130  1 F   492273
5416001140  3 F   492274
5416001145  1 F   492275
5416001150 19 F   492276
5416001155 21 F   492277
5416001160 21 F   492278
5416001165 13 F   492279
5416001170  3 F   492280
5416001180  1 F   492281
4  64001544250619844RA blahblah
5544001125  1 F   492291
5544001130  3 F   492292
5544001135  4 F   492293
5544001140 11 F   492294
5544001145 13 F   492295'))

To read your actual data, you can use something like:

x <- readLine('name-of-datafile.txt')
0
Aurèle On

Here is a possible base R solution that tries to be a little more memory efficient:

rawtext <- "4  64001416230519844TP blahblah  
5416001130  1 F   492273
5416001140  3 F   492274
5416001145  1 F   492275
5416001150 19 F   492276
5416001155 21 F   492277
5416001160 21 F   492278
5416001165 13 F   492279
5416001170  3 F   492280
5416001180  1 F   492281
4  64001544250619844RA blahblah
5544001125  1 F   492291
5544001130  3 F   492292
5544001135  4 F   492293
5544001140 11 F   492294
5544001145 13 F   492295"

First read the data once, and get the header line numbers. Note this can be done with a command line utility such as... grep, outside of R:

text <- readLines(textConnection(rawtext))
header_rows <- grep("^4", text)
lengths <- diff(c(header_rows, length(text) + 1)) - 1
rm(text)

Then actually re-read each piece, but only the necessary amount of lines:

do.call(rbind, mapply(
  function(skip, nrows, ...) data.frame(
    read.table(skip = skip, nrows = nrows, ...),
    read.table(skip = skip - 1, nrows = 1, ...)
  ),
  MoreArgs = list(text = rawtext),
  header_rows,
  lengths,
  SIMPLIFY = FALSE
))

#            V1 V2    V3     V4 V1.1                V2.1     V3.1
# 1  5416001130  1 FALSE 492273    4 64001416230519844TP blahblah
# 2  5416001140  3 FALSE 492274    4 64001416230519844TP blahblah
# 3  5416001145  1 FALSE 492275    4 64001416230519844TP blahblah
# 4  5416001150 19 FALSE 492276    4 64001416230519844TP blahblah
# 5  5416001155 21 FALSE 492277    4 64001416230519844TP blahblah
# 6  5416001160 21 FALSE 492278    4 64001416230519844TP blahblah
# 7  5416001165 13 FALSE 492279    4 64001416230519844TP blahblah
# 8  5416001170  3 FALSE 492280    4 64001416230519844TP blahblah
# 9  5416001180  1 FALSE 492281    4 64001416230519844TP blahblah
# 10 5544001125  1 FALSE 492291    4 64001544250619844RA blahblah
# 11 5544001130  3 FALSE 492292    4 64001544250619844RA blahblah
# 12 5544001135  4 FALSE 492293    4 64001544250619844RA blahblah
# 13 5544001140 11 FALSE 492294    4 64001544250619844RA blahblah
# 14 5544001145 13 FALSE 492295    4 64001544250619844RA blahblah