R - How to extract text between string and blank line?


I am using readLines to read the content of the following text file:

*--------------------------------------------------------------------*
* 7. Measured data                                                   *
*    And option to force measured LAI during simulation              *
*    (instead of using simulated values)                             *
*--------------------------------------------------------------------*
* Observed phenology: only required if program DRATES is run!!
IDOYTR = 194    ! Day of transplanting (give 0 if direct-seeded)
IYRTR  = 1991  ! Year of transplanting (give 0 if direct-seeded)
IDOYPI = 240   ! Day of panicle initiation (give -99 if not observed)
IYRPI  = 1991  ! Year of panicle initiation (give -99 if not observed)
IDOYFL = 260   ! Day of flowering
IYRFL  = 1991  ! Year of flowering
IDOYM  = 288   ! Day of maturity
IYRM   = 1991  ! Year of maturity


*Leaf Area Index (m2 leaf / m2 ground):
LAI_OBS =
1991., 182.,  0.00 ,
1991., 194.,  0.028,
1991., 202.,  0.185,
1991., 211.,  0.325,
1991., 219.,  1.048,
1991., 240.,  3.680,
1991., 254.,  5.010,
1991., 260.,  4.628,
1991., 273.,  3.520,
1991., 288.,  1.938

*-- Parameter to set forcing of observed LAI during simulation
LAI_FRC = 0       ! No forcing 
*LAI_FRC = 2      ! Forcing 

And I need to programmatically extract only the block of text identified by LAI_OBS =. The line number where LAI_OBS = is located varies from file to file. Therefore, I need to find a way to read all the text between the string LAI_OBS = and the next blank line.

So far I am using:

l <- readLines('file.txt')
which(l == 'LAI_OBS =')

I can identify the initial line of the block I need to extract, but I don't know how to instruct R to go to the first blank line after LAI_OBS =.

The result I need is a data frame looking like this:

1991  182  0.00
1991  194  0.028
1991  202  0.185
1991  211  0.325
1991  219  1.048
1991  240  3.680
1991  254  5.010
1991  260  4.628
1991  273  3.520
1991  288  1.938

What is a convenient way to do this in R? Thanks.


There are 3 answers

1
akrun On BEST ANSWER

Get the index of the "LAI_OBS =" line. Here == would work because the line matches exactly; if the text can vary (extra spaces, different case), grep is more useful. Then get the indices of the blank elements with nzchar and pick the first one that is greater than 'i1'. Take the sequence from 'i1' to 'i3' (after adjusting, i.e. adding 1 to skip the header line and subtracting 1 to stop before the blank line), remove the extra characters with sub/gsub, and read the result with read.csv.

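i1 <- grep("LAI_OBS =", l) + 1   # first data line, just after the header
i2 <- which(!nzchar(l))          # indices of all empty lines
i3 <- i2[i2 > i1][1] - 1         # last data line, just before the next empty line
# drop the trailing comma, turn "1991.," into "1991,", then parse as CSV
read.csv(text = gsub("\\.,", ",", sub("\\s*,$", "", l[i1:i3])), header = FALSE)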
#    V1  V2    V3
#1  1991 182 0.000
#2  1991 194 0.028
#3  1991 202 0.185
#4  1991 211 0.325
#5  1991 219 1.048
#6  1991 240 3.680
#7  1991 254 5.010
#8  1991 260 4.628
#9  1991 273 3.520
#10 1991 288 1.938
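
The steps above can be wrapped in a small helper so they are reusable across files. This is only a sketch: the name extract_block and its arguments are made up for illustration, and it assumes the block ends at the first empty or whitespace-only line after the key (or at the end of the file if there is none).

```r
# Hypothetical helper (not part of base R): return the rows that follow
# `key` up to the next blank line as a data frame.
extract_block <- function(lines, key = "LAI_OBS =") {
  start <- grep(key, lines, fixed = TRUE)[1]      # first line containing the key
  if (is.na(start)) stop("key not found: ", key)
  blank <- which(!nzchar(trimws(lines)))          # empty or whitespace-only lines
  end <- blank[blank > start][1]                  # first blank line after the key
  if (is.na(end)) end <- length(lines) + 1        # no blank line: run to end of file
  if (end - start < 2) stop("no data rows after key")
  block <- lines[(start + 1):(end - 1)]
  block <- sub("\\s*,$", "", block)               # drop the trailing comma
  block <- gsub("\\.,", ",", block)               # "1991.," -> "1991,"
  read.csv(text = block, header = FALSE)
}

# Inline sample standing in for readLines('file.txt')
sample_lines <- c("IYRM   = 1991  ! Year of maturity", "", "LAI_OBS =",
                  "1991., 182.,  0.00 ,", "1991., 194.,  0.028,",
                  "1991., 288.,  1.938", "", "LAI_FRC = 0")
lai <- extract_block(sample_lines)
```

With a real file, the call would be extract_block(readLines('file.txt')).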
0
Matt S On

This works. It's not elegant, but it gets the job done:

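l <- readLines('data.txt')
first <- which(l == 'LAI_OBS =')       # line holding the block header
blanks <- which(l == '')               # all blank lines
whichblank <- which(blanks > first)    # positions in blanks that come after the header
last <- blanks[whichblank]             # blank line(s) following the header
first
last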

outputs:

[1] 18
[1] 29

Of course, if there are more blank lines further down the file, last will hold several indices, so take only the first one: blanks[whichblank[1]]. The data rows are then l[(first + 1):(last - 1)].
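
To get from the two indices to the data frame the question asks for, one option is to parse the lines strictly between them. The following is a sketch under a few assumptions: l is re-created from a short inline sample so the snippet runs on its own, scan() is used here instead of read.csv, and the column names year, doy and lai are invented for illustration.

```r
# Inline sample standing in for readLines('data.txt')
l <- c("* header", "LAI_OBS =", "1991., 182.,  0.00 ,", "1991., 194.,  0.028,",
       "1991., 288.,  1.938", "", "LAI_FRC = 0", "")

first  <- which(l == "LAI_OBS =")
blanks <- which(l == "")
last   <- blanks[blanks > first][1]      # first blank line after the header

rows <- l[(first + 1):(last - 1)]        # data rows only

# Replace commas with spaces so scan() can split on whitespace;
# "1991." is read as the number 1991.
values <- scan(text = paste(gsub(",", " ", rows), collapse = " "), quiet = TRUE)

lai <- as.data.frame(matrix(values, ncol = 3, byrow = TRUE))
names(lai) <- c("year", "doy", "lai")    # hypothetical column names
```

Splitting on whitespace sidesteps the trailing comma that appears on every row except the last.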

0
Tim Biegeleisen On

From what I gather, the tricky part about your input file is identifying where the input data ends. One approach is to continue down your current path and use which again to match the line that follows the data:

*-- Parameter to set forcing of observed LAI during simulation

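idx1 <- which(obs.lai == 'LAI_OBS =')
idx2 <- which(substring(obs.lai, 1, 20) == '*-- Parameter to set')

# obs.lai is a character vector (from readLines), so it takes a single index.
# Parenthesize idx2 - 1: ':' binds tighter than '-', so idx1:idx2-1 would shift the whole range.
# This keeps the 'LAI_OBS =' line and the trailing blank line; use (idx1 + 1):(idx2 - 2) for the data rows only.
df.keep <- obs.lai[idx1:(idx2 - 1)]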

Note that if the file has multiple lines beginning with the 20 characters I attempt to match, you might have to increase the length of the substring. My hunch is that the full line would be unique because it refers to LAI simulation.
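
One detail worth checking when slicing by computed indices in R: the : operator binds more tightly than binary -, so an unparenthesized end index shifts the whole range instead of shortening it. A minimal illustration, with arbitrary values chosen for idx1 and idx2:

```r
idx1 <- 3
idx2 <- 6

shifted <- idx1:idx2 - 1      # parsed as (3:6) - 1, i.e. 2 3 4 5
trimmed <- idx1:(idx2 - 1)    # 3 4 5

print(shifted)
print(trimmed)
```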