I have a large text file that contains multiple data frames, separated by header lines, that I am trying to read into R. The first header line contains the time variable. I want to separate the data frames according to their time variable. The data looks like this:
data = c("** TIME: 41670", "** PROPERTY: Pressure", "** UNITS: psi",
"< X > < Y > <Layer 1>", " 2106604.41 3434119.83 7952.25",
" 2111884.40 3434119.83 7970.05", " 2037964.57 3439399.82 7658.27",
" 2043244.56 3439399.82 7754", " 2048524.55 3439399.82 7828.24",
" 2053804.53 3439399.82 7879.78", " 2059084.52 3439399.82 7914.57",
" 2064364.50 3439399.82 7944.66", " 2069644.49 3439399.82 7974.44",
" 2074924.48 3439399.82 7999.03", " 2080204.46 3439399.82 8014.14",
" 2085484.46 3439399.82 8016.27", " 2090764.46 3439399.82 8005.63",
"", "", "** TIME: 41670", "** PROPERTY: Pressure", "** UNITS: psi",
"< X > < Y > <Layer 2>", " 2106604.41 3434119.83 8038.52",
" 2111884.40 3434119.83 8066.89", " 2037964.57 3439399.82 7723.84",
" 2043244.56 3439399.82 7821.79", " 2048524.55 3439399.82 7899.46",
" 2053804.53 3439399.82 7955.23", " 2059084.52 3439399.82 7993.75",
" 2064364.50 3439399.82 8026.08", " 2069644.49 3439399.82 8056.41",
" 2074924.48 3439399.82 8080.33", " 2080204.46 3439399.82 8094.15",
" 2085484.46 3439399.82 8095.07", " 2090764.46 3439399.82 8084.03",
" 2096044.44 3439399.82 8068.33", " 2101324.41 3439399.82 8060.14",
" 2106604.41 3439399.82 8073.08", " 2111884.40 3439399.82 8107.82",
" 2117164.38 3439399.82 8145.84", " 2122444.37 3439399.82 8160.57"
)
I am using readLines to read in the text file.
I would ideally like a list with the timestamp attached to a data frame of values like:
[[1]]$date
[1] "2014-01-31"
[[1]]$data
X Y Layer1
1 2106604.41 3434119.83 7952.25
2 2111884.40 3434119.83 7970.05
3 2037964.57 3439399.82 7658.27
4 2043244.56 3439399.82 7754
[[2]]$date
[1] "2014-01-31"
[[2]]$data
X Y Layer2
1 2106604.41 3434119.83 8038.52
2 2111884.40 3434119.83 8066.89
3 2037964.57 3439399.82 7723.84
4 2043244.56 3439399.82 7821.79
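For reference, the date shown above is the TIME value 41670 read as a day count. A minimal sketch of that conversion, assuming the count starts at 1899-12-30 (the origin commonly used for Excel-style serial dates; whether this file was written that way is an assumption). The names header_line, serial_day and block_date are made up for illustration:

```r
# Header line as it appears in the file
header_line <- "** TIME: 41670"

# Strip the "** TIME:" prefix and keep only the number
serial_day <- as.numeric(trimws(sub("\\*{2}\\s+TIME:\\s+", "", header_line)))

# Assumption: serial_day counts days since 1899-12-30
block_date <- as.Date(serial_day, origin = "1899-12-30")

print(format(block_date, "%Y-%m-%d"))
```

Under that assumption, 41670 maps to 2014-01-31, which is the date I expect to see attached to each data frame.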
This is what I tried:
data <- readLines("tmp.txt")
# Initialize an empty list to store data frames
dfs <- list()
# Initialize variables
current_time <- NULL
current_df <- NULL
property <- NULL
# Loop through each line of the file
for (line in data) {
if (startsWith(line, "** TIME:")) {
# Extract the time from the header line and convert to datetime
current_time <- as.Date(as.numeric(trimws(sub("\\*{2}\\s+TIME:\\s+", "", line))), origin = "1899-12-30", format = "%Y-%m-%d")
# Create a new data frame for the current time
current_df <- data.frame()
} else if (startsWith(line, "** PROPERTY:")) {
next
} else if (startsWith(line, "** UNITS:")) {
next
} else if (startsWith(line, "<")) {
# Extract column names from header line 4
clean_header <- gsub("<|>", "", line)
clean_header <- trimws(clean_header)
col_names <- strsplit(clean_header, " ")
col_names <- unlist(col_names)
col_names <- col_names[col_names != ""]
col_names[3] <- paste0(col_names[3], col_names[4])
col_names <- col_names[-4]
} else if (!startsWith(line, "**")) {
# Split the line by whitespace and create a new row in the data frame
parts <- strsplit(line, "\\s+")[[1]]
parts <- parts[parts != ""]
current_df <- rbind(current_df, as.numeric(parts))
} else {
# End of current data frame, store it in the list
colnames(current_df) <- col_names
dfs[[length(dfs) + 1]] <- list(date = current_time, data = current_df)
current_df <- NULL
}
}
The code builds current_df, which holds the most recent data frame the loop has processed, but the column names are never added. In addition, current_df is never saved to the dfs list, so it is overwritten by the next current_df as the loop continues.
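One way to narrow this down is to check how the data-row test in the loop classifies each kind of line. The sketch below is only a diagnostic, not part of the script. It uses a small made-up vector (sample_lines is hypothetical) and compares the existing test with a stricter variant that also requires a non-empty line:

```r
# Hypothetical mini input: one data row, one blank separator, one header line
sample_lines <- c(" 2106604.41 3434119.83 7952.25", "", "** TIME: 41670")

# The data-row test used in the loop: TRUE means "treat this line as data"
is_data_original <- !startsWith(sample_lines, "**")

# A stricter variant that additionally requires the line to be non-empty
is_data_strict <- nchar(sample_lines) > 0 & !startsWith(sample_lines, "**")

# Show both classifications side by side
print(data.frame(line = sample_lines,
                 original = is_data_original,
                 strict = is_data_strict))
```

With the existing test, the blank separator falls into the data branch, so the final else branch that stores the data frame is never reached for it.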
Very close! Only three small changes are needed:
1. The condition !startsWith(line, "**") also matches an empty line. This means that empty lines were being treated as data, and the final condition (which finalizes the data frame and adds it to the list) was never reached. I changed the condition to nchar(line) > 0 & !startsWith(line, "**").
2. The final condition sets current_df to NULL, and then the next time the loop runs, the operations on current_df fail. To avoid this problem, I changed the final else to else if (!is.null(current_df)).
3. The last element of data contains a row of data; this means that the final condition never runs for the last data frame, and therefore the last data frame never gets added to the list. I added one more "" line to data to fix this problem. (Alternatively, we could copy the contents of that last condition and run them one more time after the whole loop has run.)

Here's what the code looks like with those three changes: