I am dealing with time series data for different individuals in a wide format. The number of time points differs between individuals, and I need the last element for each individual.
I was thinking of using a list as a column in my tibble to store the time series sequences. (Putting each time point into a different column is probably not a good idea, since there can be hundreds of possible time points while an individual may have data for only a handful of them. However, the data per individual are always measured at consecutive time points.)
Let's call it column1, i.e.:
library(tibble)
# Create an example dataframe
df <- tibble(
column1 = list(1:3, 1:4, 4:8)
)
Now, I would like to use vectorization for the sake of efficiency and speed, but is it even possible with the given data structure? There is a function called map() in the purrr package, with which the operation would go like:
library(purrr)
# Use the map function to select the last element of each vector
last_elements <- map(df$column1, ~ .x[length(.x)])
But this is not vectorization, but rather looping through the elements of the list (stored as column1), right?
Would there be a better (i.e. faster / more efficient) choice for a data structure than a list as a column? Or is this in general the best way to handle this kind of situation?
It is important to understand that vectorization in R means that you do not need to write a loop, because somebody else already wrote the loop for you. The important part is that eventually there will be some sort of loop.

The reason why relying on off-the-shelf vectorization is "often" faster than writing the loop yourself is that the loop in the former case is done at the C level, which can be faster, but sometimes the difference is not even noticeable.

A benchmark difference can look like a lot, but be aware of the unit: basically the loop finishes in 0.32 sec while the R loop takes 0.0000001 sec. In relative terms this is a huge improvement, but in absolute terms you will barely see a difference.

The bottom line is that you should not be afraid of R-level loops, because they are far better than their reputation. If there is a natural replacement for a loop, go for it by all means, but do not hesitate to use loops when the price for avoiding them is an over-bloated data structure.

Now to your problem at hand.
You could transform your data into long format and take the last row by individual. You can go even further and use data.table with an appropriate key.

The tidyverse long format solution is worse than the purrr loop; the data.table long format solution is better (not accounting for the overhead of creating the data structure); the filter method lies in between; and base loops are on par with purrr, so even the slight overhead of purrr is negligible.

There will be even more solutions, but from this first quick test, it seems to me that the list solution is not that bad overall and, in my opinion equally (if not even more) importantly, quite easy to read and understand.

TL;DR: Don't be afraid of loops. If there is no straightforward natural vectorization method, use them by all means.
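
For reference, here is a minimal sketch of how the approaches discussed above might look on the example data from the question. The variable names (series, long, last_purrr and so on) are made up for illustration, and this is not the exact code that was benchmarked. The fourth variant, indexing the flattened vector with cumulative lengths, is an extra idea that is not among the solutions compared above; it is included only to show what a loop-free version could look like.

```r
library(purrr)
library(data.table)

# Example list column from the question (kept as a plain list here)
series <- list(1:3, 1:4, 4:8)

# 1. purrr loop; map_int() returns an integer vector instead of a list
last_purrr <- map_int(series, ~ .x[length(.x)])

# 2. Base R loop via vapply()
last_base <- vapply(series, function(x) x[length(x)], integer(1))

# 3. Long format with a keyed data.table: one row per (id, time point),
#    then take the last row within each id
long <- data.table(
  id    = rep(seq_along(series), lengths(series)),
  value = unlist(series)
)
setkey(long, id)
last_dt <- long[, .(value = value[.N]), by = id]$value

# 4. (Assumption, not from the answer) flatten once and index:
#    the last element of each series sits at the cumulative sum
#    of the series lengths
last_flat <- unlist(series)[cumsum(lengths(series))]
```

All four variants should return the same values; which one is fastest will depend on the size of the data, so it is worth timing them on your own data, for example with system.time().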