I read that using seq_along() handles the empty case much better, but this concept is not so clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
  print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
  print(median(df[[i]]))
}
# Case 3
for (i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
while Cases 2 and 3 are not triggered.
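To see this behaviour directly, here is a minimal sketch that runs each loop against an empty data frame. The helper run_case() and the use of tryCatch() to capture the error are illustrative additions, not part of the original snippets.

```r
# Empty data frame: zero rows, zero columns.
df <- data.frame()

# Hypothetical helper: evaluate a loop and report whether it raised an error.
run_case <- function(expr) {
  tryCatch({
    expr          # forces the lazily evaluated loop
    "no error"
  }, error = function(e) conditionMessage(e))
}

case1 <- run_case(for (i in 1:ncol(df)) print(median(df[[i]])))
case2 <- run_case(for (i in seq_along(df)) print(median(df[[i]])))
case3 <- run_case(for (i in df) print(median(i)))

print(case1)  # an error message about an out-of-bounds subscript
print(case2)  # "no error": the loop body never runs
print(case3)  # "no error": the loop body never runs
```

Only the first loop enters its body, so it is the only one that can fail.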
In essence, the error in Case 1 is due to ncol(df) being 0. This leads the sequence 1:ncol(df) to be 1:0, which creates the vector c(1, 0). In this case, the for loop takes the first element of that vector, 1, and tries to access column 1, which does not exist. Hence, the subscript is found to be out of bounds.
Meanwhile, in Cases 2 and 3 the for loop is never executed, since the collections being iterated over are empty. That is, they have a length of 0.
As this question specifically relates to what the heck is happening with seq_along(), let's take a traditional seq_along example by constructing a full vector a and calling seq_along(a). The result is the integer sequence 1, 2, ..., length(a).
In essence, for each element of the vector a, there is a corresponding index created by seq_along that can be used to access it.
If we apply seq_along now to the empty df in the above case, we get integer(0).
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
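The index vectors behind each loop can also be inspected on their own. This is a small sketch: the vector a below is made up for illustration (the original example's values are not shown), and seq_len() is included only as a related base R function that the question did not ask about.

```r
a <- c(8, 9, 10)              # hypothetical non-empty vector
idx_full <- seq_along(a)      # one index per element: 1 2 3

df <- data.frame()            # the empty case
idx_colon <- 1:ncol(df)       # 1:0 counts DOWN, giving 1 0
idx_along <- seq_along(df)    # integer(0)
idx_len   <- seq_len(ncol(df))  # integer(0) as well

print(idx_full)
print(idx_colon)
print(length(idx_along))      # 0
print(length(idx_len))        # 0
```

If a loop really needs a count rather than an object to walk along, seq_len(n) appears to be the safer counterpart to 1:n, since it yields an empty vector when n is 0.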
Now, under the traditional assumption, that is, that there is some data within the data.frame, which is a very bad assumption for any kind of developer to make...
All three cases would operate as expected. That is, you would receive a median per column of the data.frame.
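As a quick check of that claim, the sketch below builds a small data frame with made-up values (not the one from the question) and collects the medians from each of the three loop styles so they can be compared.

```r
# Hypothetical non-empty data frame with easy-to-verify medians.
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40), c = c(-1, 0, 5))

m1 <- numeric(0)
for (i in 1:ncol(df)) m1 <- c(m1, median(df[[i]]))      # Case 1

m2 <- numeric(0)
for (i in seq_along(df)) m2 <- c(m2, median(df[[i]]))   # Case 2

m3 <- numeric(0)
for (i in df) m3 <- c(m3, median(i))                    # Case 3

print(m1)                                      # 2 20 0
print(identical(m1, m2) && identical(m2, m3))  # TRUE
```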