I read that using seq_along() handles the empty case much better, but the concept is not entirely clear to me.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Consider these three code snippets:
# Case 1
for (i in 1:ncol(df)) {
  print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
  print(median(df[[i]]))
}
# Case 3
for (i in df) print(median(i))
What is the difference between these approaches when the data.frame is populated and when it is empty?
With df <- data.frame(), Case 1 fails with a subscript-out-of-bounds error, while Cases 2 and 3 never execute their loop body.
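A minimal sketch to reproduce this behaviour on an empty data frame. Each loop is wrapped in tryCatch() so the script keeps running after the failure; the helper name run_case is hypothetical and only used for illustration.

```r
# Empty data frame: zero rows and zero columns
df <- data.frame()

# Hypothetical helper: run a loop and report whether it raised an error.
# The argument is evaluated lazily, so the loop only runs inside tryCatch().
run_case <- function(expr) {
  tryCatch({
    force(expr)
    "no error"
  }, error = function(e) "error")
}

case1 <- run_case(for (i in 1:ncol(df)) print(median(df[[i]])))
case2 <- run_case(for (i in seq_along(df)) print(median(df[[i]])))
case3 <- run_case(for (i in df) print(median(i)))

# Case 1 reports "error"; Cases 2 and 3 report "no error"
print(c(case1 = case1, case2 = case2, case3 = case3))
```

Nothing is printed by Cases 2 and 3 because their loop bodies are skipped entirely.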
In essence, the error in Case 1 arises because ncol(df) is 0. The sequence 1:ncol(df) therefore becomes 1:0, which counts down and creates the vector c(1, 0) rather than an empty one. The for loop takes the first element of that vector, 1, and tries to access column 1, which does not exist. Hence, the subscript is out of bounds.

Meanwhile, in Cases 2 and 3 the for loop body is never executed, because the collections being iterated over are empty; that is, they have a length of 0.

As this question specifically asks what seq_along() is doing, consider a traditional seq_along() example: construct a populated vector a and call seq_along(a). For each element of a, seq_along() creates a corresponding index that can be used to access it.

If we now apply seq_along() to the empty df from above, the result is a zero-length vector. It's mighty hard to move along a zero-length vector.

Ergo, Case 1 protects poorly against the empty case.
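The difference between the two index sequences can be checked directly. This is a small sketch; the values stored in a are arbitrary and chosen only for illustration. The last line uses seq_len(), which is not part of the comparison above but gives the same protection when you start from a count such as ncol(df).

```r
# 1:0 counts DOWN from 1 to 0, so it has two elements, not zero
print(1:0)                 # [1] 1 0
print(length(1:0))         # [1] 2

# Populated vector: seq_along() yields one index per element
a <- c(8, 9, 10)
print(seq_along(a))        # [1] 1 2 3

# Empty data frame: seq_along() yields a zero-length integer vector
df <- data.frame()
print(seq_along(df))       # integer(0)

# seq_len() behaves the same way when given a count of 0
print(seq_len(ncol(df)))   # integer(0)
```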
Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any developer to make), all three cases operate as expected. That is, you receive a median for each column of the data.frame.
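As a quick check of the populated case, the sketch below builds a data frame of random numbers (the seed and the use of rnorm() are assumptions made only so the example is reproducible, not the data shown in the question) and collects the medians from each of the three loops.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Case 1: index sequence built with 1:ncol(df)
m1 <- numeric(0)
for (i in 1:ncol(df)) m1 <- c(m1, median(df[[i]]))

# Case 2: index sequence built with seq_along(df)
m2 <- numeric(0)
for (i in seq_along(df)) m2 <- c(m2, median(df[[i]]))

# Case 3: iterate over the columns themselves
m3 <- numeric(0)
for (col in df) m3 <- c(m3, median(col))

# All three loops produce the same four medians
print(identical(m1, m2) && identical(m2, m3))  # [1] TRUE
```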