Using seq_along() to handle the empty case


I read that using seq_along() handles the empty case much better, but the concept is not clear to me.

For example, I have this data frame:

df
            a            b          c          d
1   1.2767671  0.133558438  1.5582137  0.6049921
2  -1.2133819 -0.595845408 -0.9492494 -0.9633872
3   0.4512179  0.425949910  0.1529301 -0.3012190
4   1.4945791  0.211932487 -1.2051334  0.1218442
5   2.0102918  0.135363711  0.2808456  1.1293810
6   1.0827021  0.290615747  2.5339719 -0.3265962
7  -0.1107592 -2.762735937 -0.2428827 -0.3340126
8   0.3439831  0.323193841  0.9623515 -0.1099747
9   0.3794022 -1.306189542  0.6185657  0.5889456
10  1.2966537 -0.004927108 -1.3796625 -1.1577800

Considering these three different code snippets:

# Case 1
for (i in 1:ncol(df)) {
    print(median(df[[i]]))
}

# Case 2
for (i in seq_along(df)) {
    print(median(df[[i]]))
}

# Case 3
for(i in df) print(median(i))

How do these three approaches differ when the data.frame contains data, and when it is empty?

Answer by coatless (best answer)

When df is empty, that is, df <- data.frame(), we have:

Case 1 failing with...

Error in .subset2(x, i, exact = exact) : subscript out of bounds

while Cases 2 and 3 raise no error, because their loop bodies never run.

In essence, the error in Case 1 arises because ncol(df) is 0. The sequence 1:ncol(df) therefore becomes 1:0, which counts down and creates the vector c(1, 0). On its first iteration the for loop sets i to 1 and evaluates df[[1]], but column 1 does not exist. Hence, the subscript is out of bounds.

Meanwhile, in Cases 2 and 3 the loop body is never executed because there is nothing to iterate over: seq_along(df) in Case 2 and df itself in Case 3 both have a length of 0.
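
To make the difference concrete, here is a minimal sketch that counts how many times each loop would run on an empty data frame. The names df_empty, bad_index, good_index, runs_bad and runs_good are introduced here purely for illustration and are not part of the original question.

```r
df_empty <- data.frame()            # zero rows, zero columns

bad_index  <- 1:ncol(df_empty)      # 1:0 counts DOWN, giving c(1, 0)
good_index <- seq_along(df_empty)   # one index per column, so none here

# Count the iterations without touching any column,
# so that nothing errors and the two can be compared.
runs_bad <- 0
for (i in bad_index) runs_bad <- runs_bad + 1

runs_good <- 0
for (i in good_index) runs_good <- runs_good + 1

runs_bad    # 2
runs_good   # 0
```

The Case 1 index asks for two iterations over a data frame that has no columns at all, which is why the very first df[[i]] fails.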

As this question is specifically about what seq_along() is doing, let's take a traditional seq_along example by constructing a full vector a and looking at the result:

set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5

In essence, seq_along creates one index for each element of the vector a.

If we apply seq_along now to the empty df in the above case, we get:

seq_along(df)
# integer(0)

Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.

Ergo, Case 1 offers poor protection against the empty case.
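
If you need to loop over a count rather than over an object, base R's seq_len() gives the same protection as seq_along(). The sketch below is one possible way to write the loop; print_medians is a hypothetical helper name used only for this illustration.

```r
# Hypothetical helper (not from the original answer):
# print one median per column and return the number of columns processed.
print_medians <- function(df) {
  for (i in seq_len(ncol(df))) {   # seq_len(0) is integer(0): no iterations
    print(median(df[[i]]))
  }
  invisible(ncol(df))
}

print_medians(data.frame())                 # prints nothing, raises no error
print_medians(data.frame(x = c(1, 2, 3)))   # prints [1] 2
```

Unlike 1:n, seq_len(n) never counts down, so a count of zero simply produces an empty sequence.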

Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any kind of developer to make)...

set.seed(1234)
# ncol = 4 gives 10 rows and 4 columns, matching the four medians below
df <- data.frame(matrix(rnorm(40), ncol = 4))

All three cases work as expected. That is, each prints one median per column of the data.frame:

[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
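
As a quick check that the three loops really do agree on non-empty data, the sketch below collects the medians into vectors instead of printing them. It assumes a data frame with four columns built from the same seed; the names m1, m2 and m3 are introduced here for illustration.

```r
set.seed(1234)
df <- data.frame(matrix(rnorm(40), ncol = 4))   # 10 rows, 4 columns

m1 <- numeric(0)
for (i in 1:ncol(df)) m1 <- c(m1, median(df[[i]]))      # Case 1

m2 <- numeric(0)
for (i in seq_along(df)) m2 <- c(m2, median(df[[i]]))   # Case 2

m3 <- numeric(0)
for (col in df) m3 <- c(m3, median(col))                # Case 3

identical(m1, m2)   # TRUE
identical(m2, m3)   # TRUE
```

The three cases only diverge when df has no columns, which is exactly the situation Case 1 does not guard against.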