data.table result of by and join not sorted as I expected, why? Is it for efficiency?

104 views Asked by At

I just discovered that summarizing a data table (I am relatively new to r and data.table) doesn't result in a sorted data tabe, nor does a join:

starting data table:

> DailyDataICount
                     ID week WeekDay      Qty
 1:               33739 1746       2     9369
 2:               33739 1746       3     9981
 3:               33739 1746       4    10109
 4:               33739 1746       5     9012
 5:               33739 1746       6    10387
 6:               33739 1746       7        2
 7:               33739 1747       1     7264
 8:               33739 1747       2     9195
 9:               33739 1747       3     8865
10:               33739 1747       4     8613
11:               33739 1747       5     9154
12:               33739 1747       6     9754
13:               33739 1747       7    11565
14:               33739 1748       1     9210
15:               33739 1748       2     8579
16:               33739 1748       3     8406
17:               33739 1748       4     8161
18:               33739 1748       5     8778
19:               33739 1748       6     9884
20:               33739 1748       7    10365

summarization:

> MeanCount <- DailyDataICount[,mean(Qty),by="ID,WeekDay"]
> setnames(MeanCount,"V1","ObjQty")
> MeanCount
                    ID WeekDay    ObjQty
1:               33739       2  9047.667
2:               33739       3  9084.000
3:               33739       4  8961.000
4:               33739       5  8981.333
5:               33739       6 10008.333
6:               33739       7  7310.667
7:               33739       1  8237.000

join:

> Count_SM_FM
                     ID WeekDay ItemType  QtyType
 1:               33739       2       FM      995
 2:               33739       2       SM    15241
 3:               33739       3       FM     1036
 4:               33739       3       SM    15387
 5:               33739       4       FM     1002
 6:               33739       4       SM    15449
 7:               33739       5       FM      990
 8:               33739       5       SM    15388
 9:               33739       6       FM     1344
10:               33739       6       SM    16016
11:               33739       7       SM    13840
12:               33739       1       FM     1015
13:               33739       1       SM    11456
14:               33739       7       FM     1710
> setkey(Count_SM_FM,ID,WeekDay)
> Count_SM_FM <- Count_SM_FM[MeanCount]
> 
> Count_SM_FM
                     ID WeekDay ItemType  QtyType    ObjQty
 1:               33739       2       FM      995  9047.667
 2:               33739       2       SM    15241  9047.667
 3:               33739       3       FM     1036  9084.000
 4:               33739       3       SM    15387  9084.000
 5:               33739       4       FM     1002  8961.000
 6:               33739       4       SM    15449  8961.000
 7:               33739       5       FM      990  8981.333
 8:               33739       5       SM    15388  8981.333
 9:               33739       6       FM     1344 10008.333
10:               33739       6       SM    16016 10008.333
11:               33739       7       SM    13840  7310.667
12:               33739       7       FM     1710  7310.667
13:               33739       1       FM     1015  8237.000
14:               33739       1       SM    11456  8237.000

is it because it's more efficient? I found this out because I had to join Count_SM_FM with a different data.table but using the same key (which I hadn't set again) and I got an error. I will solve the immediate problem by setting the key again. I know this is a bit of philosophical question, not keeping the order probably usually saves time but if in most cases people need to use the results as sorted...

Thanks!

1

There are 1 answers

0
Arun On BEST ANSWER

The order of elements in a join of the form x[i] is determined by i. In your case, i is MeanCount and x is Count_SM_FM and i is not sorted.

To get the result you seek, you should just do:

setkey(Count_SM_FM, ID, WeekDay)
setkey(MeanCount, ID, WeekDay)   ## Just add this line
ans = Count_SM_FM[MeanCount]
#        ID WeekDay ItemType QtyType    ObjQty
#  1: 33739       1       FM    1015  8237.000
#  2: 33739       1       SM   11456  8237.000
#  3: 33739       2       FM     995  9047.667
#  4: 33739       2       SM   15241  9047.667
#  5: 33739       3       FM    1036  9084.000
#  6: 33739       3       SM   15387  9084.000
#  7: 33739       4       FM    1002  8961.000
#  8: 33739       4       SM   15449  8961.000
#  9: 33739       5       FM     990  8981.333
# 10: 33739       5       SM   15388  8981.333
# 11: 33739       6       FM    1344 10008.333
# 12: 33739       6       SM   16016 10008.333
# 13: 33739       7       SM   13840  7310.667
# 14: 33739       7       FM    1710  7310.667