I am using the `partools` package to run linear regressions in parallel, via the `calm()` function, which is the package's parallel wrapper for R's `lm()`. I'm using 20 cores on a 64 GB node.

I receive errors when I run `calm()`, and I've isolated the problem to a single variable: `agelvl`. Since `partools` must split a dataset into chunks (the number of chunks equaling the number of cores to be used), variables, from what I can tell, are stored as either character or integer. `agelvl` is stored as a character because of its named levels, so I wrap it in `factor()` in the model formula.
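For context, the cluster setup looks roughly like this (a simplified sketch; `nat` is my data frame, and the worker count matches the cores on the node):

```r
library(parallel)
library(partools)

cls <- makeCluster(20)    # one worker per core
setclsinfo(cls)           # required partools bookkeeping on each worker
distribsplit(cls, 'nat')  # split the data frame nat into 20 chunks, one per worker
```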
Here's the code:

```r
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl), data = nat[nat$prec == 1,]')$tht
```
Here's the error:

```
Error in cabase(cls, ovf, coef, vcov) :
  likely cause is constant variable in some chunk
Calls: calm -> cabase
In addition: Warning message:
In f(init, x[[i]]) :
  number of columns of result is not a multiple of vector length (arg 2)
```
When I run the above code on my local machine (using 3 cores instead of 20), I can't reproduce the error. This suggests that the problem occurs in the chunking, specifically that a given level of `agelvl` is missing from one or more chunks.

However, here's a summary of `agelvl` in the unchunked data:
```
under 15 15-19 20-24 25-29 30-34 35-39 40-44 45-49
7440 336242 698606 770127 620437 267777 48342 2176
```
It seems unlikely to me that, when the data are split into 20 chunks, any one of those chunks would be missing any of these levels. I even checked each of the 20 chunks individually, and I don't see any levels missing:
```
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16732 34284 37552 30392 13225 2410 105 382
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16774 34906 38727 31012 13469 2445 113 386
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17007 34762 38820 31159 13311 2326 104 344
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16836 34839 38387 31251 13594 2429 91 405
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16621 35150 38519 31103 13470 2505 109 355
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16768 35020 38673 31034 13379 2467 97 395
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16724 35036 38376 31211 13473 2538 120 354
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34831 38714 31013 13486 2373 107 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34807 38845 30801 13532 2432 107 360
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16746 35042 38581 31184 13369 2381 130 400
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16796 35045 38616 31200 13351 2335 111 378
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16837 35298 38579 30858 13369 2424 106 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16882 34955 38529 31136 13403 2459 104 365
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16839 35096 38360 31210 13383 2462 106 376
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17109 35106 38450 30991 13322 2377 112 366
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16869 35118 38310 31083 13426 2530 122 374
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16850 34885 38768 31210 13284 2371 101 363
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16644 35086 38968 30840 13450 2378 103 364
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16707 35086 38762 31010 13371 2387 121 388
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16605 34254 37591 30739 13110 2313 107 363
```
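(For reference, the per-chunk counts above can be produced with something like the following, given chunks distributed via `distribsplit()` as sketched earlier:)

```r
# Tabulate agelvl within each worker's chunk of nat
clusterEvalQ(cls, table(nat$agelvl))
```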
Interestingly, when I split the data into 3 chunks and use 3 cores on the cluster instead of 20, it runs, just as it does on my local machine.

So, why does this problem occur with 20 cores but not with 3?
According to the author of `partools`, this could be a scaling issue: even if no levels of a categorical variable are missing from any one chunk, the error may still occur because the number of observations in a given level is both absolutely and relatively low.

**Solutions**
1. **Decrease the number of chunks.** Assuming there is a point at which the error disappears, you can decrease the number of chunks; however, this also means decreasing the number of cores you use, so that (a) each chunk may be so large that you run into memory problems, (b) the parallel processes now run too slowly, or (c) both. (See the first sketch below.)
2. **Alter the levels/variable structure.** You can leave the desired number of chunks/cores as-is and simply alter the levels so that each level has a critical number of observations. For `agelvl`, you could widen the intervals (10 years instead of 5) or, if possible, change age from a categorical variable to a continuous one. Keep in mind that such changes could alter the explanatory power of the model or cause the model to be misspecified. (See the second sketch below.)
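A minimal sketch of the first option, reusing the (assumed) setup names from above:

```r
# Rebuild the cluster with fewer workers and redistribute the data.
# The choice of 10 is arbitrary; the right number is whatever makes the
# error disappear without exhausting memory or running too slowly.
stopCluster(cls)
cls <- makeCluster(10)
setclsinfo(cls)
distribsplit(cls, 'nat')
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl), data = nat[nat$prec == 1,]')$tht
```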
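And a sketch of the second option, collapsing the 5-year bins into wider bins before splitting; the column name `agelvl10` is just for illustration:

```r
# Duplicated labels in factor() merge the corresponding levels, so the
# eight narrow bins become four wider ones; convert back to character
# so the variable is stored the way the partools chunks store it.
nat$agelvl10 <- as.character(factor(nat$agelvl,
  levels = c("under 15", "15-19", "20-24", "25-29",
             "30-34", "35-39", "40-44", "45-49"),
  labels = c("under 20", "under 20", "20-29", "20-29",
             "30-39", "30-39", "40-49", "40-49")))

# Redistribute and refit with the coarser variable
distribsplit(cls, 'nat')
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl10), data = nat[nat$prec == 1,]')$tht
```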