Since the beginning of 2016 I've been working on a Pandas/R-style DataFrame implementation for Go: https://github.com/kniren/gota.
Recently, I've been focusing on improving the performance of the library to try to match that of Pandas/Dplyr. You can follow the progress so far here: https://github.com/kniren/gota/issues/16
Since DataFrame subsetting is one of the most frequently used operations, I thought that introducing concurrency there could be a good way to improve performance.
Before:
columns := make([]series.Series, df.ncols)
for i, column := range df.columns {
	s := column.Subset(indexes)
	columns[i] = s
}
After:
columns := make([]series.Series, df.ncols)
var wg sync.WaitGroup
wg.Add(df.ncols)
for i := range df.columns {
	// Each goroutine writes to a distinct index of columns, so no lock is needed.
	go func(i int) {
		columns[i] = df.columns[i].Subset(indexes)
		wg.Done()
	}(i)
}
wg.Wait()
As far as I understand, creating a goroutine for each of the columns of a DataFrame should not introduce much overhead, so I was expecting at least a 2x speedup over the serial version, at least for large datasets. However, benchmarking this change with datasets and indexes of different sizes gave very disappointing results (benchmark names follow the pattern NROWSxNCOLS_INDEXSIZE-CPUCORES):
benchmark old ns/op new ns/op delta
BenchmarkDataFrame_Subset/1000000x20_100 55230 109349 +97.99%
BenchmarkDataFrame_Subset/1000000x20_100-2 51457 67714 +31.59%
BenchmarkDataFrame_Subset/1000000x20_100-4 49845 70141 +40.72%
BenchmarkDataFrame_Subset/1000000x20_1000 518506 518085 -0.08%
BenchmarkDataFrame_Subset/1000000x20_1000-2 476661 311379 -34.67%
BenchmarkDataFrame_Subset/1000000x20_1000-4 505023 316583 -37.31%
BenchmarkDataFrame_Subset/1000000x20_10000 6621116 6314112 -4.64%
BenchmarkDataFrame_Subset/1000000x20_10000-2 7316062 4509601 -38.36%
BenchmarkDataFrame_Subset/1000000x20_10000-4 6483812 8394113 +29.46%
BenchmarkDataFrame_Subset/1000000x20_100000 105341711 106427967 +1.03%
BenchmarkDataFrame_Subset/1000000x20_100000-2 94567729 56778647 -39.96%
BenchmarkDataFrame_Subset/1000000x20_100000-4 91896690 60971444 -33.65%
BenchmarkDataFrame_Subset/1000000x20_1000000 1538680081 1632044752 +6.07%
BenchmarkDataFrame_Subset/1000000x20_1000000-2 1292113119 1100075806 -14.86%
BenchmarkDataFrame_Subset/1000000x20_1000000-4 1282367864 949615298 -25.95%
BenchmarkDataFrame_Subset/100000x20_100 50286 106850 +112.48%
BenchmarkDataFrame_Subset/100000x20_100-2 54537 70492 +29.26%
BenchmarkDataFrame_Subset/100000x20_100-4 58024 76617 +32.04%
BenchmarkDataFrame_Subset/100000x20_1000 541600 625967 +15.58%
BenchmarkDataFrame_Subset/100000x20_1000-2 493894 362894 -26.52%
BenchmarkDataFrame_Subset/100000x20_1000-4 535373 349211 -34.77%
BenchmarkDataFrame_Subset/100000x20_10000 6298063 7678499 +21.92%
BenchmarkDataFrame_Subset/100000x20_10000-2 5827185 4832560 -17.07%
BenchmarkDataFrame_Subset/100000x20_10000-4 8195048 3660077 -55.34%
BenchmarkDataFrame_Subset/100000x20_100000 105108807 82976477 -21.06%
BenchmarkDataFrame_Subset/100000x20_100000-2 92112736 58317114 -36.69%
BenchmarkDataFrame_Subset/100000x20_100000-4 92044966 63469935 -31.04%
BenchmarkDataFrame_Subset/1000x20_10 9741 53365 +447.84%
BenchmarkDataFrame_Subset/1000x20_10-2 9366 36457 +289.25%
BenchmarkDataFrame_Subset/1000x20_10-4 9463 46682 +393.31%
BenchmarkDataFrame_Subset/1000x20_100 50841 103523 +103.62%
BenchmarkDataFrame_Subset/1000x20_100-2 49972 62344 +24.76%
BenchmarkDataFrame_Subset/1000x20_100-4 72014 81808 +13.60%
BenchmarkDataFrame_Subset/1000x20_1000 457799 571292 +24.79%
BenchmarkDataFrame_Subset/1000x20_1000-2 460551 405116 -12.04%
BenchmarkDataFrame_Subset/1000x20_1000-4 462928 416522 -10.02%
BenchmarkDataFrame_Subset/1000x200_10 90125 688443 +663.88%
BenchmarkDataFrame_Subset/1000x200_10-2 85259 392705 +360.60%
BenchmarkDataFrame_Subset/1000x200_10-4 87412 387509 +343.31%
BenchmarkDataFrame_Subset/1000x200_100 486600 1082901 +122.54%
BenchmarkDataFrame_Subset/1000x200_100-2 471154 732304 +55.43%
BenchmarkDataFrame_Subset/1000x200_100-4 542846 659571 +21.50%
BenchmarkDataFrame_Subset/1000x200_1000 5926086 6686480 +12.83%
BenchmarkDataFrame_Subset/1000x200_1000-2 5364091 3986970 -25.67%
BenchmarkDataFrame_Subset/1000x200_1000-4 5904977 4504084 -23.72%
BenchmarkDataFrame_Subset/1000x2000_10 1187297 7800052 +556.96%
BenchmarkDataFrame_Subset/1000x2000_10-2 1217022 3930742 +222.98%
BenchmarkDataFrame_Subset/1000x2000_10-4 1301666 3617871 +177.94%
BenchmarkDataFrame_Subset/1000x2000_100 6942015 10790196 +55.43%
BenchmarkDataFrame_Subset/1000x2000_100-2 6588351 7592847 +15.25%
BenchmarkDataFrame_Subset/1000x2000_100-4 7067226 14391327 +103.63%
BenchmarkDataFrame_Subset/1000x2000_1000 62392457 69560711 +11.49%
BenchmarkDataFrame_Subset/1000x2000_1000-2 57793006 37416703 -35.26%
BenchmarkDataFrame_Subset/1000x2000_1000-4 59572261 58398203 -1.97%
benchmark old allocs new allocs delta
BenchmarkDataFrame_Subset/1000000x20_100 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_100-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_100-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000000 41 42 +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000000-2 41 43 +4.88%
BenchmarkDataFrame_Subset/1000000x20_1000000-4 41 46 +12.20%
BenchmarkDataFrame_Subset/100000x20_100 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_100-2 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_100-4 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_1000 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_1000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_1000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_10000 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_10000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_10000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_100000 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_100000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/100000x20_100000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_10 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_10-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_10-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_100 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_100-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_100-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_1000 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_1000-2 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x20_1000-4 41 42 +2.44%
BenchmarkDataFrame_Subset/1000x200_10 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_10-2 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_10-4 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_100 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_100-2 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_100-4 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_1000 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_1000-2 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x200_1000-4 401 402 +0.25%
BenchmarkDataFrame_Subset/1000x2000_10 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_10-2 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_10-4 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_100 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_100-2 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_100-4 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_1000 4001 4002 +0.02%
BenchmarkDataFrame_Subset/1000x2000_1000-2 4001 4010 +0.22%
BenchmarkDataFrame_Subset/1000x2000_1000-4 4001 4003 +0.05%
benchmark old bytes new bytes delta
BenchmarkDataFrame_Subset/1000000x20_100 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000000x20_100-2 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000000x20_100-4 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000000x20_1000 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000000x20_1000-2 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000000x20_1000-4 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000000x20_10000 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/1000000x20_10000-2 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/1000000x20_10000-4 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000 29083520 29083536 +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000-2 29083520 29083547 +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000-4 29083542 29083563 +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000 290121600 290121616 +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000-2 290121600 290121696 +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000-4 290121600 290121840 +0.00%
BenchmarkDataFrame_Subset/100000x20_100 32400 32416 +0.05%
BenchmarkDataFrame_Subset/100000x20_100-2 32400 32416 +0.05%
BenchmarkDataFrame_Subset/100000x20_100-4 32400 32416 +0.05%
BenchmarkDataFrame_Subset/100000x20_1000 298880 298896 +0.01%
BenchmarkDataFrame_Subset/100000x20_1000-2 298880 298896 +0.01%
BenchmarkDataFrame_Subset/100000x20_1000-4 298880 298896 +0.01%
BenchmarkDataFrame_Subset/100000x20_10000 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/100000x20_10000-2 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/100000x20_10000-4 2971520 2971536 +0.00%
BenchmarkDataFrame_Subset/100000x20_100000 29083520 29083536 +0.00%
BenchmarkDataFrame_Subset/100000x20_100000-2 29083520 29083536 +0.00%
BenchmarkDataFrame_Subset/100000x20_100000-4 29083542 29083553 +0.00%
BenchmarkDataFrame_Subset/1000x20_10 4880 4896 +0.33%
BenchmarkDataFrame_Subset/1000x20_10-2 4880 4896 +0.33%
BenchmarkDataFrame_Subset/1000x20_10-4 4880 4896 +0.33%
BenchmarkDataFrame_Subset/1000x20_100 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000x20_100-2 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000x20_100-4 32400 32416 +0.05%
BenchmarkDataFrame_Subset/1000x20_1000 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000x20_1000-2 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000x20_1000-4 298880 298896 +0.01%
BenchmarkDataFrame_Subset/1000x200_10 49568 49584 +0.03%
BenchmarkDataFrame_Subset/1000x200_10-2 49568 49584 +0.03%
BenchmarkDataFrame_Subset/1000x200_10-4 49568 49585 +0.03%
BenchmarkDataFrame_Subset/1000x200_100 324768 324784 +0.00%
BenchmarkDataFrame_Subset/1000x200_100-2 324768 324784 +0.00%
BenchmarkDataFrame_Subset/1000x200_100-4 324768 324784 +0.00%
BenchmarkDataFrame_Subset/1000x200_1000 2989568 2989584 +0.00%
BenchmarkDataFrame_Subset/1000x200_1000-2 2989568 2989584 +0.00%
BenchmarkDataFrame_Subset/1000x200_1000-4 2989569 2989588 +0.00%
BenchmarkDataFrame_Subset/1000x2000_10 491072 491088 +0.00%
BenchmarkDataFrame_Subset/1000x2000_10-2 491072 491133 +0.01%
BenchmarkDataFrame_Subset/1000x2000_10-4 491072 491088 +0.00%
BenchmarkDataFrame_Subset/1000x2000_100 3243072 3243088 +0.00%
BenchmarkDataFrame_Subset/1000x2000_100-2 3243074 3243102 +0.00%
BenchmarkDataFrame_Subset/1000x2000_100-4 3243076 3243100 +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000 29891072 29891088 +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000-2 29891086 29891797 +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000-4 29891115 29891167 +0.00%
Running the profiler (CPU and memory) over this benchmark didn't reveal anything significant. The concurrent version seems to spend some time in runtime.mach_semaphore_signal, but I guess this is to be expected when waiting for the goroutines to finish.
I've tried limiting the number of goroutines launched to the number of cores as reported by runtime.GOMAXPROCS(0), but the results are even worse. Am I doing something horribly wrong here, or is the goroutine overhead really big enough to have such a significant effect on performance?
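For readers following along, one way such a limit might look is sketched below. The code that was actually tried is not shown in the post, so this is only an illustration under stated assumptions: `column`, `subset`, and `subsetChunked` are hypothetical stand-ins, not gota's API. The columns are split into at most runtime.GOMAXPROCS(0) contiguous chunks, and one goroutine handles each chunk.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// column is a hypothetical stand-in for gota's series.Series.
type column []float64

// subset returns the elements of c at the given positions.
func (c column) subset(indexes []int) column {
	out := make(column, len(indexes))
	for i, idx := range indexes {
		out[i] = c[idx]
	}
	return out
}

// subsetChunked splits cols into at most GOMAXPROCS contiguous chunks
// and subsets each chunk in its own goroutine.
func subsetChunked(cols []column, indexes []int) []column {
	out := make([]column, len(cols))
	workers := runtime.GOMAXPROCS(0)
	if workers > len(cols) {
		workers = len(cols)
	}
	if workers == 0 {
		return out // no columns, nothing to do
	}
	chunk := (len(cols) + workers - 1) / workers // ceiling division
	var wg sync.WaitGroup
	for start := 0; start < len(cols); start += chunk {
		end := start + chunk
		if end > len(cols) {
			end = len(cols)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine owns a disjoint range of out, so no lock is needed.
			for i := start; i < end; i++ {
				out[i] = cols[i].subset(indexes)
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	cols := []column{{1, 2, 3, 4}, {10, 20, 30, 40}, {100, 200, 300, 400}}
	fmt.Println(subsetChunked(cols, []int{3, 0})) // prints [[4 1] [40 10] [400 100]]
}
```

With this layout the number of goroutines no longer grows with the number of columns, but each call still pays for starting and joining them, which may matter when the index is small.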
Goroutines are cheap, but not free.
I haven't read your code, but if you are spawning NCOLS_INDEXSIZE goroutines for each row you process, that is very bad practice.
This can be seen in your benchmark where you have 2k columns and only 1k rows - you get a very big improvement. But in all other cases, when the number of columns << the number of rows, goroutine spawning becomes a bottleneck.
Instead, you should spawn a pool of goroutines (close to your CPU count) and distribute work among them through channels; that is the canonical way. You may want to read https://blog.golang.org/pipelines
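A minimal sketch of that worker-pool pattern follows. It is not taken from gota: `column`, `subset`, and `subsetPool` are hypothetical names used only for illustration. A fixed number of goroutines (runtime.GOMAXPROCS(0) here, as an assumption for "close to your CPU count") receive column positions over a channel and write results into a preallocated slice.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// column is a hypothetical stand-in for gota's series.Series.
type column []float64

// subset returns the elements of c at the given positions.
func (c column) subset(indexes []int) column {
	out := make(column, len(indexes))
	for i, idx := range indexes {
		out[i] = c[idx]
	}
	return out
}

// subsetPool distributes column positions to a fixed pool of workers
// through a channel; each worker subsets the columns it receives.
func subsetPool(cols []column, indexes []int) []column {
	out := make([]column, len(cols))
	jobs := make(chan int)
	var wg sync.WaitGroup

	workers := runtime.GOMAXPROCS(0)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each position is delivered to exactly one worker,
			// so writes to out never overlap.
			for i := range jobs {
				out[i] = cols[i].subset(indexes)
			}
		}()
	}

	for i := range cols {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
	return out
}

func main() {
	cols := []column{{1, 2, 3, 4}, {10, 20, 30, 40}, {100, 200, 300, 400}}
	fmt.Println(subsetPool(cols, []int{1, 2})) // prints [[2 3] [20 30] [200 300]]
}
```

Note that this sketch still starts the pool on every call and pays one channel send per column. When the work per column is small, a long-lived pool or sending batches of columns per message may be worth measuring.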