Goroutines overhead and performance analysis when subsetting DataFrames (Gota)


I've been working since the beginning of 2016 on a Pandas/R-style DataFrame implementation for Go: https://github.com/kniren/gota.

Recently, I've been focusing on improving the performance of the library to try to match that of Pandas/Dplyr. You can follow the progress so far here: https://github.com/kniren/gota/issues/16

Since DataFrame subsetting is one of the most frequently used operations, I thought it would be a good idea to introduce concurrency to try to improve its performance.

Before:

columns := make([]series.Series, df.ncols)
for i, column := range df.columns {
    s := column.Subset(indexes)
    columns[i] = s
}

After:

columns := make([]series.Series, df.ncols)
var wg sync.WaitGroup
wg.Add(df.ncols)
for i := range df.columns {
    go func(i int) {
        columns[i] = df.columns[i].Subset(indexes)
        wg.Done()
    }(i)
}
wg.Wait()

As far as I understand, creating a goroutine for each column of a DataFrame should not introduce much overhead, so I was expecting at least a 2x speedup over the serial version (at least for large datasets). However, when benchmarking this change with datasets and indexes of different sizes, the results are very disappointing (benchmark names follow the pattern NROWSxNCOLS_INDEXSIZE-CPUCORES):

benchmark                                          old ns/op      new ns/op      delta
BenchmarkDataFrame_Subset/1000000x20_100           55230          109349         +97.99%
BenchmarkDataFrame_Subset/1000000x20_100-2         51457          67714          +31.59%
BenchmarkDataFrame_Subset/1000000x20_100-4         49845          70141          +40.72%
BenchmarkDataFrame_Subset/1000000x20_1000          518506         518085         -0.08%
BenchmarkDataFrame_Subset/1000000x20_1000-2        476661         311379         -34.67%
BenchmarkDataFrame_Subset/1000000x20_1000-4        505023         316583         -37.31%
BenchmarkDataFrame_Subset/1000000x20_10000         6621116        6314112        -4.64%
BenchmarkDataFrame_Subset/1000000x20_10000-2       7316062        4509601        -38.36%
BenchmarkDataFrame_Subset/1000000x20_10000-4       6483812        8394113        +29.46%
BenchmarkDataFrame_Subset/1000000x20_100000        105341711      106427967      +1.03%
BenchmarkDataFrame_Subset/1000000x20_100000-2      94567729       56778647       -39.96%
BenchmarkDataFrame_Subset/1000000x20_100000-4      91896690       60971444       -33.65%
BenchmarkDataFrame_Subset/1000000x20_1000000       1538680081     1632044752     +6.07%
BenchmarkDataFrame_Subset/1000000x20_1000000-2     1292113119     1100075806     -14.86%
BenchmarkDataFrame_Subset/1000000x20_1000000-4     1282367864     949615298      -25.95%
BenchmarkDataFrame_Subset/100000x20_100            50286          106850         +112.48%
BenchmarkDataFrame_Subset/100000x20_100-2          54537          70492          +29.26%
BenchmarkDataFrame_Subset/100000x20_100-4          58024          76617          +32.04%
BenchmarkDataFrame_Subset/100000x20_1000           541600         625967         +15.58%
BenchmarkDataFrame_Subset/100000x20_1000-2         493894         362894         -26.52%
BenchmarkDataFrame_Subset/100000x20_1000-4         535373         349211         -34.77%
BenchmarkDataFrame_Subset/100000x20_10000          6298063        7678499        +21.92%
BenchmarkDataFrame_Subset/100000x20_10000-2        5827185        4832560        -17.07%
BenchmarkDataFrame_Subset/100000x20_10000-4        8195048        3660077        -55.34%
BenchmarkDataFrame_Subset/100000x20_100000         105108807      82976477       -21.06%
BenchmarkDataFrame_Subset/100000x20_100000-2       92112736       58317114       -36.69%
BenchmarkDataFrame_Subset/100000x20_100000-4       92044966       63469935       -31.04%
BenchmarkDataFrame_Subset/1000x20_10               9741           53365          +447.84%
BenchmarkDataFrame_Subset/1000x20_10-2             9366           36457          +289.25%
BenchmarkDataFrame_Subset/1000x20_10-4             9463           46682          +393.31%
BenchmarkDataFrame_Subset/1000x20_100              50841          103523         +103.62%
BenchmarkDataFrame_Subset/1000x20_100-2            49972          62344          +24.76%
BenchmarkDataFrame_Subset/1000x20_100-4            72014          81808          +13.60%
BenchmarkDataFrame_Subset/1000x20_1000             457799         571292         +24.79%
BenchmarkDataFrame_Subset/1000x20_1000-2           460551         405116         -12.04%
BenchmarkDataFrame_Subset/1000x20_1000-4           462928         416522         -10.02%
BenchmarkDataFrame_Subset/1000x200_10              90125          688443         +663.88%
BenchmarkDataFrame_Subset/1000x200_10-2            85259          392705         +360.60%
BenchmarkDataFrame_Subset/1000x200_10-4            87412          387509         +343.31%
BenchmarkDataFrame_Subset/1000x200_100             486600         1082901        +122.54%
BenchmarkDataFrame_Subset/1000x200_100-2           471154         732304         +55.43%
BenchmarkDataFrame_Subset/1000x200_100-4           542846         659571         +21.50%
BenchmarkDataFrame_Subset/1000x200_1000            5926086        6686480        +12.83%
BenchmarkDataFrame_Subset/1000x200_1000-2          5364091        3986970        -25.67%
BenchmarkDataFrame_Subset/1000x200_1000-4          5904977        4504084        -23.72%
BenchmarkDataFrame_Subset/1000x2000_10             1187297        7800052        +556.96%
BenchmarkDataFrame_Subset/1000x2000_10-2           1217022        3930742        +222.98%
BenchmarkDataFrame_Subset/1000x2000_10-4           1301666        3617871        +177.94%
BenchmarkDataFrame_Subset/1000x2000_100            6942015        10790196       +55.43%
BenchmarkDataFrame_Subset/1000x2000_100-2          6588351        7592847        +15.25%
BenchmarkDataFrame_Subset/1000x2000_100-4          7067226        14391327       +103.63%
BenchmarkDataFrame_Subset/1000x2000_1000           62392457       69560711       +11.49%
BenchmarkDataFrame_Subset/1000x2000_1000-2         57793006       37416703       -35.26%
BenchmarkDataFrame_Subset/1000x2000_1000-4         59572261       58398203       -1.97%

benchmark                                          old allocs     new allocs     delta
BenchmarkDataFrame_Subset/1000000x20_100           41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_100-2         41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_100-4         41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000          41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000-2        41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000-4        41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000         41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000-2       41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_10000-4       41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000        41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000-2      41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_100000-4      41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000000       41             42             +2.44%
BenchmarkDataFrame_Subset/1000000x20_1000000-2     41             43             +4.88%
BenchmarkDataFrame_Subset/1000000x20_1000000-4     41             46             +12.20%
BenchmarkDataFrame_Subset/100000x20_100            41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_100-2          41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_100-4          41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_1000           41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_1000-2         41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_1000-4         41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_10000          41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_10000-2        41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_10000-4        41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_100000         41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_100000-2       41             42             +2.44%
BenchmarkDataFrame_Subset/100000x20_100000-4       41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_10               41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_10-2             41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_10-4             41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_100              41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_100-2            41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_100-4            41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_1000             41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_1000-2           41             42             +2.44%
BenchmarkDataFrame_Subset/1000x20_1000-4           41             42             +2.44%
BenchmarkDataFrame_Subset/1000x200_10              401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_10-2            401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_10-4            401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_100             401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_100-2           401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_100-4           401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_1000            401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_1000-2          401            402            +0.25%
BenchmarkDataFrame_Subset/1000x200_1000-4          401            402            +0.25%
BenchmarkDataFrame_Subset/1000x2000_10             4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_10-2           4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_10-4           4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_100            4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_100-2          4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_100-4          4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_1000           4001           4002           +0.02%
BenchmarkDataFrame_Subset/1000x2000_1000-2         4001           4010           +0.22%
BenchmarkDataFrame_Subset/1000x2000_1000-4         4001           4003           +0.05%

benchmark                                          old bytes     new bytes     delta
BenchmarkDataFrame_Subset/1000000x20_100           32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000000x20_100-2         32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000000x20_100-4         32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000000x20_1000          298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000000x20_1000-2        298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000000x20_1000-4        298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000000x20_10000         2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/1000000x20_10000-2       2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/1000000x20_10000-4       2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000        29083520      29083536      +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000-2      29083520      29083547      +0.00%
BenchmarkDataFrame_Subset/1000000x20_100000-4      29083542      29083563      +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000       290121600     290121616     +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000-2     290121600     290121696     +0.00%
BenchmarkDataFrame_Subset/1000000x20_1000000-4     290121600     290121840     +0.00%
BenchmarkDataFrame_Subset/100000x20_100            32400         32416         +0.05%
BenchmarkDataFrame_Subset/100000x20_100-2          32400         32416         +0.05%
BenchmarkDataFrame_Subset/100000x20_100-4          32400         32416         +0.05%
BenchmarkDataFrame_Subset/100000x20_1000           298880        298896        +0.01%
BenchmarkDataFrame_Subset/100000x20_1000-2         298880        298896        +0.01%
BenchmarkDataFrame_Subset/100000x20_1000-4         298880        298896        +0.01%
BenchmarkDataFrame_Subset/100000x20_10000          2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/100000x20_10000-2        2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/100000x20_10000-4        2971520       2971536       +0.00%
BenchmarkDataFrame_Subset/100000x20_100000         29083520      29083536      +0.00%
BenchmarkDataFrame_Subset/100000x20_100000-2       29083520      29083536      +0.00%
BenchmarkDataFrame_Subset/100000x20_100000-4       29083542      29083553      +0.00%
BenchmarkDataFrame_Subset/1000x20_10               4880          4896          +0.33%
BenchmarkDataFrame_Subset/1000x20_10-2             4880          4896          +0.33%
BenchmarkDataFrame_Subset/1000x20_10-4             4880          4896          +0.33%
BenchmarkDataFrame_Subset/1000x20_100              32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000x20_100-2            32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000x20_100-4            32400         32416         +0.05%
BenchmarkDataFrame_Subset/1000x20_1000             298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000x20_1000-2           298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000x20_1000-4           298880        298896        +0.01%
BenchmarkDataFrame_Subset/1000x200_10              49568         49584         +0.03%
BenchmarkDataFrame_Subset/1000x200_10-2            49568         49584         +0.03%
BenchmarkDataFrame_Subset/1000x200_10-4            49568         49585         +0.03%
BenchmarkDataFrame_Subset/1000x200_100             324768        324784        +0.00%
BenchmarkDataFrame_Subset/1000x200_100-2           324768        324784        +0.00%
BenchmarkDataFrame_Subset/1000x200_100-4           324768        324784        +0.00%
BenchmarkDataFrame_Subset/1000x200_1000            2989568       2989584       +0.00%
BenchmarkDataFrame_Subset/1000x200_1000-2          2989568       2989584       +0.00%
BenchmarkDataFrame_Subset/1000x200_1000-4          2989569       2989588       +0.00%
BenchmarkDataFrame_Subset/1000x2000_10             491072        491088        +0.00%
BenchmarkDataFrame_Subset/1000x2000_10-2           491072        491133        +0.01%
BenchmarkDataFrame_Subset/1000x2000_10-4           491072        491088        +0.00%
BenchmarkDataFrame_Subset/1000x2000_100            3243072       3243088       +0.00%
BenchmarkDataFrame_Subset/1000x2000_100-2          3243074       3243102       +0.00%
BenchmarkDataFrame_Subset/1000x2000_100-4          3243076       3243100       +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000           29891072      29891088      +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000-2         29891086      29891797      +0.00%
BenchmarkDataFrame_Subset/1000x2000_1000-4         29891115      29891167      +0.00%

Running the profiler (CPU/memory) over this benchmark didn't reveal anything significant. The concurrent version seems to spend some time in runtime.mach_semaphore_signal, but I guess this is to be expected when waiting for the goroutines to finish.

I've tried limiting the number of goroutines launched to the value reported by runtime.GOMAXPROCS(0), but the results are even worse. Am I doing something horribly wrong here, or is goroutine overhead really large enough to have such a significant effect on performance?

There is 1 answer

Igor Novgorodov

Goroutines are cheap, but not free.

I didn't read your code, but if you are spawning NCOLS_INDEXSIZE goroutines for each row you process, then that is very bad practice.

This can be seen in your benchmark where you have 2k columns and only 1k rows: you get a very big improvement. But in all other cases, when the number of columns << the number of rows, goroutine spawning becomes a bottleneck.
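One rough way to observe the spawning cost is to time the same tiny task run in a plain loop and run with one goroutine per task. The sketch below is only an illustration: the work function, the slice size and the task count are all made-up assumptions, not taken from Gota, and wall-clock timings like these are noisy compared with a proper go test -bench run.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// work is a deliberately tiny, hypothetical task: sum a small slice.
func work(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// runSerial runs the task n times in a plain loop.
func runSerial(xs []int, n int) []int {
	out := make([]int, n)
	for i := range out {
		out[i] = work(xs)
	}
	return out
}

// runSpawned runs the task n times, starting one goroutine per task.
func runSpawned(xs []int, n int) []int {
	out := make([]int, n)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := range out {
		go func(i int) {
			defer wg.Done()
			out[i] = work(xs) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	xs := make([]int, 10)
	for i := range xs {
		xs[i] = i
	}
	const n = 2000 // assumed task count, chosen only for illustration

	start := time.Now()
	runSerial(xs, n)
	serial := time.Since(start)

	start = time.Now()
	runSpawned(xs, n)
	spawned := time.Since(start)

	fmt.Printf("serial: %v, one goroutine per task: %v\n", serial, spawned)
}
```

When each task is this small, the scheduling and synchronization work per goroutine can easily outweigh the task itself. The exact numbers depend on the machine and the Go version.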

Instead, you should spawn a pool of goroutines (sized close to your CPU count) and distribute work among them through channels; that is the canonical way. You may want to read https://blog.golang.org/pipelines
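A minimal sketch of such a pool, applied to column subsetting, could look like the following. The column type and its subset method are hypothetical stand-ins for series.Series and Subset so that the example compiles on its own; the real Gota types are not reproduced here.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// column is a hypothetical stand-in for series.Series: just a slice of values.
type column []float64

// subset returns the elements of c at the given positions.
func (c column) subset(indexes []int) column {
	out := make(column, len(indexes))
	for i, idx := range indexes {
		out[i] = c[idx]
	}
	return out
}

// subsetColumns subsets every column using a fixed pool of worker goroutines
// that receive column positions over a channel.
func subsetColumns(cols []column, indexes []int, workers int) []column {
	if workers < 1 {
		workers = 1
	}
	out := make([]column, len(cols))
	jobs := make(chan int) // positions of the columns still to be processed

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each position is sent once, so no two workers write the same slot.
				out[i] = cols[i].subset(indexes)
			}
		}()
	}

	for i := range cols {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
	return out
}

func main() {
	cols := []column{{1, 2, 3, 4}, {10, 20, 30, 40}, {100, 200, 300, 400}}
	res := subsetColumns(cols, []int{3, 0}, runtime.GOMAXPROCS(0))
	fmt.Println(res)
}
```

The number of goroutines stays fixed regardless of how many columns there are, but every column still costs one channel send. For very small index sets the serial loop may remain faster, so it is worth benchmarking both before committing to either.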