We are using a "small" memory-only Aerospike server to store website analytics for the last hour. The data for the last hour is about 10 GB.
We tried to run some aggregation queries against Aerospike from a separate server (Java-based client), something like this (in Lua):
stream : aggregate( map(), complex_aggregate_function ) : reduce( simple_reduce_function )
According to the documentation, all aggregation is done on the Aerospike nodes (a single node in our case), and the reduce step runs on the client.
It turns out that the aggregate() function processes only a small batch of data at a time, i.e. 10-16 records. After that, the aggregation result is sent to the client to be processed by reduce().
Since the reduce() operation is executed on the client, the server would send at least 1/16 of the data to the client, i.e. hundreds of megabytes for our data. Talk about performance.
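To illustrate the behavior described above, here is a rough Python simulation, not Aerospike code. The function names, the page-view records, and the batch size of 16 are assumptions made for this sketch; it only shows how per-batch aggregation leaves the client with one partial result per batch to reduce.

```python
from collections import Counter
from functools import reduce


def aggregate_in_batches(records, batch_size):
    """Simulated server side: fold each batch of records into one partial result."""
    partials = []
    for start in range(0, len(records), batch_size):
        partial = Counter()
        for page in records[start:start + batch_size]:
            partial[page] += 1  # stand-in for complex_aggregate_function
        partials.append(partial)
    return partials


def merge(a, b):
    """Simulated client side: stand-in for simple_reduce_function."""
    return a + b


# 1000 hypothetical page-view records
records = ["/home", "/about", "/home", "/pricing"] * 250

partials = aggregate_in_batches(records, batch_size=16)
total = reduce(merge, partials)

print(len(partials))    # number of partial results the client has to reduce
print(total["/home"])   # final count after the client-side reduce
```

With a batch size equal to the number of records, the same sketch produces a single partial result, which is the "reduce once per node" behavior asked about below.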
Is it possible to change the "buffer size" or "queue size" or "whatever size" for record stream aggregation? I.e. is it possible to "tune" Aerospike to call the reduce() function only once per node?
There are two aspects to this problem: the query batch size and the query buffer size.
Query batch size determines the number of records returned in a single batch by the query. For example, if your query returns 1000 records and your query batch size is 1000, all the results are returned in a single response. If your query batch size is 100, it takes 10 batches to return the entire result set.
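The arithmetic behind that example can be sketched in a few lines of Python. The helper name batches_needed is made up for illustration and is not an Aerospike setting or API; it just rounds the record count up to whole batches.

```python
import math


def batches_needed(total_records, batch_size):
    """Number of batches required to return total_records, rounding up."""
    return math.ceil(total_records / batch_size)


print(batches_needed(1000, 1000))  # 1 batch: everything fits in one response
print(batches_needed(1000, 100))   # 10 batches
```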
Refer to http://www.aerospike.com/docs/operations/manage/queries/ for further details.
Similarly, you can increase query-buf-size to increase the size of the buffer. A larger buffer results in a lower batch count.