I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL Databases based on an abstract data model an abstract queries to achieve fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows: I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want. However, since Cassandra does not support these operations (which is perfectly fine) I need to achieve this on another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination. The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots on RAM for both, cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this on a faster way? I could imagine using the partition_key, which serves also as the cassandra partition key and do this for every partition (I have 5 of them).
Also I though of doing this multithreaded by assigning every partition and report to a seperate thread and running it parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this? Maybe I still think too much in a SQL-like way.
Thank you for you support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
The above query will look for records for the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When get anything less, you ignore the tk, and do greater on sk. Here is the query (Q3):
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in cassandra where key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. Now, you also need a column called amounts which is a list. As you are scanning the allocate table (using the approach above), for each row, you do the following:
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
Sorry if my description is confusing. Please let me know if this helps.