I am using JDBC to process some data where I am using gemfireXD as DB which is an InMemory data grid and a NewSQL solution.
Step1 - I am getting all distinct keys(field1) from a table (Table1)
Statement stmt = conn.createStatement();
stmt.execute("select distinct field1 from Table1");
ResultSet rs = stmt.getResultSet();
List<String> keyList = new ArrayList<String> ();
while (rs.next()) {
keyList.add(rs.getString("field1"));
}
Step2 - I am querying the same table for each key from the above list. This should return me 2 records for each key.
I need to process these 2 records for each key and create one final processed record out of these 2 rows for the key, which finally need to be stored in a separate table (table2).
for (int i = 0; i< fieldList.size(); i++) {
stmt.execute("select * from Table1 where field1=" + keyList.get(i));
ResultSet rs = stmt.getResultSet();
// the resultset should contain 2(or more) rows for the key which need to be processed based on some logic.
// Finally need to create 1 processed record for the given key.
// Insert the processed record to Table2
}
Since I have millions of records in Table1, The above way of processing data is time consuming.
I need to parallelize the processing in gemfireXD.
I can do this same processing in PIG or SPARK within few hours using their parallel processing features(creating Tuples/bags of records and using MR programming).
I am hoping gemfireXD must have some facilities to do parallel processing of data, though I have tried DATAAWARE procedure in gemfireXD and used it to parallelize the procedure call, but it did not work for me.
I am hoping gemfire XD must have some other approach for this purpose
Could any one please suggest any implementation approach in gemfireXD to achieve the desired result?
I missed that you're using a NoSQL database. Still, my answer may apply.
You (most probably) don't. You surely need to access the DB efficiently:
For example, use the statement
read two rows (or, more generally, read until
field1
changes) and process them. Save yourself millions of queries!While
ORDER BY
may be inefficient for a NoSQL database, you can surely do better than millions of queries as they mean millions of round trips with waiting and waiting. For example, you could issue 1000 range queries loading 1000 rows into memory each.(*) IMHO all performance-related questions using something like
where field1=" + keyList.get(i)
should be closed immediately. Concatenated SQL is plain wrong.