GemFire XD - How to parallelize data processing for a larger data set


I am using JDBC to process some data, with GemFire XD as the database; it is an in-memory data grid and a NewSQL solution.

Step 1 - I am getting all distinct keys (field1) from a table (Table1):

             Statement stmt = conn.createStatement();
             stmt.execute("select distinct field1 from Table1");
             ResultSet rs = stmt.getResultSet();
             List<String> keyList = new ArrayList<String> ();
             while (rs.next()) {
                   keyList.add(rs.getString("field1"));
             }

Step 2 - I am querying the same table for each key from the above list. This should return 2 records for each key.

I need to process these 2 records for each key and create one final processed record out of these 2 rows, which then needs to be stored in a separate table (Table2).

    for (int i = 0; i < keyList.size(); i++) {
        stmt.execute("select * from Table1 where field1=" + keyList.get(i));
        ResultSet rs = stmt.getResultSet();

        // the resultset should contain 2(or more) rows for the key which need to be processed based on some logic.
        // Finally need to create 1 processed record for the given key.
        // Insert the processed record to Table2
    }

Since I have millions of records in Table1, the above way of processing data is time-consuming.

I need to parallelize the processing in gemfireXD.

I can do this same processing in Pig or Spark within a few hours using their parallel-processing features (creating tuples/bags of records and MR-style programming).

I am hoping GemFire XD has some facilities for parallel processing of data. I have tried a DATA AWARE procedure in GemFire XD and used it to parallelize the procedure call, but it did not work for me.

I am hoping GemFire XD has some other approach for this purpose.

Could anyone please suggest an implementation approach in GemFire XD to achieve the desired result?


There are 3 answers

maaartinus

I missed that you're using a NewSQL database. Still, my answer may apply.

I need to parallelize the processing in gemfireXD.

You (most probably) don't. You surely need to access the DB efficiently:

  • use prepared statements (*)
  • use some SQL power

For example, use the statement

select * from Table1 order by field1

read two rows (or, more generally, read until field1 changes) and process them. Save yourself millions of queries!
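The read-until-field1-changes loop can be sketched in plain Java. Here a List stands in for the ResultSet of the ordered query, and process() is a hypothetical placeholder for the real two-row logic from the question:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupByKey {
    // Stands in for one row of Table1: the key plus the rest of the record.
    record Row(String field1, String payload) {}

    // Placeholder for the real logic that turns the 2+ rows of one key
    // into a single processed record (here: key plus concatenated payloads).
    static String process(List<Row> group) {
        StringBuilder sb = new StringBuilder(group.get(0).field1()).append(":");
        for (Row r : group) sb.append(r.payload());
        return sb.toString();
    }

    // Walk rows already sorted by field1 (as "select * from Table1 order by
    // field1" would return them) and emit one processed record per key.
    static List<String> processSorted(List<Row> sorted) {
        List<String> out = new ArrayList<>();
        List<Row> group = new ArrayList<>();
        for (Row r : sorted) {
            if (!group.isEmpty() && !group.get(0).field1().equals(r.field1())) {
                out.add(process(group));   // key changed: flush previous group
                group.clear();
            }
            group.add(r);
        }
        if (!group.isEmpty()) out.add(process(group)); // flush the last group
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("k1", "a"), new Row("k1", "b"),
                new Row("k2", "c"), new Row("k2", "d"));
        System.out.println(processSorted(rows)); // prints [k1:ab, k2:cd]
    }
}
```

Against the database, the same loop reads from the ResultSet while rs.next(), flushing whenever rs.getString("field1") differs from the previous value: one query, one pass.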


While ORDER BY may be inefficient for a distributed database, you can surely do better than millions of queries, as they mean millions of round trips with waiting and waiting. For example, you could issue 1000 range queries, each loading 1000 rows into memory.


(*) IMHO all performance-related questions using something like where field1=" + keyList.get(i) should be closed immediately. Concatenated SQL is plain wrong.

William Markito

If I understand your requirements correctly, you may be looking for ways to process only the keys that are local to each node, so that you can distribute the processing as you said, using DAP (Data Aware Procedures) on a partitioned table.

Partitioning will take care of sharding your data; you can then process it using a DAP that queries local data (local to that particular node) and inserts the results into table2. The other consideration here is colocation, since you may need the two tables to be colocated according to some business logic.
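To make this concrete, a hedged sketch of the DDL involved: Table1, Table2, and field1 come from the question, while the column types, the procedure name, and the implementing Java class are assumptions.

```sql
-- Partition both tables on the processing key so all rows for one key
-- live on the same member, and colocate Table2 with Table1.
CREATE TABLE Table1 (
  field1 VARCHAR(100),
  field2 VARCHAR(100)
) PARTITION BY COLUMN (field1);

CREATE TABLE Table2 (
  field1 VARCHAR(100),
  processed VARCHAR(200)
) PARTITION BY COLUMN (field1)
  COLOCATE WITH (Table1);

-- Register the processing logic (a Java class you write) as a procedure.
CREATE PROCEDURE process_keys()
  LANGUAGE JAVA PARAMETER STYLE JAVA
  MODIFIES SQL DATA
  EXTERNAL NAME 'com.example.ProcessKeys.process';

-- Data-aware call: execution is routed to the members hosting Table1's
-- data, so each node processes only its local rows, in parallel.
CALL process_keys() ON TABLE Table1;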

Please take a look at the DAP merge sort example.

If you are still having problems with DAPs, please share the details or the exceptions you are getting.

Hope that helps.


buqing

Why not use a join instead of processing the rows one by one?

GemFire XD can perform joins efficiently when the two tables are colocated.

I recommend you create another table to store the result of your first query, partition it using the same partition key as your original table, and then join the two to get your result. Let's call your original table table1 and the temporary table tmp. You can get your result using the following query; it should be much faster than your iterative method:

select s.* from table1 s join tmp t on s.field1=t.field1
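A sketch of the setup this answer implies; the VARCHAR size is a placeholder, and table1 is assumed to already be partitioned on field1:

```sql
-- tmp holds the distinct keys and is colocated with table1 so the join
-- below can run locally on each member.
CREATE TABLE tmp (
  field1 VARCHAR(100)
) PARTITION BY COLUMN (field1)
  COLOCATE WITH (table1);

INSERT INTO tmp
  SELECT DISTINCT field1 FROM table1;

-- Colocated join: no data movement between members.
SELECT s.* FROM table1 s JOIN tmp t ON s.field1 = t.field1;
```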