Issues with Cassandra when performing a large number of writes


We are trying to write a large number of records (upwards of 5 million at a time) into Cassandra. The records are read from tab-delimited files and imported using executeAsync. We have also been using much smaller datasets (~330k records), which will be more common. Until recently, our script was silently stopping its import at around 65k records. Since upgrading the RAM from 2 GB to 4 GB the number of records imported has doubled, but we are still not successfully importing all of them.

This is an example of the process we are running at present:

$cluster = \Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session = $cluster->connect('example_data');

$statement = $session->prepare("INSERT INTO example_table (example_id, column_1, column_2, column_3, column_4, column_5, column_6) VALUES (uuid(), ?, ?, ?, ?, ?, ?)");
$futures = array();
$data = array();

foreach ($results as $row) {
   $data = array($row['column_1'], $row['column_2'], $row['column_3'], $row['column_4'], $row['column_5'], $row['column_6']);
   $futures = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
       'arguments' => $data
   )));
}

We suspect that this might be down to the heap running out of space:

DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,105  ColumnFamilyStore.java:1153 - Flushing largest CFS(Keyspace='dev', ColumnFamily='example_data') to free up room. Used total: 0.67/0.00, live: 0.33/0.00, flushing: 0.33/0.00, this: 0.20/0.00
DEBUG [SlabPoolCleaner] 2017-02-27 17:01:17,133  ColumnFamilyStore.java:854 - Enqueuing flush of example_data: 89516255 (33%) on-heap, 0 (0%) off-heap

The table we are inserting this data into is as follows:

CREATE TABLE example_data (
  example_id uuid PRIMARY KEY,
  column_1 int,
  column_2 varchar,
  column_3 int,
  column_4 varchar,
  column_5 int,
  column_6 int
);
CREATE INDEX column_5 ON example_data (column_5);
CREATE INDEX column_6 ON example_data (column_6);

We have attempted to use the batch method (sketched below), but believe it is not appropriate here as it causes the Cassandra process to run at a high level of CPU usage (~85%).
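For reference, the batch approach we tried looked roughly like this (a simplified sketch; $chunk stands in for a slice of the rows being imported):

// Unlogged batch built from the same prepared statement as above.
$batch = new \Cassandra\BatchStatement(\Cassandra::BATCH_UNLOGGED);

foreach ($chunk as $row) {
    $batch->add($statement, array($row['column_1'], $row['column_2'], $row['column_3'],
                                  $row['column_4'], $row['column_5'], $row['column_6']));
}

$session->execute($batch);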

We are using the latest version of DSE/Cassandra available from the repository.

Cassandra 3.0.11.1564 | DSE 5.0.6

1 Answer

Answered by Chris Lohfink:

2 GB (and really 4 GB) is not even the recommended minimum for Cassandra in development or production. Running on it is possible, but it requires more tweaking since it's below what the defaults are tuned for. Even tweaked, you shouldn't expect much performance before it starts having issues keeping up (the errors you're getting) and you need to add more nodes.

https://docs.datastax.com/en/landing_page/doc/landing_page/planning/planningHardware.html

  • Production: 32 GB to 512 GB; the minimum is 8 GB for Cassandra only and 32 GB for DataStax Enterprise analytics and search nodes.
  • Development in non-loading testing environments: no less than 4 GB.
  • DSE Graph: 2 to 4 GB in addition to your particular combination of DSE Search or DSE Analytics. If you want a large dedicated graph cache, add more RAM.

Also, you're spamming writes with executeAsync and not applying any backpressure. Eventually you will overrun any system like that. You either need to add some kind of throttling or feedback, or just use synchronous requests.
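For example, one simple way to apply that backpressure with the PHP driver is to collect the futures and wait on them every N requests, so only a bounded number of writes are ever in flight (a rough sketch; the chunk size of 500 is arbitrary):

$futures = array();

foreach ($results as $row) {
    $data = array($row['column_1'], $row['column_2'], $row['column_3'],
                  $row['column_4'], $row['column_5'], $row['column_6']);

    // Queue the write and remember its future.
    $futures[] = $session->executeAsync($statement, new \Cassandra\ExecutionOptions(array(
        'arguments' => $data
    )));

    // Once enough requests are in flight, block until they all complete
    // before issuing any more.
    if (count($futures) >= 500) {
        foreach ($futures as $future) {
            $future->get();
        }
        $futures = array();
    }
}

// Wait for whatever is still outstanding.
foreach ($futures as $future) {
    $future->get();
}

The simplest alternative is to call execute() instead of executeAsync() inside the loop, which trades throughput for built-in backpressure.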