I have a batch job that reads through approximately 33 million rows in Cassandra, using the AllRowsReader
as described in the Astyanax wiki:
new AllRowsReader.Builder<>(getKeyspace(), columnFamily)
    .withPageSize(100)
    .withIncludeEmptyRows(false)
    .withConcurrencyLevel(1)
    .forEachRow(
        row -> {
            try {
                return processRow(row);
            } catch (Exception e) {
                LOG.error("Error while processing row!", e);
                return false;
            }
        }
    )
    .build()
    .call();
If some sort of error stops the batch job, I would like to be able to pick up and continue reading from the row where it stopped, so that I don't have to start reading from the first row again. Is there any fast and simple way to do this?
Or is the AllRowsReader not the right fit for this kind of task?
Since nobody has answered, let me try this one. Cassandra uses a partitioner to determine which node a row is placed on. There are mainly two types of partitioners: 1) ordered and 2) unordered.
https://docs.datastax.com/en/cassandra/2.2/cassandra/architecture/archPartitionerAbout.html
With an ordered partitioner, rows are placed in lexicographic order of their keys. With an unordered partitioner, rows are placed by a hash of the key, so you have no way to know the order from the keys themselves.
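To make the difference concrete, here is a minimal sketch in plain Java (no Cassandra or Astyanax calls). It sorts the same keys once lexicographically and once by an MD5-derived token. The class `PartitionerOrderDemo`, the helper `md5Token`, and the sample keys are all made up for illustration; the MD5 token only approximates what a hash-based partitioner does and is not Cassandra's actual implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionerOrderDemo {

    // Hypothetical helper: derives a non-negative token from the MD5 hash of the key.
    // This is an assumption made for illustration, not Cassandra's real token code.
    static BigInteger md5Token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Scan order under an ordered partitioner: keys sorted lexicographically.
    static List<String> orderedScan(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Scan order under a hash-based partitioner: keys sorted by their token.
    static List<String> hashedScan(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(PartitionerOrderDemo::md5Token));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user-003", "user-001", "user-004", "user-002");
        System.out.println("ordered: " + orderedScan(keys));
        System.out.println("hashed:  " + hashedScan(keys));
    }
}
```

The hashed order is stable from run to run, but it cannot be predicted from the keys alone, which is why "the last key I processed" does not tell you which keys come next.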
Ordered partitioners are regarded as an anti-pattern in Cassandra because they make it hard to distribute data evenly across the cluster. https://docs.datastax.com/en/cassandra/2.2/cassandra/planning/planPlanningAntiPatterns.html
I assume you are using an unordered partitioner. In that case, as far as I know, there is currently no way to tell Cassandra to start reading from a particular row, because the read order follows the hashed tokens rather than the row keys.
I hope this answers your question.