Running a SELECT query with ALLOW FILTERING on Amazon Keyspaces returns no data

356 views Asked by At

We have implemented cassandra using the aws keyspace service, and using the cassandra-driver for node. The client works fine and we are able to perform create and update operations on the data. However, running a simple cql query using the client returns no data (empty rows).

When I run the exact same query on cql editor on aws dashboard, it works fine and does return the required data.

Query:

SELECT * FROM <TABLE_NAME> WHERE product_id = '<PRODUCT_ID>' ALLOW FILTERING

Running the same query on the cql editor on aws dashboard works fine.

2

There are 2 answers

0
MikeJPR On

Yes you are using part of the partition key in your query statement. Keyspaces will filter the rows on the storage layer. After scanning a given amount it will return back to the client. If the filter criteria did not find results it will result in an empty page. A page with no rows. It's a safe guard to avoid unbounded request time. You can continue to iterate over the pages until you reach the end of the iterator.

The approach above is useful when grabbing first X rows, but for full table scans I would recommend using AWS Glue.

In the follow example we perform the same product lookup but with Spark and Glue. Under the hood Spark will parse, and paginate through the results. Glue will provision the memory and compute resources. In this job we export the results to s3. General full table scan can be TBs in size. This architecture will work for small or large tables since it uses serverless resources.

Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val tableName = args("TABLE_NAME")
    val keyspaceName = args("KEYSPACE_NAME")
    val backupS3 = args("S3_URI")
    val backupFormat = args("FORMAT")
    val productId = args("PRODUCT_ID")

    val tableDf = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> tableName, "keyspace" -> keyspaceName))
      .load()
      .where(col("product_id") === productId)

    tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(backupS3)

    Job.commit()

You can find the full export to s3 example and glue create script here. here

1
Erick Ramirez On

I noted that you're using the ALLOW FILTERING clause in your query which suggests that product_id is not the table's partition key.

If you're querying the table without the partition key, the coordinator of the request has to contact all nodes in the cluster to get requested data. There's a good chance that the query is timing out waiting for responses from all nodes so no data is returned.

I suspect it works in cqlsh because it has a higher timeout than the default for the Node.js driver.

You will need to review the application logs and look for warnings or errors from the Node.js driver for clues. Cheers!


Please support the Apache Cassandra community by hovering over the cassandra tag above and click on Watch tag. Thanks!