Batch job to process data on Cassandra


I want to store the following data in a NoSQL database. I expect a lot of writes but few reads (only in a batch job, when we need to pull a report), so I chose Cassandra. As the data format below shows, each record contains multiple comma-separated items viewed by a person in a particular session. I currently store one row per item, as shown in the "Data Stored in Cassandra" section. My problem: suppose I want to pull a report of all records where the filter contains Category=10 or city=200. Given this table schema, how would I apply something like a LIKE clause, or split that column, in Cassandra? Or do I need to store the data in a different form, or in some other NoSQL database where I can pull reports more easily?

Data Input:

   "Cookie":    "Ty44EnySoklz3456fdseses"
   "Session":   "vmt2Z2EpHQ"
   "ItemId":    "812781,681091,672396,632596,772796,704596"
   "Referer":   "RefererValue"
   "Filter":    "city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4"
   "Impression": 1
   "DetailsView":1
   "PhotoView":  0
   "Response":   1
   "ShortListItems": "812781,681091,672396"

Data Stored in Cassandra:

cookie session ItemID Referer Filter Impression DetailsView PhotoView Response ShortListItems
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 812781 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 681091 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 672396 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 632596 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 772796 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396
Ty44EnySoklz3456fdseses vmt2Z2EpHQ 704596 RefererValue city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1&color=12+7&owners=2+1&year=0-6&budget=2-4 1 1 0 1 812781,681091,672396

1 answer

Jim Meyer On

Basic Cassandra does not support a LIKE clause, and it isn't very good at ad hoc queries. So if you want to use CQL to access this data, you need to design your Cassandra schema to support the exact queries you plan to make. For example, if you want to query on the category value, you might make category a clustering column, and then you could do range queries on it. For other queries you might keep parallel tables that use different fields for the keys.
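To make the "design the schema around the query" idea concrete, here is a minimal Python sketch of the write-side step: split the Filter string once, when the record arrives, and emit one row per (category, item) pair that could go into a table keyed by category. The function names and the output row shape are my own assumptions for illustration, not anything Cassandra prescribes.

```python
from urllib.parse import parse_qs

def parse_filter(filter_string):
    """Split 'city=3001&Category=10+1' into {'city': ['3001'], 'Category': ['10', '1']}."""
    # parse_qs decodes '+' as a space, so a multi-valued field arrives as '10 1'.
    parsed = parse_qs(filter_string)
    return {key: values[0].split() for key, values in parsed.items()}

def rows_by_category(record):
    """Expand one input record into (category, item_id, session) rows.

    The row shape is hypothetical: it matches a table whose key starts with category.
    """
    categories = parse_filter(record["Filter"]).get("Category", [])
    item_ids = record["ItemId"].split(",")
    return [
        (category, item_id, record["Session"])
        for category in categories
        for item_id in item_ids
    ]

# Sample record taken from the question's "Data Input" section.
record = {
    "Session": "vmt2Z2EpHQ",
    "ItemId": "812781,681091,672396,632596,772796,704596",
    "Filter": "city=3001&filterbyadditional=2+4+3&ItemType=2&Category=10+1"
              "&color=12+7&owners=2+1&year=0-6&budget=2-4",
}
rows = rows_by_category(record)
```

The trade-off is more writes per input record (here 2 categories times 6 items), which fits a write-heavy, read-light workload.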

But since you mention running a batch job to generate reports, you would probably want to run a MapReduce-style operation on your raw table data. One of the most popular ways to do this is to use Apache Spark with Cassandra. With the Spark Cassandra connector, you can read Cassandra table data into a Spark RDD, then run transformations on that data (for example, to filter rows based on the category or some other value).
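The filtering step itself could look roughly like the sketch below. It is plain Python so it runs without a cluster; the predicate `has_filter_value` is a name I made up, and the same function could be passed to an RDD's `filter` transformation once the rows are loaded. The second sample row is invented to show a non-matching case.

```python
from urllib.parse import parse_qs

def has_filter_value(row, key, value):
    """Return True if the row's Filter string lists `value` under `key`."""
    parsed = parse_qs(row["Filter"])
    # '+' decodes to a space, so 'Category=10+1' yields the values ['10', '1'].
    values = parsed.get(key, [""])[0].split()
    # Exact match on each value, so asking for '1' does not match '10'.
    return value in values

rows = [
    {"ItemId": "812781", "Filter": "city=3001&Category=10+1"},
    {"ItemId": "681091", "Filter": "city=200&Category=4"},  # hypothetical row
]

category_10 = [r for r in rows if has_filter_value(r, "Category", "10")]
city_200 = [r for r in rows if has_filter_value(r, "city", "200")]
```

Comparing parsed values rather than doing a substring search is the main point here: a LIKE-style match on "Category=1" would also hit "Category=10".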

If you take that approach, you would want to partition your table data in some reasonable way, for example by date, so that Spark reads only the partitions a report needs instead of doing a full table scan.
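As a rough sketch of date partitioning, assuming each record carries an event timestamp (the sample data in the question does not show one, so that field is an assumption): compute a day bucket at write time to use as the partition key, and have the report job enumerate only the buckets in its date range. Both helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

def day_bucket(event_time):
    """Partition key for a record: the UTC calendar day as 'YYYY-MM-DD'."""
    return event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")

def buckets_for_range(start, end):
    """List the day buckets a report covering start..end (inclusive) must read."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]

# A record written late on 1 March 2015 (UTC) lands in that day's partition.
bucket = day_bucket(datetime(2015, 3, 1, 23, 30, tzinfo=timezone.utc))

# A three-day report touches exactly three partitions.
needed = buckets_for_range(date(2015, 3, 1), date(2015, 3, 3))
```

A day is only one possible bucket size; with very high write volume you might need a finer bucket or an extra key component to keep partitions from growing too large.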