I have structured data like this (web visitors):
List(p1,p1,p1,p2,p3,p3,p4,p4,p5...)
One visitor can visit once or many times.
Data volume: about 100 million visits per day.
Which database can I use to store unique visitors for fast (near real-time) access, in a layout like this:
2014-11-15 | p1 | p2 | p3 | ...| pn
I tried to work around this in Cassandra with a table like this:
CREATE TABLE uniqueVisitor (
key text,
p text,
PRIMARY KEY (key, p)
)
I don't think this storage pattern will work very well, because of how the table is partitioned: all data for a given key is stored on a single server (with replication factor = 1), so too many write requests could overwhelm the server that stores that key.
Please suggest a solution (a storage pattern).
You could use a set, as it eliminates duplicates (and has no specific ordering in it). For example,
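A minimal sketch of such a table, assuming one row per day and a set<text> column holding the visitor ids. The table and column names (unique_visitors_by_day, day, visitors) are made up for illustration and are not from your schema:

```
-- Hypothetical table: one row per day, visitors kept in a set column.
CREATE TABLE unique_visitors_by_day (
    day text,
    visitors set<text>,
    PRIMARY KEY (day)
);

-- Adding a visitor is an append to the set; re-adding 'p1' changes nothing.
UPDATE unique_visitors_by_day
SET visitors = visitors + {'p1'}
WHERE day = '2014-11-15';

UPDATE unique_visitors_by_day
SET visitors = visitors + {'p1', 'p2'}
WHERE day = '2014-11-15';

-- Read back the unique visitors for that day.
SELECT visitors FROM unique_visitors_by_day WHERE day = '2014-11-15';
```

Appending to a set does not require reading the row first, so each visit is a plain write.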
You are right: data for a single day would not get distributed; it will live on a single node (and its replicas). Records for different dates will, of course, be distributed across the cluster. So that is a potential write hotspot.
Having said that, I think the write hotspot may not really matter much in this case, as it is a single (though gigantic) record that is being modified. Not every user visit results in disk I/O for the data itself: changes are first made in memory, in a memtable, and only when the memtable is flushed are they written to an SSTable on disk. Data from multiple SSTables is periodically compacted, which has some performance cost, though I imagine it would not kill your application.
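If the hotspot does turn out to matter, one common workaround (an assumption on my part, not something measured against your workload) is to split each day over a fixed number of buckets by putting a bucket number in the partition key. The table name, the bucket column, and the bucket count are all hypothetical here, and the bucket value would be computed client-side, for example as a hash of the visitor id modulo the bucket count:

```
-- Hypothetical bucketed layout: (day, bucket) is the partition key,
-- so one day's writes are spread over several partitions.
CREATE TABLE unique_visitors_bucketed (
    day text,
    bucket int,
    visitor text,
    PRIMARY KEY ((day, bucket), visitor)
);

-- Writing the same (day, bucket, visitor) twice simply overwrites the row,
-- so duplicates are still eliminated.
INSERT INTO unique_visitors_bucketed (day, bucket, visitor)
VALUES ('2014-11-15', 3, 'p1');

-- Reading a whole day means querying each bucket in turn.
SELECT visitor FROM unique_visitors_bucketed
WHERE day = '2014-11-15' AND bucket = 3;
```

The trade-off is that reading all visitors for a day takes one query per bucket instead of a single query.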
In Cassandra 2.1, it is also possible to create indexes on the collection types like SETs.
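As a rough sketch, assuming a table with a set<text> column (the table, column, and index names below are hypothetical), an index on the set lets you ask which days a given visitor appeared on:

```
-- Hypothetical table with a set column, repeated here so the example is self-contained.
CREATE TABLE IF NOT EXISTS unique_visitors_by_day (
    day text,
    visitors set<text>,
    PRIMARY KEY (day)
);

-- Cassandra 2.1+: index the values of the set column.
CREATE INDEX visitors_idx ON unique_visitors_by_day (visitors);

-- Find the days on which visitor 'p1' was seen.
SELECT day FROM unique_visitors_by_day WHERE visitors CONTAINS 'p1';
```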
Hope this helps.