Store unique visitors in a distributed database


I have data structured like this (web visitors):

List(p1,p1,p1,p2,p3,p3,p4,p4,p5...)

One visitor can visit anywhere from one to many times.

Data volume: about 100 million visits per day.

Which database can I use to store unique visitors for fast (near real-time) access, like this:

2014-11-15 | p1 | p2 | p3 | ... | pn

I tried to work around this with Cassandra, using a table like this:

CREATE TABLE uniqueVisitor (
  key text,
  p text,
  PRIMARY KEY (key, p)
);

I think this storage pattern will not work very well, because:

Because of how this table is partitioned, all data for a key will be stored on a single server (with replication factor = 1), so too many write requests could overwhelm the server that holds that key.

Please suggest a solution (storage pattern).


There are 2 answers

Pradyumn (best answer):

You could use a set, as it eliminates duplicates (and has no specific ordering in it). For example,

CREATE TABLE uniqueVisitor (
  dt text,
  users set<text>,
  PRIMARY KEY (dt)
);
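For example, writes and reads against this table could look like the following. This is a sketch assuming the DataStax Python driver; the contact point and the keyspace name "web" are placeholders.

from cassandra.cluster import Cluster

# Placeholders: adjust the contact point and keyspace to your cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("web")

# Recording a visit is an idempotent set addition; the set absorbs
# duplicates, so replaying the same visitor is harmless.
session.execute(
    "UPDATE uniqueVisitor SET users = users + {'p1'} WHERE dt = '2014-11-15'"
)

# Counting the day's uniques means reading the whole set back.
row = session.execute(
    "SELECT users FROM uniqueVisitor WHERE dt = '2014-11-15'"
).one()
print(len(row.users))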

You are right that data for a single day would not get distributed; it will live on a single node (and its replicas). Records for different dates will, of course, get distributed, so this is a potential write hotspot. Having said that, I think the write hotspot may not really matter much in this case, as it is a single (though gigantic) record that is being modified. Not every user visit will result in disk I/O, though: changes are first made in memory, in memtables, and only when a memtable is flushed to disk is the data written to an SSTable. Data from multiple SSTables will periodically get compacted, which has some performance cost, though I imagine it would not kill your application.

In Cassandra 2.1, it is also possible to create indexes on collection types such as sets.
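For instance (again a sketch, reusing the hypothetical session from the example above), an index on the set column allows CONTAINS queries:

# Requires Cassandra 2.1+; "session" is the connection from the previous sketch.
session.execute("CREATE INDEX ON uniqueVisitor (users)")

# The index enables CONTAINS queries, e.g. "which days did p1 visit?"
for row in session.execute(
    "SELECT dt FROM uniqueVisitor WHERE users CONTAINS 'p1'"
):
    print(row.dt)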

Hope this helps.

ashic:

It is quite common, when dealing with high-volume data streams, to sacrifice some accuracy for efficiency. There are algorithms that estimate the number of uniques in a large data stream. They require much less space than simply storing each unique value, need far less processing (it can even be done in memory on a single node, or a few nodes), and provide results with at least 50% accuracy (and a lot more if you do more work). Have a look at the Flajolet-Martin algorithm and (better) the Alon-Matias-Szegedy (AMS) algorithm. You can find brief descriptions at http://www.st.ewi.tudelft.nl/~hauff/BDP-Lectures/3_streams.pdf and a detailed analysis in Prof. Ullman et al.'s book, freely available at http://mmds.org/ ; I believe chapter 4 covers stream processing quite nicely. A minimal sketch of the Flajolet-Martin idea follows below.
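To make the idea concrete, here is a minimal single-hash Flajolet-Martin sketch in Python. It is purely illustrative: the choice of MD5, the 32-bit truncation, and the class layout are my assumptions, and a production version would average over many hash functions (or use a successor such as HyperLogLog).

import hashlib

def trailing_zeros(n: int) -> int:
    # Number of trailing zero bits in a 32-bit hash value.
    if n == 0:
        return 32
    return (n & -n).bit_length() - 1

class FlajoletMartin:
    # Illustrative single-hash sketch; real deployments average many hashes.
    PHI = 0.77351  # correction constant from the original FM paper

    def __init__(self):
        self.max_r = 0  # longest run of trailing zeros seen so far

    def add(self, item: str) -> None:
        # Truncate the MD5 digest to 32 bits (an arbitrary choice here).
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")
        self.max_r = max(self.max_r, trailing_zeros(h))

    def estimate(self) -> float:
        # FM estimate of the number of distinct items: 2^R / phi.
        return (2 ** self.max_r) / self.PHI

fm = FlajoletMartin()
for visitor in ["p1", "p1", "p1", "p2", "p3", "p3", "p4", "p4", "p5"]:
    fm.add(visitor)
print(fm.estimate())

The appeal for the use case in the question is that the state is a single small integer per hash function, no matter how many of the 100 million daily visits stream through, so there is no giant per-day record and no write hotspot.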