store unique visitors in distributed database


I have data structured like this (web visitors):

List(p1,p1,p1,p2,p3,p3,p4,p4,p5...)

One visitor can visit anywhere from one to many times.

Data volume: about 100 million records per day.

Which database can I use to store unique visitors with fast (near real-time) access, in a layout like this:

2014-11-15 | p1 | p2 | p3 | ... | pn

I tried to work around this with Cassandra, using a table like this:

CREATE TABLE uniqueVisitor (
  key text,
  p text,
  PRIMARY KEY (key, p)
);

I don't think this storage pattern will work very well, because:

Because of how this table is partitioned, all data for a given key will be stored on a single node (with replication factor = 1), so too many write requests for that key could overwhelm the node that stores it.
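To illustrate the write path this schema implies (a minimal sketch, assuming the Python cassandra-driver and a local cluster; the contact point and keyspace name are hypothetical), note that every insert carries the same partition key, so every write is routed to the same node:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])        # hypothetical contact point
session = cluster.connect('visits')     # hypothetical keyspace holding uniqueVisitor

insert = session.prepare("INSERT INTO uniqueVisitor (key, p) VALUES (?, ?)")

day = '2014-11-15'
visits = ['p1', 'p1', 'p1', 'p2', 'p3', 'p3', 'p4', 'p4', 'p5']  # sample stream
for visitor in visits:
    # Same partition key (the date) on every row: with replication
    # factor 1, all of these writes land on the one node owning that key.
    session.execute(insert, (day, visitor))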

Please suggest a solution (a storage pattern).


There are 2 answers

Pradyumn (best answer)

You could use a set, as it eliminates duplicates (and has no specific ordering). For example:

CREATE TABLE uniqueVisitor (
  dt text,
  users set<text>,
  PRIMARY KEY (dt)
);

You are right that data for a single day would not get distributed; it will live on a single node (and its replicas). Records for different dates will, of course, be spread across the cluster. So that is a potential write hotspot. Having said that, I think the hotspot may not matter much in this case, since it is a single (though gigantic) record that is being modified. Not every user visit results in disk I/O: changes are first made in memory, in a memtable, and only written to an SSTable when the memtable is flushed to disk. Data from multiple SSTables is periodically compacted, which has some performance cost, though I imagine it would not kill your application.
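As a sketch of what writes might look like against this schema (again assuming the Python cassandra-driver; connection details are hypothetical), adding an element to a set is a plain upsert, so no read is needed and duplicate visits are absorbed automatically:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])        # hypothetical contact point
session = cluster.connect('visits')     # hypothetical keyspace

add_visitor = session.prepare(
    "UPDATE uniqueVisitor SET users = users + ? WHERE dt = ?")

session.execute(add_visitor, ({'p1'}, '2014-11-15'))
session.execute(add_visitor, ({'p1'}, '2014-11-15'))  # duplicate visit: set unchanged
session.execute(add_visitor, ({'p2'}, '2014-11-15'))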

In Cassandra 2.1, it is also possible to create indexes on collection types such as sets.
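For example (a sketch, assuming Cassandra 2.1+ and reusing the session from the sketch above; the index name is arbitrary):

# Index the set's elements, then query membership with CONTAINS.
session.execute("CREATE INDEX IF NOT EXISTS users_idx ON uniqueVisitor (users)")

rows = session.execute(
    "SELECT dt FROM uniqueVisitor WHERE users CONTAINS %s", ('p1',))
for row in rows:
    print(row.dt)   # dates on which visitor p1 appeared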

Hope this helps.

ashic

It is quite common, when dealing with large-volume data streams, to sacrifice some accuracy for efficiency. There are algorithms that estimate the number of uniques in a large data stream. They require much less space than storing each unique value, need far less processing (they can even run in memory on a single node, or a few nodes), and provide results with at least 50% accuracy (and a lot more if you do more work). Have a look at the Flajolet-Martin algorithm and (better) the Alon-Matias-Szegedy (AMS) algorithm. You can find brief descriptions here: http://www.st.ewi.tudelft.nl/~hauff/BDP-Lectures/3_streams.pdf and a detailed analysis in Prof. Ullman et al.'s book, freely available at http://mmds.org/. I believe it's chapter 4 that covers stream processing quite nicely.
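To make the idea concrete, here is a minimal sketch of the Flajolet-Martin estimator in Python (a single hash function, so the estimate is a rough power of two; real deployments average many such estimators, which is essentially what HyperLogLog does):

import hashlib

def trailing_zeros(n: int) -> int:
    # Number of trailing zero bits in n (taken as 0 for n == 0).
    if n == 0:
        return 0
    return (n & -n).bit_length() - 1

def fm_estimate(stream) -> int:
    # Hash each item and track R, the largest count of trailing zeros
    # seen; roughly 2**R distinct items must pass by before some hash
    # ends in R zeros, so 2**R estimates the number of distinct items.
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(item.encode('utf-8')).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

print(fm_estimate(['p1', 'p1', 'p1', 'p2', 'p3', 'p3', 'p4', 'p4', 'p5']))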