Best way to store millions of records a day that can be grouped for statistics purposes?

I'm developing a custom tracking tool for marketing campaigns. This tool is in the middle between the ads and the landing pages. It takes care of saving all data from the user, such as the info in the user-agent, the IP, the clicks on the landing page and the geocoding data of the IPs of the users (country, ISP, etc).

At the moment I have some design issues:

  • The traffic on these campaigns is very high, so potentially I have millions of rows inserted per day. This system can have more than one user, and I can't store all this data in a single table because it would become a mess. Maybe I could split the data into multiple tables, one table per user, but I'm not sure about this solution.
  • The data saving process must be done as quickly as possible (a few milliseconds), so I think that NodeJS is much better than PHP for doing this, especially with regard to speed and server resources. I do not want the server to crash from lack of RAM.
  • I need to group this data for statistics purposes. For example, I have one row for every user that visits my landing page, but I need to group this data to show the number of impressions on that specific landing page. So all these queries need to be executed as fast as possible over this large number of rows.
  • I need to geocode the IP addresses, so I need accurate information like the country, the ISP, the type of connection, etc., but this can slow down the data saving process if I call an API service. And this must be done in real time; it can't be done later.

After the saving process, the system should redirect to the landing page. Time matters here so as not to lose any possible lead.

Basically, I'm looking for the best solutions for:

  • Efficiently managing a very large database
  • Saving user data in the shortest time possible (milliseconds)
  • If possible, geocoding an IP in the shortest time possible, without blocking execution
  • Optimizing the schema and the queries for generating statistics

Do you have any suggestions? Thanks in advance.

1 Answer

Answer by Rick James

One table per user is a worse mess; don't do that.

Millions of rows a day -- dozens, maybe hundreds, per second? That probably requires some form of 'staging' -- collecting multiple rows, then batch-inserting them. Before discussing further, please elaborate on the data flow: Single vs. multiple clients. UI vs. batch processes. Tentative CREATE TABLE. Etc.
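As a minimal sketch of the staging idea (all table and column names here are invented for illustration, and the exact column list depends on your data): the tracker writes multi-row INSERTs into a small staging table, and a periodic job atomically swaps in an empty one and flushes the old one into the main table.

-- Hypothetical staging table; adjust columns to your actual payload.
CREATE TABLE clicks_staging (
    dt           DATETIME NOT NULL,
    landing_page SMALLINT UNSIGNED NOT NULL,
    ip           INT UNSIGNED NOT NULL,       -- IPv4 packed with INET_ATON()
    ua_id        MEDIUMINT UNSIGNED NOT NULL, -- normalized user-agent id
    country      CHAR(2) NOT NULL
) ENGINE=InnoDB;

-- The tracker batches dozens to hundreds of rows per INSERT:
INSERT INTO clicks_staging (dt, landing_page, ip, ua_id, country)
VALUES
    (NOW(), 7, INET_ATON('203.0.113.9'), 42, 'US'),
    (NOW(), 7, INET_ATON('198.51.100.4'), 17, 'DE');

-- A periodic job swaps in an empty staging table, then flushes the old one:
CREATE TABLE clicks_staging_new LIKE clicks_staging;
RENAME TABLE clicks_staging TO clicks_staging_old,
             clicks_staging_new TO clicks_staging;
INSERT INTO clicks (dt, landing_page, ip, ua_id, country)
    SELECT dt, landing_page, ip, ua_id, country FROM clicks_staging_old;
DROP TABLE clicks_staging_old;

Batching many rows per INSERT typically cuts per-row overhead by an order of magnitude compared with single-row INSERTs, which is what makes "millions a day" manageable.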

Statistical -- Plan on creating and incrementally maintaining "Summary tables".
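As a hedged illustration of what "incrementally maintaining" can look like (reusing the hypothetical names from the staging sketch above), a per-landing-page, per-hour summary can be updated as part of the same flush job:

CREATE TABLE daily_stats (
    dy           DATE NOT NULL,
    hr           TINYINT UNSIGNED NOT NULL,
    landing_page SMALLINT UNSIGNED NOT NULL,
    impressions  INT UNSIGNED NOT NULL,
    PRIMARY KEY (landing_page, dy, hr)
) ENGINE=InnoDB;

-- Run against each batch as it is flushed from staging:
INSERT INTO daily_stats (dy, hr, landing_page, impressions)
    SELECT DATE(dt), HOUR(dt), landing_page, COUNT(*)
        FROM clicks_staging_old
        GROUP BY DATE(dt), HOUR(dt), landing_page
ON DUPLICATE KEY UPDATE impressions = impressions + VALUES(impressions);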

Are you trying to map user IP addresses to Country? That is a separate question, and it has been answered.
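One widely used approach (a sketch only, not necessarily what that answer prescribes) is to keep the external API off the hot path entirely: load a GeoIP CSV dump (e.g. MaxMind GeoLite) into a local range table and resolve the country with a single indexed lookup at insert time.

-- Hypothetical range table loaded from a GeoIP CSV dump:
CREATE TABLE ip2country (
    ip_from INT UNSIGNED NOT NULL,   -- start of range, in INET_ATON() form
    ip_to   INT UNSIGNED NOT NULL,
    country CHAR(2) NOT NULL,
    PRIMARY KEY (ip_from)
) ENGINE=InnoDB;

-- Lookup for one address (optionally verify ip_to if the ranges have gaps):
SELECT country
    FROM ip2country
    WHERE ip_from <= INET_ATON('203.0.113.7')
    ORDER BY ip_from DESC
    LIMIT 1;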

"Must" "real-time" "milliseconds". Face it, you will have to make some trade-offs.

More info: Go to http://mysql.rjweb.org/ ; from there, see the three blogs on Data Warehouse Techniques.

How to store by day

InnoDB stores data in PRIMARY KEY order. So, to get all the rows for one day adjacent to each other, one must start the PK with the datetime. For huge databases, this may improve certain queries significantly by allowing them to scan the data sequentially, thereby minimizing disk I/O.

If you already have id AUTO_INCREMENT (and if you continue to need it), then do this:

PRIMARY KEY(datetime, id),  -- to get clustering, and be UNIQUE
INDEX(id)  -- to keep AUTO_INCREMENT happy

If you have a year's worth of data, and the data won't fit in RAM, then this technique is very effective for small time ranges. But if your time range is bigger than the cache, you will be at the mercy of I/O speed.
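Putting that together, a hedged sketch of what the main clicks table from the staging example above might look like (the non-key columns are illustrative):

CREATE TABLE clicks (
    dt           DATETIME NOT NULL,
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    landing_page SMALLINT UNSIGNED NOT NULL,
    ip           INT UNSIGNED NOT NULL,
    ua_id        MEDIUMINT UNSIGNED NOT NULL,
    country      CHAR(2) NOT NULL,
    PRIMARY KEY (dt, id),   -- clusters rows by time; id keeps it UNIQUE
    INDEX (id)              -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB;

-- A one-day scan now reads consecutive blocks instead of random pages:
SELECT COUNT(*)
    FROM clicks
    WHERE dt >= '2016-03-01' AND dt < '2016-03-01' + INTERVAL 1 DAY;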

Maintaining summary tables with changing data

This may be possible; I need to better understand the data and the changes.

You cannot scan a million rows in sub-second time, regardless of caching, tuning, and other optimizations. But you can get the desired statistics much faster from a Summary table.
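To make the contrast concrete (reusing the hypothetical daily_stats table sketched earlier), a per-day impressions report for one landing page touches at most a few hundred summary rows instead of millions of raw click rows:

-- Daily impressions for one landing page over a month:
SELECT dy, SUM(impressions) AS impressions
    FROM daily_stats
    WHERE landing_page = 7
      AND dy >= '2016-03-01' AND dy < '2016-04-01'
    GROUP BY dy;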

Shrink the data

  • Don't use BIGINT (8 bytes) if INT (4 bytes) will suffice; don't use INT if MEDIUMINT (3 bytes) will do. Etc.
  • Use UNSIGNED where appropriate.
  • Normalize repeated strings.

Smaller data is more cacheable, and hence faster when you do have to hit the disk.
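A hedged before/after on the shrinking and normalization points (the user_agents table is invented for this example): store each repeated string once and reference it with a small integer.

-- Before: ~200 bytes per row just for raw strings
--   user_agent VARCHAR(255), ip VARCHAR(45), country VARCHAR(64)

-- After: the string lives once in a lookup table ...
CREATE TABLE user_agents (
    ua_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ua    VARCHAR(255) NOT NULL,
    PRIMARY KEY (ua_id),
    UNIQUE KEY (ua)   -- older MySQL versions may need a prefix index here
) ENGINE=InnoDB;

-- Insert-if-missing, then fetch the id to store in the fact table:
INSERT IGNORE INTO user_agents (ua) VALUES ('Mozilla/5.0 (...)');
SELECT ua_id FROM user_agents WHERE ua = 'Mozilla/5.0 (...)';

With that in place, the fact table carries a 3-byte ua_id instead of a long string, the IP fits in INT UNSIGNED via INET_ATON() (IPv6 needs BINARY(16) via INET6_ATON()), and the country fits in CHAR(2).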