SELECT count of subquery before applying LIMIT (clickhouse)

5.6k views Asked by At

I have a subquery that aggregates some UNION ALL selects. Over that I prepare the SELECT to create cross-tab and limit it to let's say 20. I would like to be able to retrieve the total COUNT of sub query results before I am limiting them in main query. This is for the purpose of trying to build a pagination that receives the total number of records and then the specific page record grid.

Sample query:

SELECT 
    name, 
    sumIf(metric_value, metric_name = 'data') AS data,
    sumif(....
FROM
    (SELECT 
         name, metric_name, SUM(metric_value) as metric_value 
     FROM
         (SELECT 
              name, 'data' AS metric_name, SUM(data) AS metric_value 
          FROM 
              table 
          WHERE 
              date > '2017-01-01 00:00:00' 
          GROUP BY 
              name

          UNION ALL

          SELECT 
              name, 'data' AS metric_name, SUM(data) AS metric_value 
          FROM 
              table2 
          WHERE 
              date > '2017-01-01 00:00:00' 
          GROUP BY 
              name

          UNION ALL

          SELECT 
              name, 'data' AS metric_name, SUM(data) AS metric_value 
          FROM 
              table3 
          WHERE 
              date > '2017-01-01 00:00:00' 
          GROUP BY 
              name

          UNION ALL

          .
          .
          .)
    GROUP BY 
        name, metric_name)
GROUP BY 
    name 
ORDER BY 
    name ASC
LIMIT 0,20;

The first subselect returns tons of data, so I thought I can count it and return as one column value, or row and it would propagate to main select that limits 20 results. Because I need to know the entire set of results but don;t want to call the same query twice without limit and with limit just to get COUNT. There are at least 12 UNION ALL third level sub selects, so why waste resources. I am looking to try generic SQL solutions not necessarily related to ClickHouse

I was thinking of using count(*) OVER (), however that is not supported, so if thats only option I know I need to run query twice.

1

There are 1 answers

2
newtover On

The first thing that one should mention is that nobody is usually interested in the exact number of pages on a query. It can be easily estimated and almost no one will care how exact is the estimation. However, if you have a link to the last page in your GUI, people will often click to link just to see whether it works.

Nevertheless, there are cases when an analyst should visit all the pages, and then the GUI should display the exact amount of work. A good news is that in that latter case, a better strategy is to cache a snapshot of the whole results table and counting the rows in the table becomes not a problem anymore.

I mean, it makes sense to discuss with the customers whether they really need it, because unneeded full scans many times per day may have effect on the database load and billing sums.

Anyway, if you still need to estimate the number of rows, you can simplify the query just to count the number of rows. As I understand this is something like:

SELECT SUM(cnt) as row_count
FROM (
    SELECT COUNT(DISTINCT name) as cnt FROM table1 WHERE date > ...
    UNION ALL
    SELECT COUNT(DISTINCT name) as cnt FROM table2 WHERE date > ...
    ...
) as counts;

or if data is a constant metric name

SELECT COUNT(DISTINCT name) as row_count
FROM (
    SELECT DISTINCT name FROM table1 WHERE date > ...
    UNION ALL
    SELECT DISTINCT name FROM table2 WHERE date > ...
    ...
) as names;