Does count() produces the underlying table it needs to count?

74 views Asked by At

My boss wants me to do a join on three tables, let's call them tableA, tableB, tableC, which have respectively 74M, 3M and 75M rows.

In case it's useful, the query looks like this :

SELECT A.*, 
   C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

When trying the query on Dataiku Data Science Studio + Vertica, it seems to create temporary data to produce the output, which fills up the 1T of space on the server, bloating it.

My boss doesn't know much about SQL, so he doesn't understand that in the worst case scenario, it can produce a table with 74M*3M*75M = 1.6*10^19 rows, possibly being the problem here (and I'm brand new and I don't know the data yet, so I don't know if the query is likely to produce that many rows or not).

Therefore I would like to know if I have a way of knowing beforehand how many rows will be produced : if I did a COUNT(), such as this, for instance :

SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

Does the underlying engine produces the whole dataset, and then counts it ? (which would mean I can't count it beforehand, at least not that way).

Or is it possible that a COUNT() gives me a result ? (because it's not building the dataset but working it out some other way)

(NB : I am currently testing it, but the count has been running for 35mn now)

1

There are 1 answers

0
woot On BEST ANSWER

Vertica is a columnar database. Any query you do only needs to look at the columns required to resolve output, joins, predicates, etc.

Vertica also is able to query against encoded data in many cases, avoiding full materialization until it is actually needed.

Counts like that can be very fast in Vertica. You don't really need to jump through hoops, Vertica will only include columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.

What's probably happening here is that you have hash joins with rebroadcasting. If your underlying projections do not line up and your sorts are different and you are joining multiple large tables together, just the join itself can be expensive because it has to load it all into hash and do a lot of network rebroadcasting of the data to get the joins to happen on the initiator node.

I would consider running DBD using these queries as input, especially if these are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and cause the situation I mention above.

You can do an explain to see what's going on.