I'm using a sharded set where my table is described like this This is my _all_all table:
ENGINE = Distributed('level2_cluster', 'datas', 'data_animals_0_60_all', rand())
This is my _all table:
ENGINE = Distributed('level3_cluster', 'datas', 'data_animals_0_60', rand())
This is my normal table_60:
ENGINE = MergeTree
PARTITION BY datadate
PRIMARY KEY datatime
ORDER BY datatime
SETTINGS index_granularity = 8192
My cluster relationship is described like this:
┌─cluster──────────┬─shard_num─┐
│ level3_cluster │ 1 │
│ level2_cluster │ 1 │
│ level2_cluster │ 2 │
└──────────────────┴───────────┘
They are used to describe data at different levels. Because each node may have multiple nodes, we can use the _all table of level3 to obtain the overall data of level3.
wher i use:
select _shard_num,hostname,count() from data_animals_0_60_all_all group by _shard_num,hostname ;
i get this:
┌─_shard_num─┬─hostname──────────┬─count()─┐
│ 1 │ HOST_NAME_150_165 │ 2400 │
│ 1 │ HOST_NAME_153_165 │ 3900 │
└────────────┴───────────────────┴─────────┘
(I am very clear about the test data I wrote. 150 refers to the data written at the 0.150 node, and 153 refers to the data written at the 0.153 node. As for the suffix 165, it is fixed and we can ignore it.)
The results show that I cannot use _shard_num to get the node from which the data comes, because the node it returns is the cluster configuration of the next level _all table (that is, level3) Of course, I also know that the business layer can manage a unique node ID for use as a dimension, but that is a very huge modification for us.
I hope that I can query level2 to describe the return of relevant data on all nodes. This is a minimal change for me. Is there any other description that can let me know that the data comes from a certain node?
For me, the worst case scenario is to increase the workload to reach the so-called node ID of the node. I am very much looking forward to a better solution, please help me.