I'm trying to figure out a way to fetch the distinct values of each column from a KQL query's output. For example, here is a sample table schema:
Id | Code | Name | Area    | Role | Type | DateField
1  | a    | John | Seattle | MD   | 12   | 12/1/2020
2  | a    | Mark | Chicago | CC   | 13   | 12/1/2020
2  | b    | Mark | Seattle | SL   | 14   | 12/1/2020
3  | a    | Mike | NY      | AB   | 12   | 12/1/2020
3  | b    | Mike | CA      | CD   | 15   | 12/1/2023
3  | c    | Mike | Seattle | MD   | 12   | 12/1/2023
The expected output is, for each column, a JSON array of property bags holding its distinct values:
IdName
[{"Id":"1", "Name": "John"},{"Id":"2", "Name":"Mark"},{"Id":"3", "Name":"Mike"}]
Code
[{"Code":"A"},{"Code":"B"},{"Code":"C"}]
Area
[{"Area":"Seattle"},{"Area":"Chicago"},{"Area":"NY"},{"Area":"CA"}]
Type
[{"Type":"MD"},{"Type":"CC"},{"Type":"SL"},{"Type":"AB"}, {"Type":"CD"}]
The table is very large and holds 7B+ records. My cluster has 60 nodes and all the data is in the hot cache.
Below is the sample query being used today; I'm looking for further optimization suggestions to get the data in the desired output. It takes about 8-10 seconds because of the DateField lookup; the query returns about 300M+ records, and pack_all() is then used to collate everything into JSON per column.
- The first table, the permissions table (IdCodes below), is a dynamic table built for every request; it stores the user's permissions keyed by Id and permission code. These permissions can vary per customer and can return about 1,000 rows.
Here is a sample query that returns about 300 million rows before pack_all() is applied.
- The output we're trying to generate is the distinct values of every column (from those 300 million records); hence there is a pack_all() per required column.
// The Customers table is massive and holds 7B+ records in real time.
let Customers = datatable(Id:string, Code:string, Name:string, Area:string, Role:string, Type:string, DateField:datetime)
[
    '1', 'a', 'John', 'Seattle', 'MD', '12', datetime(2020-12-01),
    '2', 'a', 'Mark', 'Chicago', 'CC', '13', datetime(2020-12-01),
    '2', 'b', 'Mark', 'Seattle', 'SL', '14', datetime(2020-12-01),
    '3', 'a', 'Mike', 'NY', 'AB', '12', datetime(2023-12-01),
    '3', 'b', 'Mike', 'CA', 'CD', '15', datetime(2023-12-01),
    '3', 'c', 'Mike', 'Seattle', 'MD', '12', datetime(2023-12-01)
];
// This is a dynamic table generated at runtime for every query execution; the Ids and codes change with every request.
let IdCodes = datatable(Id:string, Code:string)
[
    '1','A', '1','B',
    '2','A', '2','B',
    '3','A', '3','B', '3','C'
];
let myData = Customers
| where DateField >= datetime(2019-12-01 22:50:15.5050000)
| where DateField <= datetime(2023-12-01 22:50:15.5050000)
// The Customers table stores Code in lowercase, so it is normalized here for the lookup condition.
// Not sure if there is a more optimized way; this conversion seems to take quite a bit of memory at runtime.
| extend normalizedCode = toupper(Code)
| lookup kind=inner IdCodes on $left.Id == $right.Id, $left.normalizedCode == $right.Code;
let entity = myData;
let IdName = toscalar(entity | distinct Id, Name | extend dyn = pack_all() | summarize hint.strategy=shuffle make_list(dyn));
let Code = toscalar(entity | distinct Code | extend dyn = pack_all() | summarize hint.strategy=shuffle make_list(dyn));
let Area = toscalar(entity | distinct Area | extend dyn = pack_all() | summarize hint.strategy=shuffle make_list(dyn));
let Type = toscalar(entity | distinct Type | extend dyn = pack_all() | summarize hint.strategy=shuffle make_list(dyn));
(print IdName, Code, Area, Type)
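One direction I have been sketching, but have not benchmarked (the IdCodesLower name and the rewrite below are my own, not part of the query above): normalize the ~1,000-row IdCodes table instead of extending 300M+ rows with toupper(), and wrap the filtered join in materialize() so it is computed once rather than once per toscalar() call.

// Sketch only: lowercase the small table (cheap) instead of uppercasing the big one,
// and cache the join result so the four per-column aggregations reuse it.
let IdCodesLower = IdCodes | extend Code = tolower(Code);
let myData = materialize(
    Customers
    | where DateField between (datetime(2019-12-01 22:50:15.5050000) .. datetime(2023-12-01 22:50:15.5050000))
    | lookup kind=inner IdCodesLower on Id, Code
);

If the output must show uppercase codes, toupper() can be applied after the distinct step, when only a handful of rows remain.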
Is there a more optimized approach to achieve this output? These requests are called through our APIs; the queries are memory intensive and sometimes cause bottlenecks in our cluster.
Any help is appreciated!
You are going in the right direction. Below is the KQL query that worked for me:
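A minimal sketch of such a query for the Code column (a reconstruction based on the surrounding description, using the same distinct-and-pack pattern as the question, with bag_pack() naming the output key explicitly; the exact original may have differed):

// Sketch for the Code column only -- reconstructed, not the verbatim answer.
// myData is the filtered/joined table from the question.
let entity = myData;
let Code = toscalar(entity
    | distinct Code
    | extend dyn = bag_pack("Code", Code)
    | summarize make_list(dyn));
print Code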
Output: a JSON array of the distinct Code values, as in the expected Code output above.
Here I have shown the Code column; you can produce IdName, Area, and Type the same way.
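For completeness, here is a sketch of the same pattern applied to the remaining columns (the column names come from the question; this extension is mine, not the original answer's):

// Same pattern for the other columns (sketch; names taken from the question).
let entity = myData;
let IdName = toscalar(entity | distinct Id, Name | extend dyn = pack_all() | summarize make_list(dyn));
let Area = toscalar(entity | distinct Area | extend dyn = bag_pack("Area", Area) | summarize make_list(dyn));
let Type = toscalar(entity | distinct Type | extend dyn = bag_pack("Type", Type) | summarize make_list(dyn));
print IdName, Area, Type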