How to take distinct column values of Rows from big data KQL query output

138 views Asked by At

I'm working on figuring out a way to fetch distinct column values of the KQL output. For example, here is a sample table schema

Id| Code| Name | Area   | Role| Type | DateField
1     a  John   Seattle  MD  | 12   | 12/1/2020
2     a  Mark   Chicago  CC  | 13   | 12/1/2020
2     b  Mark   Seattle  SL  | 14   | 12/1/2020
3     a  Mike   NY       AB  | 12   | 12/1/2020
3     b  Mike   CA       CD  | 15   | 12/1/2023
3     c  Mike   Seattle  MD  | 12   | 12/1/2023


Expected output is a Json property bag of distinct values of each column.

IdName 
[{"Id":"1", "Name": "John"},{"Id":"2", "Name":"Mark"},{"Id":"3", "Name":"Mike"}]

Code 
[{"Code":"A"},{"Code":"B"},{"Code":"C"}]

Area
[{"Area":"Seattle"},{"Area":"Chicago"},{"Area":"NY"},{"Area":"CA"}]

Type
[{"Type":"MD"},{"Type":"CC"},{"Type":"SL"},{"Type":"AB"}, {"Type":"CD"}]

The table is a very large entity and holds more than 7B+ records. My cluster has 60 nodes and all the data is in hot cache.

Below is my sample query being used today and looking around for another optimization suggestions to get the data in the desired output. It is taking about 8-10secs because of the datefiled lookup, the query returns about 300M+ records and then using the bag_pack function to collage everything into json by each column.

  • The first permissions table is a dynamic table for every request that stores the permissions of the user based on Id and permissionCode. These permissions could vary for every customer and could return about 1000 rows.

Here is a sample query that returns about 300 million rows  before using the pack_all() attributes.

  • The output we're trying to generate here is, distinct values of every column (From 300 million records). Hence, there are multiple pack_all() for every column that is required.
// The customers table is massive and holds 7B+ records in realtime.
let Customers = datatable(Id:string, Code:string, Name:string, Area:string, Role:string, Type:string, DateField:datetime )
['1','a','John','Seattle','MD','12',datetime("2020-12-01"),
'2','a','Mark','Chicago','CC','13',datetime("2020-12-01"),
'2','b','Mark','Seattle','SL','14',datetime("2020-12-01"),
'3','a','Mike','NY','AB','12',datetime("2023-12-01"),
'3','b','Mike','CA','CD','15',datetime("2023-12-01"),
'3','c','Mike','Seattle','MD','12',datetime("2023-12-01")];

// This is a dynamic table that is generated at runtime for every query execution and the Ids and codes changes for every request.
let IdCodes  = datatable(Id:string, Code:string)['1','A','1','B','2','A', '2','B','3','A','3','B','3','C'];
let myData = Customers
| where DateField >= datetime(2019-12-01 22:50:15.5050000)
| where DateField <= datetime(2023-12-01 22:50:15.5050000)
// As the Customers table holds the data in lowercase, this is normalized for lookup condition. Not sure, if there is any other optimized way as it seems this takes quite a memory at runtime for conversion.
| extend normalizedCode = toupper(Code)
| lookup kind=inner IdCodes on $left.Id == $right.Id and $left.normalizedCode == $right.Code;
let entity=myData;
let IdName=toscalar(entity|distinct Id,Name |extend dyn=pack_all()
| summarize hint.strategy = shuffle  make_list(dyn));
let Code =toscalar(entity|distinct Code |extend dyn=pack_all()| summarize hint.strategy = shuffle  make_list(dyn));
let Area=toscalar(entity|distinct Area |extend dyn=pack_all()| summarize hint.strategy = shuffle  make_list(dyn));
let Type=toscalar(entity|distinct Type |extend dyn=pack_all()| summarize hint.strategy = shuffle  make_list(dyn));
(print IdName, Code, Area, Type)

Is there any other optimized approach to achieve this output? These requests are called through our APIs and the queries are memory intensive and sometimes causing bottlenecks in our cluster.

Any help is appreciated !

1

There are 1 answers

2
RithwikBojja On

How to take distinct column values of Rows from big data KQL query output,Is there any other optimized approach to achieve this output?

You are going in right direction and Below is the KQL query which worked for me:

let RithTable = 
    datatable(Id:int, Code:string, Name:string, Area:string, Role:string, Type:string, DateField:datetime)
    [
        1, "a", "Rithwik", "Seattle", "MD", "12", datetime(2020-12-01),
        2, "a", "Bojja", "Chicago", "CC", "13", datetime(2020-12-01),
        2, "b", "Bojja", "Seattle", "SL", "14", datetime(2020-12-01),
        3, "a", "Chotu", "NY", "AB", "12", datetime(2020-12-01),
        3, "b", "Chotu", "CA", "CD", "15", datetime(2023-12-01),
        3, "c", "Chotu", "Seattle", "MD", "12", datetime(2023-12-01)
    ];
let IdName = RithTable
    | distinct Id, Name
    | summarize IdName = make_list(pack("Id", Id, "Name", Name));
let Code = RithTable
    | distinct Code
    | summarize Code = make_list(pack("Code", Code));
let Area = RithTable
    | distinct Area
    | summarize Area = make_list(pack("Area", Area));
let Type = RithTable
    | distinct Type
    | summarize Type = make_list(pack("Type", Type));
Code

Output:

enter image description here

Fiddle.

Here, I have given code, you can give IdName, Area and Type too.