convert SQL output in databricks from long to wide

114 views Asked by At

I have a very simple outcome in databricks that has 80M lines of data because there is a separate line for each PIN. The data is a listing by PIN with columns "pin", "device category", "device name". a PIN may have 1 or 3 or n devices and there are two possible device categories. I need to convert this from long to wide so I can have a distinct PIN with columns with the devices so I can use this to cluster. All the examples have distinct column names to plug in to the pivot IN statement. This doesn't need to be fancy where the column names change dynamically - they can just be "device1", "device2", "device3", "device4", etc. Any help you can provide would be greatly appreciated!

note: the column names "device1" etc do not need to be "devicen..". For example R just assigns column names. This what I am looking for here. However many columns are needed it just assigns them

Start at this - long

machid device
123 A
456 B
123 C

Needs to be this - wide

machid d1 d2 d3 d4
123 A C
456 B
1

There are 1 answers

0
June7 On

This is syntax that worked for me. I tested in Access pass-through query and in SSMS. I tried testing at SQLFiddle.com so I could provide a demo but get error "Must declare the scalar variable "@cols"." However, not seeing STUFF() nor STRING_AGG() nor QUOTENAME() functions in Databricks documentation and not finding equivalents.

SET NoCount ON;

DECLARE @cols AS NVARCHAR(MAX), @sql AS NVARCHAR(MAX);

WITH GS AS(
SELECT machid, device, 
Row_Number() OVER(PARTITION BY machid ORDER BY machid, device) As GrpSeq 
FROM table1)

SELECT @cols = STUFF(
(SELECT ',' + QUOTENAME(G) AS [text()]
FROM (SELECT DISTINCT GrpSeq AS G FROM GS) AS G
FOR XML PATH('')), 1, 1, '');

SELECT @sql = 'SELECT *
       FROM (SELECT machid, device, 
             Row_Number() OVER(PARTITION BY machid ORDER BY machid, device) As GrpSeq FROM table1) AS Q
       PIVOT(Min(device) FOR GrpSeq IN(' + @cols + ')) AS P;';

EXEC sp_executesql @sql;

And this version also worked:

SET NoCount ON;

DECLARE @cols AS NVARCHAR(MAX), @sql AS NVARCHAR(MAX);

WITH GS AS(
SELECT machid, device, QUOTENAME(Row_Number() OVER(PARTITION BY machid ORDER BY machid, device)) As GrpSeq FROM Table1),
DS AS
(SELECT DISTINCT GrpSeq FROM GS),
CS AS
(SELECT STRING_AGG(GrpSeq,',') AS G FROM DS)

SELECT @cols = G FROM CS

SELECT @sql = 'SELECT *
            FROM (SELECT machid, device, Row_Number() OVER(PARTITION BY machid ORDER BY machid, device) As GrpSeq FROM Table1) AS Q
            PIVOT(Min(machid) FOR GrpSeq IN(' + @cols + ')) AS P;';

EXEC sp_executesql @sql;

In either, if I try concatenating any text to Row_Number expression, it fails.