count(Distinct) windowing function alternative

91 views Asked by At

I have written a query to take compare of columns/data of 2 database table in oracle which will print records where there is any difference in rows 2 tables based on key columns (fields in partition by clause) and if there is no difference then for any particular field then it will print null and for fields where there is difference then it will print there values (where row_cnt = 2) and if there is some extra record in any of table then it also print it (where row_cnt =1).

Sample script --

create table src_tab (id number,name varchar2(10), job varchar2(10), sal number);

create table tgt_tab (id number,name varchar2(10), job varchar2(10), sal number);

insert into src_tab values (1, 'john', 'not sure', 100);
insert into src_tab values (2, 'kevin', 'unemp', 200);
insert into src_tab values (3, 'chad', 'emplyd', 400);
insert into src_tab values (4, 'page', 'NA', 500);
insert into src_tab values (5, 'emmy', 'desk', 600);

insert into tgt_tab values (1, 'john', 'not sure', 150);
insert into tgt_tab values (2, 'kevin', 'unemp', 200);
insert into tgt_tab values (3, 'chad', 'empl', 400);
insert into tgt_tab values (4, 'page', 'NA', 500); 

Comparison query --
Select * from (select id,name,
  case NEW_CNT when 1 then 'TGT_TAB' else 'SRC_TAB' end tbl,
  ROW_CNT,
  case when count(job) over(partition by id,name)
    between 1 and count(distinct job) over(partition by id,name)
    then job end job,
  case when count(sal) over(partition by id,name)
    between 1 and count(distinct sal) over(partition by id,name)
    then sal end sal
FROM (
  select
    id,name, job, sal,
    sum(NEW_CNT) NEW_CNT, count(*) over(partition by id,name) ROW_CNT
  FROM (
    select 
    id,name, job, sal,
    -1 NEW_CNT
    from src_tab O
    union all
    select
    id,name, job, sal,
    1 NEW_CNT
    from tgt_tab N 
  )
  group by
    id,name, job, sal
  having sum(NEW_CNT) != 0
)
order by 1, 2, new_cnt)
;

Since this query is written for oracle 19c database now I require to run this query on Sybase database but when I tried it on Sybase then I found out that COUNT(DISTINCT) windowing function is not allowed there so now looking for alternative ways to print the same output as it is on Sybase I tried using lead/lag function since it is supported on Sybase but the output is not good.

Alternative Tried --
case when sal <>
    lead(sal,1,sal) over (partition by id,name order by id) then sal  end sal_lead

Expected Result --

ID  NAME    TBL     ROW_CNT JOB      SAL
1   john    SRC_TAB   2     null     100
1   john    TGT_TAB   2     null     150
3   chad    SRC_TAB   2     emplyd   null
3   chad    TGT_TAB   2     empl     null
5   emmy    SRC_TAB   1     desk     600

We are using both Sybase DB ASE and IQ if the query can be compatible with Sybase-IQ then also it will good. Please let me know if more information is required from my side to understand the query I have written in orcale.

Edit 1: One Alternative I found which is working on Sybase-IQ is as below if someone has more good approach please do comment !!

with t1 as (
             select  'SRC_TAB' tbl,
                     s.*
               from  src_tab s
            union all
             select  'TGT_TAB' tbl,
                     t.*
               from  tgt_tab t
           ),
     t2 as (
            select  t1.*,
                    case max(job) over(partition by id,name)
                      when min(job) over(partition by id,name) then 1
                      else 0
                    end same_job_flag,
                    case max(sal) over(partition by id,name)
                      when min(sal) over(partition by id,name) then 1
                      else 0
                    end same_sal_flag,
                    count(*) over(partition by id,name) row_cnt
              from  t1
            )            
select  id,
        name,
        tbl,
        row_cnt,
        case
          when row_cnt = 1 then job 
          when same_job_flag = 0 then job
        end job,
        case
          when row_cnt = 1 then sal 
          when same_sal_flag = 0 then sal
        end sal
  from  t2
  where same_job_flag = 0
     or same_sal_flag = 0
     or row_cnt = 1
  order by id,
           name,
           tbl;

Edit 2 : Oracle Procedure to Generate Comparison query (query -2, Alternative which I have written). I have written same procedure for my 1st Oracle query as well with COUNT(DISTINCT) one :)

CREATE OR REPLACE PROCEDURE pr_compare_table2(
                                              p_key_coln  VARCHAR2,
                                              p_comp_coln VARCHAR2,
                                              p_src_tab   VARCHAR2,
                                              p_tgt_tab   VARCHAR2,
                                              p_src_whr   VARCHAR2,
                                              p_tgt_whr   VARCHAR2
                                              )
AS

  l_sql    CLOB;
  l_sql1   CLOB;
  l_sql2   CLOB;
  l_sql3   CLOB;
  l_length NUMBER;

BEGIN 
  
  FOR i IN (SELECT REGEXP_SUBSTR(p_comp_coln,'[^,]+',1,LEVEL) value
              FROM dual
           CONNECT BY LEVEL <= LENGTH(p_comp_coln) - LENGTH(REPLACE(p_comp_coln, ',' )) +1)
  LOOP
  
    l_sql := TO_CLOB(l_sql)||'CASE MAX('||i.value||') OVER(PARTITION BY '||p_key_coln|| ')
               WHEN MIN('||i.value||') OVER(PARTITION BY '||p_key_coln||' ) THEN 1 ELSE 0 END
               AS same_'||i.value||'_flag,';
  
  END LOOP;
  
  l_sql1 := 'WITH t1 AS (
             SELECT  '''||p_src_tab||''' tbl,
                     s.*
               FROM '|| p_src_tab ||' s '||p_src_whr||'
            UNION ALL
             SELECT  '''||p_tgt_tab||''' tbl,
                     t.*
               FROM  '|| p_tgt_tab ||' t '||p_tgt_whr||'
           ),
     t2 AS (
            SELECT  t1.*,'||TO_CLOB(l_sql)||
            'COUNT(*) OVER(PARTITION BY '||p_key_coln ||') row_cnt
            FROM  t1
            )    ';

  FOR i IN (SELECT REGEXP_SUBSTR(p_comp_coln,'[^,]+',1,LEVEL) value
              FROM dual
           CONNECT BY LEVEL <= LENGTH(p_comp_coln) - LENGTH(REPLACE(p_comp_coln, ',' )) +1)
  LOOP
  
    l_sql2 := TO_CLOB(l_sql2)||'
    CASE
          WHEN row_cnt = 1 THEN '||i.value||
          ' WHEN same_'||i.value||'_flag = 0 THEN '||i.value||' END '|| i.value||',';
          
  
  END LOOP;
  
  l_sql1 := TO_CLOB(l_sql1)||' 
            SELECT '||p_key_coln||', tbl,row_cnt, 
            '||TO_CLOB(SUBSTR(l_sql2,1, LENGTH(l_sql2)-1 ))|| '
            FROM t2 WHERE 1=1 AND';
  
  FOR i IN (SELECT REGEXP_SUBSTR(p_comp_coln,'[^,]+',1,LEVEL) value
              FROM dual
           CONNECT BY LEVEL <= LENGTH(p_comp_coln) - LENGTH(REPLACE(p_comp_coln, ',' )) +1)
  LOOP
  
    l_sql3 := TO_CLOB(l_sql3)||'  same_'||i.value||'_flag = 0 or ';
            
  END LOOP;  
  
  l_sql1 := TO_CLOB(l_sql1)||' '||TO_CLOB(l_sql3)||'
             row_cnt = 1
            ORDER BY '||p_key_coln||';';
  
  l_length := LENGTH(l_sql1);
  IF l_length < 32767 THEN
    dbms_output.put_line(l_sql1);     
  ELSE
    dbms_output.put_line(LENGTH(l_sql1));    
  END IF;

END pr_compare_table2;
2

There are 2 answers

2
markp-fuso On

Background:

  • ASE's SQL dialect is rather old (due to SAP letting it die on the vine) with limited capabilities (eg, does not support windowing functions, does not support common table expressions - CTEs)
  • IQ's SQL dialect is more 'modern' (though I can't speak to its completeness of support for windowing functions and CTEs)

If OP's IQ query (with CTEs and windowing functions) generates the expected result then one option OP may want to consider:

  • setup proxy/remote tables (from IQ to ASE)
  • run the IQ query against the proxy/remote tables
  • one of IQ's selling points was the ability to run IQ-centric queries against ASE data using this approach
  • one downside (of course) is that IQ will need to pull all of the ASE data across the network so that it can be processed by the IQ query engine

I was able to hobble together a couple queries that run in ASE that generate the expected result ...

Adding a few more rows to the data set:

insert into tgt_tab values ( 6, 'bob',   'unemp',    350)      -- TGT_TAB only row

insert into src_tab values (10, 'sara',  'desk',     450)      -- common key but with differences
insert into tgt_tab values (10, 'sara',  'umenp',    300)      -- in both non-key columns

One 4-way union query:

-- only in src_tab

select  s.id,s.name,tbl='SRC_TAB',ROW_CNT=1,s.job,s.sal
from    src_tab s
where   not exists(select  1
                   from    tgt_tab t
                   where   t.id   = s.id
                   and     t.name = s.name)

union all

-- only in tgt_tab

select  t.id,t.name,tbl='TGT_TAB',ROW_CNT=1,t.job,t.sal
from    tgt_tab t
where   not exists(select  1
                   from    src_tab s
                   where   t.id   = s.id
                   and     t.name = s.name)

union all

-- common key but job and/or sal are different (src_tab)

select  s.id,s.name,tbl='SRC_TAB',ROW_CNT=2,
        case when s.job != t.job then s.job else NULL end as job,
        case when s.sal != t.sal then s.sal else NULL end as sal

from    src_tab s

join    tgt_tab t
on      s.id   = t.id
and     s.name = t.name

where   s.job != t.job
or      s.sal != t.sal

union all

-- common key but job and/or sal are different (tgt_tab)

select  s.id,s.name,tbl='TGT_TAB',ROW_CNT=2,
        case when s.job != t.job then t.job else NULL end as job,
        case when s.sal != t.sal then t.sal else NULL end as sal

from    src_tab s

join    tgt_tab t
on      s.id   = t.id
and     s.name = t.name

where   s.job != t.job
or      s.sal != t.sal

order by id,name,tbl

We can cut this down to a 3-way union by using a 2-row dummy/derived table to split a common key match into 2 separate rows:

-- only in src_tab

select  s.id,s.name,tbl='SRC_TAB',ROW_CNT=1,s.job,s.sal
from    src_tab s
where   not exists(select  1
                   from    tgt_tab t
                   where   t.id   = s.id
                   and     t.name = s.name)

union all

-- only in tgt_tab

select  t.id,t.name,tbl='TGT_TAB',ROW_CNT=1,t.job,t.sal
from    tgt_tab t
where   not exists(select  1
                   from    src_tab s
                   where   t.id   = s.id
                   and     t.name = s.name)

union all

-- common key but job and/or sal are different

select  s.id,s.name,dt.tbl,ROW_CNT=2,
        case when dt.tbl = 'SRC_TAB' and s.job != t.job then s.job
             when dt.tbl = 'TGT_TAB' and s.job != t.job then t.job
             else NULL
        end as job,
        case when dt.tbl = 'SRC_TAB' and s.sal != t.sal then s.sal
             when dt.tbl = 'TGT_TAB' and s.sal != t.sal then t.sal
             else NULL
        end as sal

from    src_tab s

join    tgt_tab t
on      s.id   = t.id
and     s.name = t.name

join    (select 'SRC_TAB'
         union all
         select 'TGT_TAB'
        ) dt (tbl)
on      1=1

where   s.job != t.job
or      s.sal != t.sal

order by id,name,tbl

NOTES:

  • because each standalone query is geared towards a specific scenario our ROW_CNT values are just hardcoded values (the only reason ROW_CNT is generated by these queries is to match OP's expected output)
  • since these queries do not rely on CTEs or windowing functions it should be (relatively) easy to tweak the syntax to insure they run in most (all?) other RDMBS environments, though the performance of said queries may not be comparable to CTE/windowing-function queries

Both of these queries generate the following:

 id          name       tbl     ROW_CNT     job        sal
 ----------- ---------- ------- ----------- ---------- -----------
           1 john       SRC_TAB           2 NULL               100
           1 john       TGT_TAB           2 NULL               150

           3 chad       SRC_TAB           2 emplyd            NULL
           3 chad       TGT_TAB           2 empl              NULL

           5 emmy       SRC_TAB           1 desk               600

           6 bob        TGT_TAB           1 unemp              350

          10 sara       SRC_TAB           2 desk               450
          10 sara       TGT_TAB           2 umenp              300

NOTES:

  • blank lines inserted for readability
  • results generated from an ASE 16.0 SP04 PL04 instance

I'm guessing OP's question is a minimal, reproducible example of a bigger project that needs to run a diff against a set tables.

If this is the case then it won't take long to figure out that none of the queries presented so far (OP's queries, my queries) 'scale up' very well. All of these queries are hardcoded towards a specific pair of tables, a specific set of columns and a specific set of datatypes.

For a full blown diff process OP will need to take into consideration:

  • a variable number of key columns
  • a variable number of non-key columns (ie, expanded number of case statements)
  • a wider assortment of datatypes (eg, comparing (b)lob columns? comparing date/time datatypes with rounding issues at the ms level?, comparing textual data stored under different character sets? comparing float/reals where accuracy is hardware dependent?)
  • the need to treat NULLs specially (eg, in some SQL dialects = NULL and is NULL behave differently)
  • the performance overhead for large(ish) tables may be prohibitive

To build a robust tool will take a good bit of time and effort, especially if trying to build the entire tool in SQL. A few ideas to consider:

  • use something other than SQL to perform the comparisons (eg, awk, perl, ruby, c(++), java)
  • see if any of your RDBMS vendors provide a data comparison tool (eg, Sybase/SAP's Repserver has rs_subcmp (OS-level, java-based utility) and Data Assurance (OS-level, java-based, licen$$$ed utility))
  • see if any (in-house) ETL tools have a 'diff' capability
  • find a 3rd party product that has a 'diff' capability (eg, 3rd party tools that migrate data between different RDBMS products and/or instances)
0
mradul On
with t1 as (
             select  'SRC_TAB' tbl,
                     s.*
               from  src_tab s
            union all
             select  'TGT_TAB' tbl,
                     t.*
               from  tgt_tab t
           ),
     t2 as (
            select  t1.*,
                    case max(job) over(partition by id,name)
                      when min(job) over(partition by id,name) then 1
                      else 0
                    end same_job_flag,
                    case max(sal) over(partition by id,name)
                      when min(sal) over(partition by id,name) then 1
                      else 0
                    end same_sal_flag,
                    count(*) over(partition by id,name) row_cnt
              from  t1
            )            
select  id,
        name,
        tbl,
        row_cnt,
        case
          when row_cnt = 1 then job 
          when same_job_flag = 0 then job
        end job,
        case
          when row_cnt = 1 then sal 
          when same_sal_flag = 0 then sal
        end sal
  from  t2
  where same_job_flag = 0
     or same_sal_flag = 0
     or row_cnt = 1
  order by id,
           name,
           tbl;