Query with multiple permutations that depend on multiple rows

353 views Asked by At

(SQL2014 if that makes a difference)

Say I have [tableA]

id  ResultID  SampleID  ERRORCODE col4 colN

1   9001      1100      0         ...  ...
2   9002      1100      100       ...  ...
3   9003      1100      200       ...  ...
4   9004      1100      300       ...  ...
5   9005      1101      0         ...  ... 
6   9006      1101      0         ...  ...
7   9007      1101      0         ...  ...
8   9008      1101      0         ...  ...
9   9009      1102      0         ...  ...
10  9010      1102      100       ...  ...
11  9011      1102      200       ...  ...
12  9012      1102      0         ...  ...

and I want to produce a result that only shows the sample sets (identified by common SampleID vals) that contain all of errorcodes 0, 100, 200 & 300. i.e. the above would reduce to:

id  ResultID  SampleID  ERRORCODE col4 colN

1   9001      1100      0         ...  ...
2   9002      1100      100       ...  ...
3   9003      1100      200       ...  ...
4   9004      1100      300       ...  ...

So I'm needing a query that looks at multiple rows at one time and downselects groups that have (i) certain key values in a column, here thats [ERRORCODE] and (ii) consistent values in another column, here thats [SampleID]. I've looked at:

Query with multiple IN clause on multiple rows

But didn't have any joy. The subquery that worked for oliboon is only working on one row for me. Olga's code didn't work at all and Aushin's produced unexpected results (and removing half the syntax didn't change them)!

N00b to SQL, so I'm a bit lost!

The "table" keyword used in a few of those solutions listed doesn't seem to work for me - and it may be that those answers were intent on creating a subtable which was then further parsed in subqueries for their solution. All I get is an "incorrect syntax near the keyword table" error if I try anything like:

select distinct SampleID from table [my].[db].[path].[tableA]

I thought a query of the format

SELECT *
From [tableA]
where
    [SampleID] in (Select [SampleID] from [tableA] where [ERRORCODE] = 0) and
    [SampleID] in (Select [SampleID] from [tableA] where [ERRORCODE] = 100) 

Would have worked, but it only returns a result if the two [ERRORCODE] checks are for the same code, i.e. 100. Which of course is useless. Its checking line by line rather than doing the first part of the logic gate, then the 2nd.

If I was able to make the first part of the where [ERRORCODE]=0 complete, then downselect from those SampleID's where [ERRORCODE]=100 and repeat, then that'd work. Not sure how to do that though.

edit: Ach FFS. Turns out every single errorcode I was looking must be mutually exclusive with each other - no matter what combination, no two could occur together. I'd assumed that in the size of the DB I had, there would have been a combination somewhere.

I checking my query with errorcodes I can see from an unqualified SELECT * that sit beside each other and did prove it works.

Question is invalid I suppose.

2

There are 2 answers

1
Gordon Linoff On

I'm a little lost. If you want samples with all four error codes, then this should do what you want:

select a.*
from [tableA] a
where a.SampleID in (Select a2.SampleID from tableA a2 where a2.ERRORCODE = 0) and
      a.SampleID in (Select a2.SampleID from tableA a2 where a2.ERRORCODE = 100) and
      a.SampleID in (Select a2.SampleID from tableA a2 where a2.ERRORCODE = 200) and
      a.SampleID in (Select a2.SampleID from tableA a2 where a2.ERRORCODE = 300) ;

This should work, although the question claims that it does not. It is not necessarily going to have the best performance.

I usually recommend aggregation to et the sample ids:

select sampleid
from tablea
where errorcode in (0, 100, 200, 300)
group by sampleid
having count(distinct errorcode) = 4;

If you don't want duplicates as well as covering all the error codes, then use:

having count(distinct errorcode) = 4 and count(*) = 4

This also may not have the best performance in all cases. But the performance is predictable -- it changes little based on the number of codes you are looking for. And the having clause can make this quite versatile.

Then, if you want all the original data, you can use join, in, or exists:

select a.*
from tablea a join
     (select sampleid
      from tablea
      where errorcode in (0, 100, 200, 300)
      group by sampleid
      having count(distinct errorcode) = 4
     ) a2
     on a2.sampleid = a.sampleid;
  
4
Serg On

You can use top(1) with ties .. order by to skip all but first SampleID + ERRORCODE combinations. Then count rows matching with codes, it must be exactly the number of codes.

with codes as (
    select 0 c union all
    select 100 union all
    select 200 union all
    select 300
),
errlog as ( 
   -- take only first occurence of SampleID + ERRORCODE
   select top(1) with ties id,  ResultID,  SampleID,  ERRORCODE, col4, colN
   from [my].[db].[path].[tableA]
   order by row_number() over(partition by SampleID, ERRORCODE order by ResultID)
)
select id,  ResultID,  SampleID,  ERRORCODE, col4, colN
from (
   select t.*, count(*) over(partition by t.SampleID) cnt
   from errlog t
   join codes on codes.c = t.ERRORCODE
) t
where cnt = (select count(*) from codes);