Counting the number of join symptoms

132 views Asked by At

I'm trying to do something that's very simple to do in other languages but in SQL it's proving rather puzzling.

I have a database with the patient ID as row, and 100 symptoms as columns. Symptoms are binary, 0 or 1 if the patient has it or not. Let's say Patient 1 has 0, 1, 1, ... for coughing, sneezing, headaches, etc. Patient 2 similarly has a binary list of symptoms 1, 0, 1, ....

What I want to do is simply generate a table that has counts for pairwise symptoms that number of patients that have that combination of symptoms.

So when I look up in the new table, I can see how many people have both sneezing and coughing. How many have both headache and sneezing and so forth. Only pairwise. If I look up sneezing and sneezing, it would give me back the number of people who have sneezing. Something like a matrix format of symptoms in both rows and columns and number of patients suffering from it in the content.

With a for loop this is incredibly easy in any other language. I'm new to SQL though, and struggling to figure out an efficient way to do this.

3

There are 3 answers

0
mab On BEST ANSWER

This answer by @GarethD, helped solved the problem: Thanks!

The key is to unpivot your data so that you have one row per patient per symptom, then join this data to itself to get pairs of symptoms, then pivot the joined data back up to get your counts. Since I don't know which DBMS to use, I cannot answer the question, I have however created a working demo in SQL Server that should demonstrate how to do it. – GarethD yesterday

1
Alisa On

Your question needs to write functions and using cursors in them.

However, there is an alternative approach:

Suppose you have a table with four columns:

a   b   c   d
-------------------------
1   0   1   1
1   1   0   0
0   0   1   0
0   0   0   1
1   1   1   1
0   1   0   1
1   0   1   0
0   1   1   1
0   0   1   1
1   0   1   0

This is a the answer:

Select sum(a) as a_a, 
    (select count(*) from patients where a=1 and b=1 as a_b) as a_b, 
    (select count(*) from patients where a=1 and c=1 as a_c) as a_c, 
    (select count(*) from patients where a=1 and d=1 as a_d) as a_d, 
    sum(b) as b_b, 
    (select count(*) from patients where b=1 and c=1 as b_c) as b_c, 
    (select count(*) from patients where b=1 and d=1 as b_d) as b_d, 
    sum(c) as c_c, 
    (select count(*) from patients where c=1 and d=1 as c_d) as c_d, 
    sum(d) as d_d 

Now, the result is like this:

a_a     a_b     a_c     a_d     b_b     b_c     b_d     c_c     c_d     d_d
-------------------------------------------------------------------------
5       2       3       2       4       2       2       7       4       6

It is not like a matrix; it has only one row, but it has everything you would like to have. You can expand it to your own table with many fields.

5
Alisa On

You want to study the interplay between different symptoms, right?

In this case, it is better to get the correlation between different symptoms and SQL does not suite this problem; you need to convert the table to a csv file, then, using R (or even excel), you can get the correlation.

Suppose this is your CSV file (C:/dataFile.csv):

a, b, c, d, e, f
----------------
1, 1, 1, 0, 1, 0
1, 1, 0, 1, 1, 1
0, 0, 0, 1, 0, 0
0, 1, 1, 0, 1, 0
1, 0, 0, 0, 1, 0
0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0
1, 1, 0, 1, 1, 1

Now, in R Statistics, you can run the following commands one by one:

> data <- read.csv("C:/dataFile.csv")
> summary(data)
> cor(data)

And this is the result:

          a         b          c           d          e          f
a 1.0000000 0.5000000  0.0000000  0.25819889 0.77459667  0.5773503
b 0.5000000 1.0000000  0.5773503  0.25819889 0.77459667  0.5773503
c 0.0000000 0.5773503  1.0000000 -0.44721360 0.44721360 -0.3333333
d 0.2581989 0.2581989 -0.4472136  1.00000000 0.06666667  0.7453560
e 0.7745967 0.7745967  0.4472136  0.06666667 1.00000000  0.4472136
f 0.5773503 0.5773503 -0.3333333  0.74535599 0.44721360  1.0000000

Two symptoms with higher correlations means that those two change mostly together. For example, [a and e] or [b and e] are highly correlated.

I hope that give you a broader idea of how to deal with data analysis.