I have a big dataframe (1M lines) with two columns A,B.
For each couple a_i, a_j of A, I want to know the number of b in B such that there is both rows a_i,b and a_j,b
For example :
A | B |
---|---|
a1 | b1 |
a2 | b1 |
a3 | b2 |
a1 | b3 |
a3 | b3 |
a4 | b3 |
a2 | b4 |
a1 | b5 |
a3 | b5 |
a3 | b6 |
a4 | b6 |
here I have, among others, the couple a1,a3 that share b3 and b5
The result would be the following matrix (that, by definition, is symetric) :
a1 | a2 | a3 | a4 | |
---|---|---|---|---|
a1 | xx | 1 | 2 | 1 |
a2 | 1 | xx | 0 | 0 |
a3 | 2 | 0 | xx | 2 |
a4 | 1 | 0 | 2 | xx |
I think that the following would work:
df = pd.DataFrame({'A' : ['a1','a2','a3','a1','a3','a4','a2','a1','a3','a3','a4'],
'B':['b1','b1','b2','b3','b3','b3','b4','b5','b5','b6','b6']})
df_dum = df.set_index('A')['B'].str.get_dummies().reset_index()
df_dum = df_dum.groupby('A').sum()
np_cnt = df_dum.to_numpy()
np_mul = np.matmul(np_cnt,np_cnt.T)
but it takes way too much time and memory and does not run with my 1M rows. In addition, the diagonal is coumputed whereas I don't need it and I think that passing through the dummy is not so much a good idea especially because the resulting binary is really sparse.
But I don't have any more ideas...
What would you propose ?
EDIT:
for a little bit more of context, let's say that A are students and B are courses. In the end I want to know for any two students, how much courses they have together. And that for every couple of students that at least share a course. If that makes more sense :)
Try with
itertools.permutations
:If you want to mask the cells where index and columns are the same with "xx":