Square a matrix in python

169 views Asked by At

Hello let say I have a df such as :

G1  G2  VALUE 
SP1 SP2 1
SP1 SP3 2
SP1 SP4 3
SP2 SP3 4
SP2 SP4 5
SP3 SP4 6 

how can I get a the data as square ? (i.e., have the same number of rows and columns)

with something like

data = [[0,  1,  2,  3],
[1,  0, 4, 5],
[9, 10,  0,  8,  7],
[2, 4,  0,  6],
[3,  5,  6,  0]]

ids = ['SP1','SP2','SP3','SP4]

dm = DistanceMatrix(data, ids) (function from skbio package)

and get a matrix :

    SP1 SP2 SP3 SP4
SP1 0   1   2   3
SP2 1   0   4   5
SP3 2   4   0   6
SP4 3   5   6   0

And if some of you are familiar with it, how can we also do the same thing but with a 1/2 matrix :

SP1 0 
SP2 1   0   
SP3 2   4   0   
SP4 3   5   6   0
    SP1 SP2 SP3 SP4

(here is mor for biopython) thanks a lot for your help


Other exemple

d = {'G1': ['SP1','SP2','SP2'], 'G2': ['SP3','SP3','SP1'],'VALUE' :[1,2,3]}
df = pd.DataFrame(data=d)

I should get :

SP1 0
SP2 3   0
SP3 1   2  0
   SP1 SP2 SP3 

and

SP1 0   3  1
SP2 3   0  2
SP3 1   2  0
   SP1 SP2 SP3 
2

There are 2 answers

1
inspectorG4dget On BEST ANSWER

I think this is what you're looking for, more or less:

In [257]: df
Out[257]: 
    G1   G2  VALUE
0  SP1  SP2      1
1  SP1  SP3      2
2  SP1  SP4      3
3  SP2  SP3      4
4  SP2  SP4      5
5  SP3  SP4      6

In [258]: df.pivot(index='G1', columns='G2', values='VALUE')
Out[258]: 
G2   SP2  SP3  SP4
G1                
SP1  1.0  2.0  3.0
SP2  NaN  4.0  5.0
SP3  NaN  NaN  6.0

In [259]: df.pivot(index='G1', columns='G2', values='VALUE').fillna(value=0)
Out[259]: 
G2   SP2  SP3  SP4
G1                
SP1  1.0  2.0  3.0
SP2  0.0  4.0  5.0
SP3  0.0  0.0  6.0

In response to the edit to the question:

In [277]: d = {'G1': ['SP1','SP2','SP2'], 'G2': ['SP3','SP3','SP1'],'VALUE' :[1,2,3]}

In [278]: df = pd.DataFrame(data=d)

In [279]: d = df.pivot(index='G1', columns='G2', values='VALUE').fillna(value=0).to_dict()

In [280]: for s,dd in {**d}.items(): 
     ...:     for t,v in {**dd}.items(): 
     ...:         d.setdefault(t, {})[s] = v 
     ...:

In [281]: d
Out[281]: 
{'SP1': {'SP1': 0.0, 'SP2': 3.0, 'SP3': 1.0},
 'SP3': {'SP1': 1.0, 'SP2': 2.0},
 'SP2': {'SP1': 3.0, 'SP3': 2.0}}

In [282]: pd.DataFrame(data=d)
Out[282]: 
     SP1  SP3  SP2
SP1  0.0  1.0  3.0
SP2  3.0  2.0  NaN
SP3  1.0  NaN  2.0

In [283]: pd.DataFrame(data=d).fillna(value=0)
Out[283]: 
     SP1  SP3  SP2
SP1  0.0  1.0  3.0
SP2  3.0  2.0  0.0
SP3  1.0  0.0  2.0
1
Dani Mesejo On

You could use numpy.unique, crosstab and reindex:

import numpy as np

# find unique values from both columns (flattened)
idx = np.unique(df[['G1', 'G2']])

# cross tabulation of G1 and G2
res = pd.crosstab(index=df['G1'], columns=df['G2'], values=df['VALUE'], aggfunc='sum')

# reindex using unique values from both columns
res = res.reindex(index=idx, columns=idx, fill_value=0).fillna(0)

print(res)

Output

G2   SP1  SP2  SP3  SP4
G1                     
SP1  0.0  1.0  2.0  3.0
SP2  0.0  0.0  4.0  5.0
SP3  0.0  0.0  0.0  6.0
SP4  0.0  0.0  0.0  0.0

The first step:

# find unique values from both columns (flattened)
idx = np.unique(df[['G1', 'G2']])

creates:

['SP1' 'SP2' 'SP3' 'SP4']

the second step:

# cross tabulation of G1 and G2
res = pd.crosstab(index=df['G1'], columns=df['G2'], values=df['VALUE'], aggfunc='sum')

produces:

G2   SP2  SP3  SP4
G1                
SP1  1.0  2.0  3.0
SP2  NaN  4.0  5.0
SP3  NaN  NaN  6.0

then with the values obtained in step 1 reindex the DataFrame from step 2:

# reindex using unique values from both columns
res = res.reindex(index=idx, columns=idx, fill_value=0).fillna(0)