SAS Proc Standard and the equivalent in Python


In SAS, PROC STANDARD lets users standardise data to a given mean and standard deviation within each group. Here, I want to standardise age to a mean of 0 and a standard deviation of 5 for each surname. How can I do this with a pandas DataFrame?

SAS code:

data mydata;
input surname $ name $ age ;
datalines;
Lim John 25 
Lim David 100 
Tan Mary 50 
Tan Tom 30
;
run;

PROC STANDARD MEAN=0 STD=5 DATA=mydata OUT=mydata11;
VAR age;
BY surname; 
 run;

SAS Output

surname  name   age
Lim      John   -3.535533906
Lim      David   3.5355339059
Tan      Mary    3.5355339059
Tan      Tom    -3.535533906

There is 1 answer

Answer by Stu Sztukowski (best answer)

Following this answer from stats.stackexchange (Transform Data to Desired Mean and Standard Deviation), we can define a function that rescales a series to a target mean and standard deviation, then apply it to each surname group with a lambda:

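import pandas as pd

df = pd.DataFrame({'surname': ['Lim', 'Lim', 'Tan', 'Tan'],
                   'name':    ['John', 'David', 'Mary', 'Tom'],
                   'age':     [25, 100, 50, 30]})

def standard(x, mean, std):
    # Centre x on its own mean, scale to the target std, then shift to the target mean.
    x_std = mean + (x-x.mean()) * std/x.std()
    return x_std

# Apply per surname. reset_index(drop=True) discards the group index, so this
# assumes the rows are already ordered by surname with a default 0..n-1 index.
df['age_standard'] = (df.groupby('surname')['age']
                        .apply(lambda x: standard(x, 0, 5))
                        .reset_index(drop=True)
                     )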
  surname   name  age   age_standard
0     Lim   John   25      -3.535534
1     Lim  David  100       3.535534
2     Tan   Mary   50       3.535534
3     Tan    Tom   30      -3.535534

We can confirm this:

df.groupby('surname')['age_standard'].mean() # Should be 0
df.groupby('surname')['age_standard'].std()  # Should be 5
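
As a variation (a sketch, not part of the original answer), the same function can be passed to groupby(...).transform, which returns a result indexed like the original frame. That should make the reset_index step unnecessary and keep the values aligned even when the rows are not sorted by surname. The rows below are the question's data, deliberately shuffled to illustrate this. Note that Series.std() divides by n - 1 by default, which, as far as I know, matches the default divisor used by PROC STANDARD.

```python
import pandas as pd

# Same rows as in the question, shuffled so the groups are not contiguous.
df = pd.DataFrame({'surname': ['Tan', 'Lim', 'Tan', 'Lim'],
                   'name':    ['Mary', 'John', 'Tom', 'David'],
                   'age':     [50, 25, 30, 100]})

def standard(x, mean, std):
    # Rescale x to the target mean and standard deviation.
    # Series.std() uses ddof=1 (n - 1 divisor) by default.
    return mean + (x - x.mean()) * std / x.std()

# transform() returns values aligned to df's own index, group by group.
df['age_standard'] = (df.groupby('surname')['age']
                        .transform(lambda x: standard(x, 0, 5)))

# Per-group mean and standard deviation of the rescaled column.
check = df.groupby('surname')['age_standard'].agg(['mean', 'std'])

print(df)
print(check)
```

The per-group means in check should come out at 0 (up to floating-point error) and the standard deviations at 5, and each name should get the same value as in the SAS output.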