I intend to handle skewness of a few columns in a data frame using this code:
upper_limit = df['column1'].mean() + 3*df['column1'].std()
lower_limit = df['column1'].mean() - 3*df['column1'].std()
df['column1'] = np.where(df['column1'] > upper_limit, upper_limit, np.where(df['column1'] < lower_limit, lower_limit, df['column1']))
Copying and pasting this code for each column separately would work, but I wanted a more elegant approach for my own satisfaction. I made a few attempts at a for-loop, but they were too embarrassing to post.
I was wondering if someone here could come up with an intelligent, Pythonic variant that is short and beautiful?
PS: I don't want to drop the outliers, and np.log() has already been applied.
@Yes's variant works perfectly for me:
def handle_skewness(column):
    upper_limit = column.mean() + 3 * column.std()
    lower_limit = column.mean() - 3 * column.std()
    return np.where(column > upper_limit, upper_limit, np.where(column < lower_limit, lower_limit, column))

# Iterate through the DataFrame columns
for column in X.columns:
    # Check whether the column is numeric (customize this based on what you need)
    if np.issubdtype(X[column].dtype, np.number):
        X[column] = handle_skewness(X[column])
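For anyone who wants to avoid the explicit loop, the same capping can probably be done in one vectorized step with pandas' own `DataFrame.clip`. This is only a sketch: the function name `clip_to_three_sigma` and the `n_std` parameter are my own inventions, and it assumes the numeric columns are floats, as they would be after `np.log()`.

```python
import numpy as np
import pandas as pd


def clip_to_three_sigma(df: pd.DataFrame, n_std: float = 3.0) -> pd.DataFrame:
    """Return a copy of df with numeric columns capped at mean +/- n_std * std.

    Hypothetical helper, not part of pandas; non-numeric columns are left as-is.
    """
    out = df.copy()
    # Pick only the numeric columns (same intent as the np.issubdtype check).
    numeric = out.select_dtypes(include=np.number)
    mean = numeric.mean()
    std = numeric.std()  # sample standard deviation (ddof=1), as in the loop version
    # With axis=1 the per-column limits are aligned to the column labels.
    out[numeric.columns] = numeric.clip(
        lower=mean - n_std * std,
        upper=mean + n_std * std,
        axis=1,
    )
    return out
```

Usage would then be `X = clip_to_three_sigma(X)`. Unlike the loop above, this returns a new DataFrame instead of modifying `X` in place, and each column stays a pandas Series rather than passing through a NumPy array.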