Boxplot Analysis


I'm new to EDA. I just figured out how to draw boxplots of my numerical data, but I have trouble reading information from them:

boxplot image:


Can anyone tell me:

  1. What can I expect when looking at a boxplot?
  2. Why do the plots other than age not look like a box?
  3. How can I tell if a point is an outlier?

By the way, here is the boxplot code I'm using:

import matplotlib.pyplot as plt
import seaborn as sns

def plot_boxplots_grid(dataframe, n_cols=3, subplot_size=(5, 3)):
    # Keep non-object columns with more than two distinct values (skips binary flags)
    numerical_cols = [col for col in dataframe.columns if dataframe[col].dtype != 'object' and dataframe[col].nunique() > 2]
    # Ceiling division: enough rows to hold every selected column
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    plt.figure(figsize=(subplot_size[0] * n_cols, subplot_size[1] * n_rows))
    for i, col in enumerate(numerical_cols):
        plt.subplot(n_rows, n_cols, i + 1)  # subplot indices are 1-based
        sns.boxplot(y=dataframe[col])
        plt.title(col)
    plt.tight_layout()
    plt.show()

There is 1 answer

Fruity Fritz On

1. What can I expect when I look at a boxplot?

This article explains boxplots well and runs you through the calculations. In short:

A box and whisker plot—also called a box plot—displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.

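Boxplots can also show outliers, depending on how you define them. A common convention flags any point that lies more than 1.5 x IQR below Q1 or above Q3, where IQR stands for the interquartile range (Q3 - Q1). When outliers are drawn separately, the whiskers stop at the most extreme points that are not outliers, rather than at the true minimum and maximum. The 1.5 multiplier is only a convention, not a hard rule.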
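To make the 1.5 x IQR rule concrete, here is a minimal sketch that computes the five-number summary and the outlier cutoffs by hand. The function name iqr_outliers and the sample values are made up for illustration, and I'm assuming linearly interpolated quartiles (method="inclusive" in Python's statistics module), which should correspond to the default of numpy.percentile.

```python
import statistics

def iqr_outliers(values, whis=1.5):
    """Summarise values the way a boxplot does (hypothetical helper)."""
    data = sorted(values)
    # Quartiles with linear interpolation between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Cutoffs ("fences") beyond which a point counts as an outlier
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # Whiskers end at the most extreme points that are NOT outliers
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up sample: nine small values and one very large one
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = iqr_outliers(sample)
print(summary)
```

For this sample the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the upper cutoff is 7.75 + 1.5 x 4.5 = 14.5. The value 50 is above that, so it is reported as an outlier and the upper whisker ends at 9.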

2. Why don't they seem like a box?

That is because the other variables contain a lot of outliers, which are denoted by the diamond markers. Seaborn scales the axis to show all the data, so the outliers stretch the axis and "squeeze" the box down to a very small height. Try setting showfliers=False in the line where you call sns.boxplot(), which hides the outlier markers. After you do this, you should see a more standard-looking boxplot.

However, with as many outliers as you have, boxplots do not seem like the right choice for displaying your data. By hiding the outliers you are also hiding important data points and giving a skewed picture of the distribution. I would suggest violin plots, which I find much more visually appealing, and you can also draw a box plot inside the violin. Seaborn has a seaborn.violinplot() function that takes very similar inputs to seaborn.boxplot().
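As a rough sketch of the two suggestions above, the function below draws one column three ways: the default boxplot, a boxplot with showfliers=False, and a violin plot with a small box inside (inner="box"). The function name compare_views and the skewed "income" data are made up for illustration and are not from the question's dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def compare_views(series):
    """Draw one numeric column as a boxplot, a flier-free boxplot and a violin."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    # Default: outliers are drawn and can squash the box
    sns.boxplot(y=series, ax=axes[0])
    axes[0].set_title("boxplot (default)")
    # Same data with the outlier markers hidden
    sns.boxplot(y=series, showfliers=False, ax=axes[1])
    axes[1].set_title("boxplot, showfliers=False")
    # Violin shows the full density, with a miniature box inside
    sns.violinplot(y=series, inner="box", ax=axes[2])
    axes[2].set_title("violinplot, inner='box'")
    fig.tight_layout()
    return fig

# Made-up, strongly right-skewed data standing in for a real column
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=500), name="income")
```

Calling compare_views(income) followed by plt.show() displays the three panels side by side. In the question's function, the equivalent change would be adding showfliers=False to the sns.boxplot call or swapping it for sns.violinplot.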

3. How can I tell if it is an outlier?

Seaborn draws outliers as diamond markers by default. According to this post, it uses the standard 1.5 x IQR rule:

It appears, by testing, that seaborn uses whis=1.5 as the default.

where whis is an optional argument to the sns.boxplot() method:

whis : float, optional
    Proportion of the IQR past the low and high quartiles to extend the plot whiskers.
    Points outside this range will be identified as outliers.
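To see how whis changes which points get flagged, one option is to ask matplotlib for the numbers directly through matplotlib.cbook.boxplot_stats. This is a sketch on made-up values, and it assumes seaborn's boxplot follows the same whisker rule as matplotlib, which I have not verified for every seaborn version.

```python
from matplotlib import cbook

# Made-up sample: nine small values and one moderately large one
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]

# boxplot_stats returns one dict per dataset; we only pass one dataset
stats_default = cbook.boxplot_stats(sample, whis=1.5)[0]
stats_wide = cbook.boxplot_stats(sample, whis=3.0)[0]

# 'fliers' holds the points drawn as outliers, 'whishi' the upper whisker end
print("whis=1.5:", stats_default["whishi"], list(stats_default["fliers"]))
print("whis=3.0:", stats_wide["whishi"], list(stats_wide["fliers"]))
```

With whis=1.5 the upper cutoff is 7.75 + 1.5 x 4.5 = 14.5, so 20 is flagged as an outlier. With whis=3.0 the cutoff moves to 21.25, so 20 is no longer an outlier and the whisker extends to it.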