How to create boxplots from a pandas column of strings

I'm trying to draw boxplots from arrays stored in a dataframe, like the second picture here.

An extract of my data (6 years of data, 150 per year):

columns : idx | id | mods | Mean(Moyennes) | Median | Values_array | date2021

idx1 | 2021012 | Day | 273.7765808105 | 273.5100097656 | 272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828 |2021-12-01T00:00:00.000Z

idx2 | 2021055 | Night| 287.5215759277 | 287.6099853516 | 286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344 |2021-02-24T00:00:00.000Z

Here is my data plotted with sns.relplot

To plot it, I tried :

sns.boxplot(data=df2018, x="Moyennes", y="date2018", hue = "mods")

It turns out, it looks like this

I don't understand why the date axis turns out like this and not like it does with sns.relplot. Also, I want to boxplot each array as a whole because, as I understand it, you have to pass the full array for the plot to compute the mean, median, etc.

I also tried :

for i, j in sorted(df2017.iterrows()):
    values = j[4]
    date = j[6]
    id=j[0]
    fig, ax1 = plt.subplots(figsize=(10, 6))
    fig.canvas.manager.set_window_title('Température 2020')
    fig.subplots_adjust(left=0.075, right=0.95, top=0.9, bottom=0.25)
    bp = ax1.boxplot(values, notch=False, sym='+', vert=True, whis=1.5)
    plt.setp(bp['boxes'], color='black')
    plt.setp(bp['whiskers'], color='black')
    plt.setp(bp['fliers'], color='red', marker='+')

The output looks like this, which is nice, but I want all the boxplots for one year to be in the same plot.

like this

I'm working in VS Code on a Linux VM.

My question is, how can I boxplot several arrays with seaborn?

There is 1 answer

Answer by Trenton McKinney (best answer)
  • The primary issue is cleaning and reshaping the pandas dataframe:
    • The column 'Values_array' is a string of comma-separated numbers, which must be split into separate rows and then converted to float type.
  • Depending on the data, use the figure-level method sns.catplot with kind='box', or the axes-level method sns.boxplot.
    • Explore the col, col_wrap, and row parameters for subplots (facets) with sns.catplot.
  • Tested in python 3.11.2, pandas 2.0.0, matplotlib 3.7.1, seaborn 0.12.2
import pandas as pd
import seaborn as sns

# sample data
data = {'idx': ['idx1 ', 'idx2 '],
        'id': [2021012, 2021055],
        'mods': ['Day', 'Night'],
        'Mean(Moyennes)': [273.7765808105, 287.5215759277],
        'Median': [273.5100097656, 287.6099853516],
        'Values_array': ['272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828', '286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344'],
        'date2021': ['2021-12-01T00:00:00.000Z', '2021-02-24T00:00:00.000Z']}
df = pd.DataFrame(data)

# convert the column to a datetime.date type since there's no time component
df.date2021 = pd.to_datetime(df.date2021).dt.date

# split the strings in the Values_array column
df.Values_array = df.Values_array.str.split(',')

# explode the list of strings to individual rows
df = df.explode(column='Values_array', ignore_index=True)

# set the type of the Values_array column to float
df.Values_array = df.Values_array.astype(float)

# plot the data in a single facet
g = sns.catplot(data=df, x='date2021', y='Values_array', kind='box')
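
The col and col_wrap parameters mentioned in the list above could be used to give each year its own facet. The following is only a sketch under stated assumptions: the data is made up with numpy, and the column names year, month_day and value are hypothetical stand-ins for whatever the cleaned multi-year dataframe actually contains.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# made-up long-form data standing in for several years of cleaned values
rng = np.random.default_rng(0)
records = []
for year in (2020, 2021):
    for month_day in ('02-24', '06-15', '12-01'):
        for mods, offset in (('Day', 5.0), ('Night', 0.0)):
            for value in rng.normal(loc=280.0 + offset, scale=1.5, size=10):
                records.append({'year': year, 'month_day': month_day,
                                'mods': mods, 'value': value})
long_df = pd.DataFrame(records)

# one facet per year, stacked vertically; Day and Night are split by hue
g = sns.catplot(data=long_df, x='month_day', y='value', hue='mods',
                col='year', col_wrap=1, kind='box', height=3, aspect=3)
g.set_axis_labels('date (month-day)', 'value')
```

With real data, the year column could be derived from the date column before plotting, so that a single dataframe holding all six years feeds one call.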

# same plot with sns.boxplot instead of sns.catplot
ax = sns.boxplot(data=df, x='date2021', y='Values_array')
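
Because sns.boxplot is axes-level, it can draw onto an existing matplotlib Axes, which may help when a whole year of dates has to fit in one figure. This is a sketch rather than a definitive layout: the two rows use shortened, rounded values for brevity, the dates are kept as ISO strings (an assumption made here so that sorting and tick labels stay simple), and the figure size and rotation are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# two short rows in the same layout as the question (values rounded for brevity)
raw = pd.DataFrame({
    'mods': ['Day', 'Night'],
    'Values_array': ['272.38,272.40,276.52,274.38', '286.04,284.86,286.80,288.24'],
    'date2021': ['2021-12-01T00:00:00.000Z', '2021-02-24T00:00:00.000Z'],
})

# same cleaning steps as in the answer, written as one chain
long_df = (
    raw.assign(date2021=pd.to_datetime(raw['date2021']).dt.strftime('%Y-%m-%d'),
               Values_array=raw['Values_array'].str.split(','))
       .explode('Values_array', ignore_index=True)
       .astype({'Values_array': float})
       .sort_values('date2021')  # ISO strings sort chronologically
)

# draw every date on one shared Axes, colored by Day/Night
fig, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(data=long_df, x='date2021', y='Values_array', hue='mods', ax=ax)
ax.tick_params(axis='x', rotation=90)  # keeps many date labels readable
ax.set(xlabel='date', ylabel='value')
```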

df before cleaning

     idx       id   mods  Mean(Moyennes)      Median                                                                                                                                                                           Values_array                  date2021
0  idx1   2021012    Day      273.776581  273.510010                                                                               272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828  2021-12-01T00:00:00.000Z
1  idx2   2021055  Night      287.521576  287.609985  286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344  2021-02-24T00:00:00.000Z

df after cleaning

      idx       id   mods  Mean(Moyennes)      Median  Values_array    date2021
0   idx1   2021012    Day      273.776581  273.510010    272.380005  2021-12-01
1   idx1   2021012    Day      273.776581  273.510010    272.380005  2021-12-01
2   idx1   2021012    Day      273.776581  273.510010    272.399994  2021-12-01
3   idx1   2021012    Day      273.776581  273.510010    272.399994  2021-12-01
4   idx1   2021012    Day      273.776581  273.510010    276.519989  2021-12-01
5   idx1   2021012    Day      273.776581  273.510010    274.380005  2021-12-01
6   idx1   2021012    Day      273.776581  273.510010    274.380005  2021-12-01
7   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
8   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
9   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
10  idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
11  idx2   2021055  Night      287.521576  287.609985    284.859985  2021-02-24
12  idx2   2021055  Night      287.521576  287.609985    285.040009  2021-02-24
13  idx2   2021055  Night      287.521576  287.609985    285.040009  2021-02-24
14  idx2   2021055  Night      287.521576  287.609985    286.720001  2021-02-24
15  idx2   2021055  Night      287.521576  287.609985    286.799988  2021-02-24
16  idx2   2021055  Night      287.521576  287.609985    286.799988  2021-02-24
17  idx2   2021055  Night      287.521576  287.609985    287.000000  2021-02-24
18  idx2   2021055  Night      287.521576  287.609985    288.239990  2021-02-24
19  idx2   2021055  Night      287.521576  287.609985    288.239990  2021-02-24
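
As a quick check on the cleaned, long-form data, the quartiles can be computed per id with pandas. With default settings these should correspond to the lower edge, middle line and upper edge of each box, which is why one row per value is enough and no array needs to be passed per box. This is a sketch using only the two sample rows; the names q1, median and q3 are labels chosen here. Since the sample is only an extract, the results need not agree with the stored Mean(Moyennes) and Median columns.

```python
import pandas as pd

# the two sample rows from the question, reduced to the columns needed here
raw = pd.DataFrame({
    'id': [2021012, 2021055],
    'Values_array': [
        '272.3800048828,272.3800048828,272.3999938965,272.3999938965,'
        '276.5199890137,274.3800048828,274.3800048828',
        '286.0400085449,286.0400085449,286.0400085449,286.0400085449,'
        '284.8599853516,285.0400085449,285.0400085449,286.7200012207,'
        '286.799987793,286.799987793,287,288.2399902344,288.2399902344',
    ],
})

# split, explode and convert, as in the cleaning steps of the answer
long_df = (
    raw.assign(Values_array=raw['Values_array'].str.split(','))
       .explode('Values_array', ignore_index=True)
       .astype({'Values_array': float})
)

# quartiles per id, one row per box
stats = long_df.groupby('id')['Values_array'].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ['q1', 'median', 'q3']
print(stats)
```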