How to create a stacked bar chart with a time series and aggregated values


I am having difficulty creating a stacked bar chart time series from my Pandas dataframe (image below). I would like to have the 'Date' on the x axis, the 'Hours' on the y axis, and each bar to show the time spent with each group in 'Category'.

Sample dataframe data

Do I need to use the pandas groupby function? The dataframe shown is only a sample; I have hundreds of rows of data from 2018 to 2020.

Trenton McKinney On BEST ANSWER
  • There is one existing solution, in the question "pandas - stacked bar chart with timeseries data".
    • The issue with that question is that its OP is not aggregating any data, so that solution doesn't work for this question.
  • Use pandas.DataFrame.groupby on 'date' and 'group', while aggregating .sum on 'hours'
    • The .dt extractor is used to extract only the .date component of the 'date' column.
    • Make certain the 'Date' column of your dataframe is properly formatted as a datetime dtype, with df.Date = pd.to_datetime(df.Date)
  • The grouped dataframe, dfg, must be shaped into the correct form, which can be accomplished with pandas.DataFrame.pivot.
  • The easiest way to stack a bar plot is with pandas.DataFrame.plot.bar and use the stacked parameter.
    • See pandas.DataFrame.plot for all the parameters.
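
As a minimal sketch of the datetime conversion and the .dt extractor mentioned above: the three rows below are made-up stand-ins for the question's data, and the column names 'Date', 'Category' and 'Hours' are assumed from the question text rather than taken from the actual dataframe.

```python
import pandas as pd

# Hypothetical rows shaped like the dataframe described in the question
df = pd.DataFrame({
    'Date': ['2020-11-24 09:30', '2020-11-24 14:00', '2020-11-25 10:15'],
    'Category': ['A', 'B', 'A'],
    'Hours': [2, 3, 1],
})

# Strings must become a datetime dtype before the .dt accessor can be used
df['Date'] = pd.to_datetime(df['Date'])

# .dt.date drops the time component, leaving one value per calendar day
day = df['Date'].dt.date

print(df.dtypes)
print(day.tolist())
```

Grouping on `day` rather than on the full timestamp is what lets all rows from the same calendar day land in the same bar.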

Imports and Data Transformation

import pandas as pd
import matplotlib.pyplot as plt
import random  # for test data
import numpy as np  # for test data

# setup dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'hours': np.random.randint(10, size=(rows)),
        'group': [random.choice(['A', 'B', 'C']) for _ in range(rows)],
        'date': pd.bdate_range('2020-11-24', freq='h', periods=rows).tolist()}
df = pd.DataFrame(data)

# display(df.head())
   hours group                date
0      2     C 2020-11-24 00:00:00
1      4     B 2020-11-24 01:00:00
2      1     C 2020-11-24 02:00:00
3      5     A 2020-11-24 03:00:00
4      2     B 2020-11-24 04:00:00

# use groupby on df
dfg = df.groupby([df.date.dt.date, 'group'])['hours'].sum().reset_index()

# pivot the dataframe into the correct format
dfp = dfg.pivot(index='date', columns='group', values='hours')

# display(dfp.head())
group        A   B   C
date                  
2020-11-24  49  25  29
2020-11-25  62  18  57
2020-11-26  42  77   4
2020-11-27  34  43  17
2020-11-28  28  53  23
  • More succinctly, the groupby and pivot steps can be replaced with .pivot_table, which both reshapes and aggregates
    • index=df.date.dt.date is used so the index doesn't include the time component, since the data for the entire day is being aggregated.
dfp = df.pivot_table(index=df.date.dt.date, columns='group', values='hours', aggfunc='sum')
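
The following is a small check, on made-up data, that the two routes give the same table. It also shows one option not used in the answer: when a group has no rows on a given day, both routes leave a NaN in that cell, and passing fill_value=0 to .pivot_table is one way to get an explicit zero instead.

```python
import pandas as pd

# Hypothetical data: group 'B' has no rows on 2020-11-25
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-11-24 01:00', '2020-11-24 05:00',
                            '2020-11-24 07:00', '2020-11-25 02:00']),
    'group': ['A', 'A', 'B', 'A'],
    'hours': [2, 3, 4, 1],
})

# Route 1: groupby, then pivot
dfg = df.groupby([df.date.dt.date, 'group'])['hours'].sum().reset_index()
via_pivot = dfg.pivot(index='date', columns='group', values='hours')

# Route 2: pivot_table reshapes and aggregates in one call
via_pivot_table = df.pivot_table(index=df.date.dt.date, columns='group',
                                 values='hours', aggfunc='sum')

# Same call, but missing date/group combinations become 0 instead of NaN
filled = df.pivot_table(index=df.date.dt.date, columns='group',
                        values='hours', aggfunc='sum', fill_value=0)

print(via_pivot)
print(via_pivot_table)
print(filled)
```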

Plot

# plot the pivoted dataframe
dfp.plot.bar(stacked=True, figsize=(10, 6), ylabel='Hours', xlabel='Date', title='Sum of Daily Category Hours')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


  • A bar plot draws one bar, with its own tick, for every date, so the plot can become very wide if there are many dates.
  • Consider using pandas.DataFrame.plot.barh, which draws the bars horizontally.
dfp.plot.barh(stacked=True, figsize=(6, 10), title='Sum of Daily Category Hours')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xlabel('Hours')
plt.ylabel('Date')
plt.show()


  • The OP states there is data from 2018 to 2020, which means there could be over 700 days worth of data, which translates to over 700 bars in the bar plot.
  • A standard line plot might be the best option to properly visualize the data.
dfp.plot(figsize=(10, 6))
plt.show()

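
If a stacked bar chart is still preferred for a multi-year range, one possible workaround (an assumption on my part, not something the OP asked for) is to aggregate by month instead of by day, which cuts three years of data down to 36 bars. The sketch below uses randomly generated daily data as a stand-in for the real dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering 2018 through 2020
rng = np.random.default_rng(365)
dates = pd.date_range('2018-01-01', '2020-12-31', freq='D')
df = pd.DataFrame({
    'date': dates,
    'group': rng.choice(['A', 'B', 'C'], size=len(dates)),
    'hours': rng.integers(0, 10, size=len(dates)),
})

# Convert each timestamp to its calendar month, then aggregate per month
month = df['date'].dt.to_period('M')
dfm = df.pivot_table(index=month, columns='group', values='hours',
                     aggfunc='sum', fill_value=0)

print(dfm.head())
```

The resulting `dfm` has the same layout as `dfp`, so it can be passed to the same `.plot.bar(stacked=True, ...)` call used earlier.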