Plotting timeseries data with multiple categories


I have a dataset from a production line, which is formatted as time series data. There is a batch column, which indicates the name of the batch (str), and there is a phase column which indicates the phase of the production (str). I am working with the datetime as the index of the pandas DataFrame.

I want to plot this data on a time series graph, overlaying the data from each phase and distinguishing each batch (i.e. by colour), with each process variable (i.e. temp1, temp2, press1, press2) on a different axis (as per the diagram). How can this be done?

EDIT: for clarity, I need the trends to be plotted against a datetime baseline, otherwise they will not overlay.

[Image: diagram of the desired plot layout]

Example of the dataset:

| datetime | temp1 | temp2 | press1 | press2 | batch | phase |
|:---- |:--: | :--: | :--: | :--: |:--: |:--: |
| 2023-02-03 15:45:34 | 34.45 | 23.34 | 13.23 | 45.5 | 'D' | '10-Wait' |
| ... | ... | ... | ... | ... | 'D' | ... |
| 2023-02-03 15:55:34 | 36.55 | 22.14 | 18.23 | 78.5 | 'D' | '20-Initialise' |

To create a dataset similar to mine, you can use the following code:

import numpy as np
import pandas as pd
import datetime  

date = pd.date_range(start='1/1/2023', end='10/06/2023', freq=datetime.timedelta(seconds=30))
tags = ['temp1','temp2','press1','press2']
data=np.random.rand(len(date),len(tags))
df=pd.DataFrame(data,columns=tags).set_index(date)

batches = ['A','B','C','D','E','F','G']
n=len(batches)
period_start = pd.to_datetime('1/1/2023')
period_end = pd.to_datetime('10/06/2023')
batch_start = (pd.to_timedelta(np.random.rand(n) * ((period_end - period_start).days + 1), unit='D') + period_start)
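batch_end = (batch_start + pd.to_timedelta(8,unit='h'))  # lowercase 'h': 'H' is deprecated in recent pandas

# Pass the batch names as a flat list so the transposed frame gets a plain index
df_batches = pd.DataFrame(data=[batch_start,batch_end],columns=batches,index=['start','end']).T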

for item in batches:
    start_time = df_batches['start'][item]
    end_time = df_batches['end'][item]
    df.loc[((df.index>=start_time)&(df.index<=end_time)), 'batch'] = item
df.dropna(subset=['batch'],inplace=True)

df['phase']=''
phases = ['10-Wait','20-Initialise','30-Warm','40-Running']

for batch in batches:
    wait_len = int(len(df[df['batch']==batch].index)*0.2)
    init_len = int(len(df[df['batch']==batch].index)*0.4)
    warm_len = int(len(df[df['batch']==batch].index)*0.6)
    run_len = int(len(df[df['batch']==batch].index))
   
    wait_start = df[df['batch']==batch].index[0]
    wait_end = df[df['batch']==batch].index[wait_len]
    init_end = df[df['batch']==batch].index[init_len]
    warm_end = df[df['batch']==batch].index[warm_len]
    run_end = df[df['batch']==batch].index[-1]  
 
    # Assign through a single .loc call; chained df['phase'].loc[...] may write to a copy
    df.loc[wait_start:wait_end, 'phase'] = phases[0]
    df.loc[wait_end:init_end, 'phase'] = phases[1]
    df.loc[init_end:warm_end, 'phase'] = phases[2]
    df.loc[warm_end:run_end, 'phase'] = phases[3]

df.to_csv('stackoverflowqn.csv')

There are 4 answers

Answer by hamslice (best answer)

Credit to @AvishWagde, who definitely broke the back of the problem. The one missing ingredient was having the x-axis of each plot baselined against zero.

The solution to baselining these plots was to create a new Timedelta column which starts from 00:00:00 and goes upwards, in increments of 00:00:30.

In Avish's code he uses:

for batch in batches:
    batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
    axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='o')

However, since the DataFrame index is a datetime, plotting it on the x-axis places each batch at its own calendar time, so the curves do not overlay and cannot be compared. As stated in the question, they need to be plotted against a common baseline. Using a Timedelta on the x-axis allows the process data in each phase to be compared; here 00:00:00 is taken to be the start of each phase. This dataset was recorded at 30 s intervals, so the baseline is built with pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series(), which converts the Timedelta index to a Series before it is set as the new index. This results in the following slight change:

for batch in batches:
    batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)] 
    baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
    batch_data = batch_data.set_index(baseline_time)
    axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')
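The baseline above assumes an exact 30 s sampling interval with no gaps. As a minimal sketch of an alternative (not part of the original answer), the elapsed time can instead be derived from the timestamps themselves by subtracting the first timestamp of each slice. The helper name elapsed_minutes and the toy data below are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

def elapsed_minutes(frame: pd.DataFrame) -> pd.Index:
    """Minutes elapsed since the first row of a frame with a DatetimeIndex (hypothetical helper)."""
    delta = frame.index - frame.index[0]   # TimedeltaIndex starting at zero
    return delta.total_seconds() / 60.0    # plain floats, easy to plot and label

# Toy slice standing in for one batch/phase selection (made-up values, 30 s sampling)
idx = pd.date_range("2023-02-03 15:45:30", periods=5, freq=pd.Timedelta(seconds=30))
batch_data = pd.DataFrame({"temp1": np.linspace(34.0, 36.0, 5)}, index=idx)

x = elapsed_minutes(batch_data)
print(list(x))
```

Inside the plotting loop this could be used as axes[row, col].plot(elapsed_minutes(batch_data), batch_data[variable], label=batch), which also gives an x-axis in minutes rather than raw Timedelta values.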

For the full working code:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('stackoverflowqn.csv',index_col=[0])
df.index = pd.to_datetime(df.index)

sns.set_style("whitegrid")

phases = df['phase'].unique()
batches = df['batch'].unique()
variables = ['temp1', 'temp2', 'press1', 'press2']  # List of process variables

num_cols = 2  # Number of columns for the subplot grid
num_rows = (len(variables) + num_cols - 1) // num_cols

for phase in phases:
    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 6 * num_rows))
    plt.subplots_adjust(hspace=0.5)
    plt.suptitle(phase, y=1.02)

    for idx, variable in enumerate(variables):
        row = idx // num_cols
        col = idx % num_cols

        for batch in batches:
            batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
            baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
            batch_data = batch_data.set_index(baseline_time)
            axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')

        axes[row, col].set_title(variable)
        axes[row, col].set_xlabel('time')
        axes[row, col].set_ylabel(variable)
        axes[row, col].legend()

    plt.tight_layout()
    plt.show()

[Figure: output from the working code, Wait phase (10), 2 x variables]
[Figure: output from the working code, Initialise phase (20), 2 x variables]
[Figure: output from the working code, Warm phase (30), 2 x variables]

Answer by Avish Wagde

I created my own data like the one you described and tried to show how to plot it. I think this will help you, mate!

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame for illustration purposes
data = {
    'batch': ['Batch1', 'Batch1', 'Batch1', 'Batch2', 'Batch2', 'Batch2'],
    'phase': ['Phase1', 'Phase2', 'Phase3', 'Phase1', 'Phase2', 'Phase3'],
    'temp1': [100, 110, 105, 95, 105, 98],
    'temp2': [90, 95, 92, 85, 88, 87],
    'press1': [50, 52, 51, 48, 49, 47],
    'press2': [30, 31, 29, 28, 30, 29]
}

df = pd.DataFrame(data)
df['datetime'] = pd.date_range(start='2023-01-01', periods=len(df), freq='D')
df.set_index('datetime', inplace=True)

sns.set_style("whitegrid")

phases = df['phase'].unique()
batches = df['batch'].unique()
variables = ['temp1', 'temp2', 'press1', 'press2']  # List of process variables

num_cols = 2  # Number of columns for the subplot grid
num_rows = (len(variables) + num_cols - 1) // num_cols

for phase in phases:
    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 6 * num_rows))
    plt.subplots_adjust(hspace=0.5)
    plt.suptitle(phase, y=1.02)

    for idx, variable in enumerate(variables):
        row = idx // num_cols
        col = idx % num_cols

        for batch in batches:
            batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
            axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='o')

        axes[row, col].set_title(variable)
        axes[row, col].set_xlabel('Datetime')
        axes[row, col].set_ylabel(variable)
        axes[row, col].legend()

    plt.tight_layout()
    plt.show()


sample output

Answer by Quang Hoang

You can use seaborn's FacetGrid like this:

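import matplotlib.pyplot as plt
import seaborn as sns

# Move the datetime index into a 'time' column and reshape to long format:
# one row per (time, batch, phase, variable) with the reading in 'value'
df = df.rename_axis(index='time').reset_index().melt(['time','batch','phase'])

# One FacetGrid per phase: a panel per variable, a coloured line per batch
for p, data in df.groupby('phase', group_keys=False):
    print(p)
    fg = sns.FacetGrid(data=data, col='variable', col_wrap=2, hue='batch')
    fg.map(sns.lineplot, 'time','value')

    plt.show()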

This produces one figure per phase, with one panel per variable and one coloured line per batch.
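To make the reshaping step concrete, here is a small sketch of what the melt call does to a wide frame with the question's column layout. The values are made up, and only two of the four process variables are included to keep the output short.

```python
import pandas as pd

# Toy wide-format frame with the same column layout as the question (made-up values)
idx = pd.to_datetime(["2023-02-03 15:45:34", "2023-02-03 15:46:04"])
wide = pd.DataFrame(
    {
        "temp1": [34.45, 34.60],
        "press1": [13.23, 13.40],
        "batch": ["D", "D"],
        "phase": ["10-Wait", "10-Wait"],
    },
    index=idx,
)

# Same reshaping as in the answer: index -> 'time' column, then wide -> long
long_df = (
    wide.rename_axis(index="time")
    .reset_index()
    .melt(id_vars=["time", "batch", "phase"])
)
print(long_df)
```

Each process variable becomes a set of rows labelled in the 'variable' column, with its reading in 'value', which is the layout FacetGrid needs for col='variable'.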

Answer by Yilmaz

Breaking down the process of plotting a graph step by step can make the task much more manageable.

First, load the data:

import pandas as pd

raw_data=pd.read_csv("stackoverflowqn.csv",index_col=[0])

Its index holds the timestamps. Reset the index and create a new column "date" with datetime type:

data=raw_data.reset_index()
data.columns=['date', 'temp1', 'temp2', 'press1', 'press2', 'batch', 'phase']
data["date"]=pd.to_datetime(data["date"])

Create the groupby object and get the group names in a list:

gbo=data.groupby("phase",as_index=False)
keys=list(gbo.groups.keys())

After that, create a DataFrame for each group:

list_1=gbo.groups[keys[0]]
frame_1=data[data.index.isin(list_1)]

list_2=gbo.groups[keys[1]]
frame_2=data[data.index.isin(list_2)]

list_3=gbo.groups[keys[2]]
frame_3=data[data.index.isin(list_3)]

list_4=gbo.groups[keys[3]]
frame_4=data[data.index.isin(list_4)]
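The four near-identical blocks above can also be written as a single dictionary comprehension over the groupby object. This is only a sketch on made-up data; the name frames is hypothetical and not part of the original answer.

```python
import pandas as pd

# Toy data with the same 'phase' and 'batch' columns as the question (made-up values)
data = pd.DataFrame({
    "date": pd.date_range("2023-02-03 15:45:00", periods=6, freq=pd.Timedelta(seconds=30)),
    "temp1": [34.4, 34.6, 35.0, 35.3, 35.9, 36.5],
    "batch": ["D"] * 6,
    "phase": ["10-Wait", "10-Wait", "20-Initialise", "20-Initialise", "30-Warm", "40-Running"],
})

# One independent DataFrame per phase, keyed by the phase name
frames = {phase: group.copy() for phase, group in data.groupby("phase")}
print(list(frames))
```

With this layout, frames["10-Wait"] plays the role of frame_1, and the .copy() means columns can be added to each frame later without pandas warning about writing to a slice.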

Setting the date as one of the axes will not look great. Maybe you should add a minute column to one of the frames:

# assign returns a new frame, which avoids writing to a slice of data
frame_1=frame_1.assign(minute=frame_1["date"].dt.minute)
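Note that .dt.minute returns the minute of the hour (0 to 59), so it wraps around every hour and does not line batches up with each other. As a sketch of one alternative, assuming the goal is to overlay batches from a common zero, the minutes elapsed since each batch's first sample can be computed with a grouped transform. The column name elapsed_min and the data are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-02-03 15:59:00", "2023-02-03 16:00:00", "2023-02-03 16:01:00",  # batch D
        "2023-03-10 08:30:00", "2023-03-10 08:31:00", "2023-03-10 08:32:00",  # batch E
    ]),
    "temp1": [34.4, 34.6, 35.0, 33.9, 34.2, 34.8],
    "batch": ["D", "D", "D", "E", "E", "E"],
})

# Minute of the hour: wraps from 59 to 0 and differs between batches
frame["minute"] = frame["date"].dt.minute

# Minutes elapsed since each batch's first sample: every batch starts at 0
start = frame.groupby("batch")["date"].transform("min")
frame["elapsed_min"] = (frame["date"] - start).dt.total_seconds() / 60

print(frame[["batch", "minute", "elapsed_min"]])
```

Using elapsed_min as the x column in sns.lineplot would then place both batches on the same 0, 1, 2 baseline.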

Now you have 4 new data frames; you just have to plot them. You choose the x and y axes.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16,10))
plt.suptitle("Main Figure",fontsize=24)
# 2 x 2 grid; this is the first plot
plt.subplot(2,2,1)
# By default lineplot uses estimator=mean; you might need to change it.
# By default, seaborn line plots show confidence intervals for the dataset.
# Remove them by setting errorbar=None.
sns.lineplot(data=frame_1,x="minute",y="temp1",hue="batch",errorbar=None).set(title="MINUTE-TEMP1")
plt.subplot(2,2,2)
sns.lineplot(data=frame_2,x="temp2",y="press1",hue="batch",errorbar=None).set(title="Title_2")

plt.subplot(2,2,3)
sns.lineplot(data=frame_3,x="temp1",y="press1",hue="batch",errorbar=None).set(title="Title_3")

plt.subplot(2,2,4)
# The fourth panel uses frame_4 (the original repeated frame_3 here)
sns.lineplot(data=frame_4,x="temp1",y="press1",hue="batch",errorbar=None).set(title="Title_4")

plt.show()

The result is a 2 x 2 grid of plots; you decide which columns go on the x and y axes.
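The remark about estimator=mean matters here because a minute column built from 30 s data has more than one sample per x value. The sketch below uses plain pandas on made-up numbers to show the aggregation a mean estimator performs before a line is drawn; it is an illustration of the idea, not seaborn's internal code.

```python
import pandas as pd

# Two samples per minute per batch, as with 30 s data bucketed by minute (made-up values)
frame = pd.DataFrame({
    "minute": [0, 0, 1, 1, 0, 0, 1, 1],
    "temp1": [34.0, 35.0, 36.0, 38.0, 30.0, 31.0, 32.0, 33.0],
    "batch": ["D", "D", "D", "D", "E", "E", "E", "E"],
})

# A mean estimator collapses repeated x values to one point per batch and minute
collapsed = frame.groupby(["batch", "minute"])["temp1"].mean()
print(collapsed)
```

If every observation should be drawn instead, seaborn's lineplot accepts estimator=None, though whether that suits a given dataset is a judgment call.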