I have an example test.csv file which is read into a pandas dataframe.
It has 20 rows and 7 columns.
The csv file captures information about SIP calls but the SIP messages for each call are incorrectly ordered. There are 2 SIP calls in this example and separated by an empty row.
The problem I am trying to solve is to reorder the sip messages correctly.
>>> dataframe = pd.read_csv('test.csv')
>>> print(dataframe)
frame.number frame.time ip.src ip.dst sip.Call-ID sip.Method sip.Status-Code
0 25355.0 May 9, 2019 15:57:01.433623000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 ACK NaN
1 25148.0 May 9, 2019 15:57:01.363890000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 200.0
2 18371.0 May 9, 2019 15:56:59.411452000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 INVITE NaN
3 18403.0 May 9, 2019 15:56:59.421261000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 100.0
4 25134.0 May 9, 2019 15:57:01.360769000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 183.0
5 20875.0 May 9, 2019 15:57:00.064251000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 180.0
6 19244.0 May 9, 2019 15:56:59.694785000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 100.0
7 19227.0 May 9, 2019 15:56:59.690747000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 INVITE NaN
8 19022.0 May 9, 2019 15:56:59.620685000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 407.0
9 19221.0 May 9, 2019 15:56:59.689779000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 ACK NaN
10 NaN NaN NaN NaN NaN NaN NaN
11 25356.0 May 9, 2019 15:57:01.433623000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 ACK NaN
12 25149.0 May 9, 2019 15:57:01.363890000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 200.0
13 18372.0 May 9, 2019 15:56:59.411452000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 INVITE NaN
14 18404.0 May 9, 2019 15:56:59.421261000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 100.0
15 25135.0 May 9, 2019 15:57:01.360769000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 183.0
16 20876.0 May 9, 2019 15:57:00.064251000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 180.0
17 19245.0 May 9, 2019 15:56:59.694785000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 100.0
18 19228.0 May 9, 2019 15:56:59.690747000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 INVITE NaN
19 19023.0 May 9, 2019 15:56:59.620685000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 407.0
20 19222.0 May 9, 2019 15:56:59.689779000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 ACK NaN
After successfully re-ordering the rows of the dataframe, I'm going to insert a new column and classify the calls.
dataframe.insert(0, "Classified", " ")
I can't classify the calls correctly if the SIP messages aren't in order.
I've checked out pandas sort_index() and sort_values() but this only helps with one part of the logic to solve this problem.
>>> dataframe.sort_values(by=['sip.Call-ID'], inplace=True)
This will sort the csv file based on the sip.Call-ID column. These values are unique for every SIP call. So this allows the messages to be grouped together.
The values from frame.number should help in solving this problem. However, they can only be sorted based on each unique sip-call-id and not as a whole, otherwise we'll have overlapping sip calls. The pseudo code logic in my mind is as follows:
for each unique sip-call-id in dataframe:
store it's related frame.number
check if the next frame.number is smaller/bigger
reorder rows based on condition above
The difficulty I'm having is accessing the indices of each row and knowing how to re-order them based on each unique sip call id and applying that re-order to the dataframe.
>>> frame_values = dataframe['frame.number'].values
>>> print(frame_values)
[25355. 25148. 18371. 18403. 25134. 20875. 19244. 19227. 19022. 19221.
nan 25356. 25149. 18372. 18404. 25135. 20876. 19245. 19228. 19023.
19222.]
The expected results are seen below. The frame numbers are in ascending order for each unique sip-call-id and the relevant SIP messages are now in order as-well. This is further clarified by the frame time, as they are also in ascending order. This means that the SIP messages are definitely in order.
By relevant SIP messages, I mean sip.Method & sip.Status-Code columns are now in order.
44.44.44.44 55.55.55.55
| |
| INVITE | First SIP Message: INVITE Method
|----------------------->|
| 100 trying | Second SIP Message: 100 Status Code
|<-----------------------|
| 407 Proxy Auth | Third SIP Message: 407 Status Code
|<-----------------------|
| |
| ACK | Fourth SIP Message: ACK Method
|----------------------->|
| INVITE | Fifth SIP Message: INVITE Method
|----------------------->|
| |
| 100 trying | Sixth SIP Message: 100 Status Code
|<-----------------------|
| 180 ringing | Seventh SIP Message: 180 Status Code
|<-----------------------|
| 183 session | Eight SIP Message: 183 Status Code
|<-----------------------|
| 200 OK | Ninth SIP Message: 200 Status Code
|<-----------------------|
| ACK | Tenth SIP Message: ACK Method
|----------------------->|
>>> print(dataframe)
frame.number frame.time ip.src ip.dst sip.Call-ID sip.Method sip.Status-Code
0 18371.0 May 9, 2019 15:56:59.411452000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 INVITE NaN
1 18403.0 May 9, 2019 15:56:59.421261000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 100.0
2 19022.0 May 9, 2019 15:56:59.620685000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 407.0
3 19221.0 May 9, 2019 15:56:59.689779000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 ACK NaN
4 19227.0 May 9, 2019 15:56:59.690747000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 INVITE NaN
5 19244.0 May 9, 2019 15:56:59.694785000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 100.0
6 20875.0 May 9, 2019 15:57:00.064251000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 180.0
7 25134.0 May 9, 2019 15:57:01.360769000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 183.0
8 25148.0 May 9, 2019 15:57:01.363890000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e25a5083a8f2c85 NaN 200.0
9 25355.0 May 9, 2019 15:57:01.433623000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e25a5083a8f2c85 ACK NaN
10 NaN NaN NaN NaN NaN NaN NaN
11 18372.0 May 9, 2019 15:56:59.411452000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 INVITE NaN
12 18404.0 May 9, 2019 15:56:59.421261000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 100.0
13 19023.0 May 9, 2019 15:56:59.620685000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 407.0
14 19222.0 May 9, 2019 15:56:59.689779000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 ACK NaN
15 19228.0 May 9, 2019 15:56:59.690747000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 INVITE NaN
16 19245.0 May 9, 2019 15:56:59.694785000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 100.0
17 20876.0 May 9, 2019 15:57:00.064251000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 180.0
18 25135.0 May 9, 2019 15:57:01.360769000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 183.0
19 25149.0 May 9, 2019 15:57:01.363890000 IST 55.55.55.55 44.44.44.44 0018506d493d00005e234fs23osd9212 NaN 200.0
20 25356.0 May 9, 2019 15:57:01.433623000 IST 44.44.44.44 55.55.55.55 0018506d493d00005e234fs23osd9212 ACK NaN