I am looking into an application that needs to detect a delay in receiving video frames and take action when a delay is detected. The delay in receiving video frames is perceived as a video freeze in the render window. The action is the insertion of an IMU frame into the live video while the freeze lasts. The pipelines are as follows:
The Tx and Rx are connected in ad-hoc mode over WiFi, with no other devices on the network. Only video is transmitted; audio is not a concern here.
Tx(iMX6 device):
v4l2src fps-n=30 -> h264encode -> rtph264pay -> rtpbin -> udpsink(port=5000) ->
rtpbin.send_rtcp(port=5001) -> rtpbin.recv_rtcp(port=5002)
Rx(ubuntu PC):
udpsrc(port=5000) -> rtpbin -> rtph264depay -> avdec_h264 -> rtpbin.recv_rtcp(port=5001) ->
rtpbin.send_rtcp(port=5002) -> custom IMU frame insertion plugin -> videosink
Now, as per my application, I intend to detect the delay in receiving frames at the Rx device. The delay can be induced by a number of factors, including:
- congestion
- packet loss
- noise, etc.
Once the delay is detected, I intend to insert an IMU (inertial measurement unit) frame (a custom visualization) in between the live video frames. For example, if every 3rd frame is delayed, the video will look like:
V | V | I | V | V | I | V | V | I | V | .....
where V = a video frame received and I = an IMU frame inserted at the Rx device.
Hence, as per my application requirements, I need to know the timestamp of the video frame sent from the Tx, and use it together with the current time at the Rx device to get the transmission delay:
frame delay = current time at Rx - timestamp of frame at Tx
Since I am working at 30 fps, ideally I should expect to receive video frames at the Rx device every 33 ms. Given that this is WiFi, plus other delays including encoding/decoding, I understand that 33 ms precision is difficult to achieve, and that is perfectly fine for me.
- Since I am using RTP/RTCP, I had a look into WebRTC, but it caters more towards sending SR/RR (network statistics) for only a fraction of the data sent from Tx -> Rx. I also tried the udpsrc timeout feature, which checks whether no packets have arrived at the source for a predefined time and posts a message notifying the timeout. However, this works only if the Tx device stops completely (pipeline stopped with Ctrl+C). If the packets are merely delayed, the timeout does not trigger, since the kernel buffers some old data.
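For reference, this is roughly how I used the timeout (a minimal, illustrative sketch; my real pipeline uses rtpbin and the custom IMU plugin):

```c
#include <gst/gst.h>

/* udpsrc posts a "GstUDPSrcTimeout" element message on the bus when no
 * packets arrive within "timeout" nanoseconds. */
static gboolean
bus_cb (GstBus *bus, GstMessage *msg, gpointer user_data)
{
  if (GST_MESSAGE_TYPE (msg) == GST_MESSAGE_ELEMENT &&
      gst_message_has_name (msg, "GstUDPSrcTimeout"))
    g_print ("no packets for 100 ms -> possible freeze\n");
  return TRUE;
}

int
main (int argc, char **argv)
{
  GMainLoop *loop;
  GstElement *pipe;
  GstBus *bus;

  gst_init (&argc, &argv);
  loop = g_main_loop_new (NULL, FALSE);

  /* Simplified receive pipeline; timeout = 100 ms, given in ns. */
  pipe = gst_parse_launch (
      "udpsrc port=5000 timeout=100000000 "
      "caps=\"application/x-rtp,media=video,clock-rate=90000,"
      "encoding-name=H264\" "
      "! rtph264depay ! avdec_h264 ! autovideosink", NULL);

  bus = gst_element_get_bus (pipe);
  gst_bus_add_watch (bus, bus_cb, NULL);
  gst_object_unref (bus);

  gst_element_set_state (pipe, GST_STATE_PLAYING);
  g_main_loop_run (loop);
  return 0;
}
```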
I have the following questions:
Does it make sense to use the timestamp of each video frame/RTP buffer to detect the delay in receiving frames at the Rx device? What would be a better design for such a use case? Or is considering the timestamp of every frame/buffer too much overhead, so that I should only consider a subset, say every 5th or every 10th frame/buffer? Also, RTP packets do not map one-to-one to frames, which means that for a 30 fps video I can receive more than 30 RTP buffers per second in GStreamer. Considering the worst case possible, where every alternate frame is delayed, the video would have the following sequence:
V | I | V| I | V | I | V | I | V | I | .....
I understand that the precision needed to handle every alternate frame can be difficult to achieve, so I am targeting detection and insertion of the IMU frame within 66 ms at the latest. The switching between live video frames and inserted frames is also a concern. I use the OpenGL plugins for the IMU data manipulation.
Which timestamps should I consider at the Rx device? To calculate the delay, I need a common reference between the Tx and Rx devices, which to my knowledge I do not have. I can access the PTS and DTS of the RTP buffers, but with no common reference available I could not use them to detect the delay. Is there any other way I could do this?
My caps have the following parameters (only a few shown):
caps = application/x-rtp, clock-rate=90000, timestamp-offset=2392035930, seqnum-offset=23406
Can these be used to calculate a common reference at Tx and Rx? I am not sure I understand these numbers or how to use them at the Rx device to derive a reference. Any pointers on understanding these parameters?
- Are there any other possible approaches for such an application? My idea above could be too impractical, and I am open to suggestions for tackling this issue.
You can get an absolute NTP time from RTP/RTCP; check the RTP RFC (RFC 3550) and read up on how synchronization between streams is done. Essentially, each audio and video stream knows nothing about the others, but each stream has its own RTP time base and announces over RTCP what this time base represents in NTP time.
So, for each frame you can get its NTP time representation. Assuming your devices are correctly synced to NTP, you should be able to compare a received frame's NTP time with the receiver's current NTP time, and that should give you, roughly, the delay between the two.
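(Note that the timestamp-offset in your caps is just the random initial RTP timestamp the sender picked per RFC 3550, and seqnum-offset the initial sequence number; by themselves they carry no wall-clock reference. The anchor pair of NTP time and RTP timestamp comes from the RTCP sender reports.) A rough sketch of reading that anchor in GStreamer, with a buffer probe on the RTCP stream, for example on the src pad of the udpsrc that feeds rtpbin.recv_rtcp_sink_0; untested, single-stream, with the 90000 clock rate taken from your caps:

```c
#include <gst/gst.h>
#include <gst/rtcp/gstrtcpbuffer.h>

/* Seconds between the NTP epoch (1900) and the Unix epoch (1970). */
#define NTP_UNIX_OFFSET_SECS G_GUINT64_CONSTANT (2208988800)

/* Latest (NTP, RTP) anchor pair from the sender's RTCP SR.
 * Globals for brevity; guard with a mutex in real code. */
static guint64 sr_ntp_ns;  /* SR NTP time, converted to Unix-based ns */
static guint32 sr_rtp_ts;  /* RTP timestamp corresponding to sr_ntp_ns */

static GstPadProbeReturn
rtcp_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  GstRTCPBuffer rtcp = GST_RTCP_BUFFER_INIT;
  GstRTCPPacket pkt;
  gboolean more;

  if (!gst_rtcp_buffer_map (buf, GST_MAP_READ, &rtcp))
    return GST_PAD_PROBE_OK;

  for (more = gst_rtcp_buffer_get_first_packet (&rtcp, &pkt); more;
       more = gst_rtcp_packet_move_to_next (&pkt)) {
    if (gst_rtcp_packet_get_type (&pkt) == GST_RTCP_TYPE_SR) {
      guint32 ssrc, rtptime, pkts, octets;
      guint64 ntptime;  /* 32.32 fixed point, seconds since 1900 */

      gst_rtcp_packet_sr_get_sender_info (&pkt, &ssrc, &ntptime,
          &rtptime, &pkts, &octets);

      /* 32.32 NTP -> nanoseconds, then shift from 1900 to the Unix epoch. */
      sr_ntp_ns = gst_util_uint64_scale (ntptime, GST_SECOND,
          G_GUINT64_CONSTANT (1) << 32) - NTP_UNIX_OFFSET_SECS * GST_SECOND;
      sr_rtp_ts = rtptime;
    }
  }
  gst_rtcp_buffer_unmap (&rtcp);
  return GST_PAD_PROBE_OK;
}

/* Map a received RTP timestamp to the sender's NTP time via the SR anchor
 * and compare it against the receiver's own clock (clock-rate 90000). */
static gint64
delay_ns_for_rtp_ts (guint32 rtp_ts)
{
  gint32 ts_diff = (gint32) (rtp_ts - sr_rtp_ts);  /* wrap-safe signed diff */
  gint64 frame_ntp_ns = (gint64) sr_ntp_ns +
      (gint64) ts_diff * (gint64) GST_SECOND / 90000;
  gint64 now_ns = g_get_real_time () * 1000;  /* receiver Unix time in ns */
  return now_ns - frame_ntp_ns;
}
```

The probe would be attached with gst_pad_add_probe (pad, GST_PAD_PROBE_TYPE_BUFFER, rtcp_probe, NULL, NULL) on, say, the src pad of the udpsrc listening on port 5001. This only yields a meaningful absolute delay if both devices are synced to NTP; otherwise the result includes the unknown clock offset between Tx and Rx.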
If you have multiple packets per frame, that does not make much of a difference: all packets belonging to one frame should carry the same RTP timestamp. So you probably want to catch the first one, and ignore any further packets with a timestamp you have already seen.
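A sketch of that dedup on the RTP path (again untested; delay_ns_for_rtp_ts() is the helper from the previous sketch, and the 66 ms threshold is the target from the question):

```c
#include <gst/gst.h>
#include <gst/rtp/gstrtpbuffer.h>

static guint32 last_rtp_ts;   /* timestamp of the frame last measured */
static gboolean have_last_ts;

static GstPadProbeReturn
rtp_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  GstRTPBuffer rtp = GST_RTP_BUFFER_INIT;
  guint32 ts;

  if (!gst_rtp_buffer_map (buf, GST_MAP_READ, &rtp))
    return GST_PAD_PROBE_OK;
  ts = gst_rtp_buffer_get_timestamp (&rtp);
  gst_rtp_buffer_unmap (&rtp);

  /* All packets of one frame share a timestamp: measure only the first
   * packet per frame, ignore the rest. */
  if (!have_last_ts || ts != last_rtp_ts) {
    gint64 delay;

    have_last_ts = TRUE;
    last_rtp_ts = ts;
    delay = delay_ns_for_rtp_ts (ts);  /* helper from the sketch above */
    if (delay > 66 * GST_MSECOND)
      g_print ("frame late by %" G_GINT64_FORMAT " ms -> insert IMU frame\n",
          delay / GST_MSECOND);
  }
  return GST_PAD_PROBE_OK;
}
```

This probe would sit on the RTP stream, e.g. the src pad of the udpsrc on port 5000, again attached via gst_pad_add_probe with GST_PAD_PROBE_TYPE_BUFFER.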
How accurate that actually is, I don't know. Video streams usually have high-peak frames (key frames), but sending is usually smoothed out to prevent packet loss. That will introduce quite a lot of jitter into the kind of measurement you are trying to make.