Context:
I'm building an HTML5 video player with adaptive streaming based on the Media Source Extensions (MSE) API. I'm using MP4.
Problem:
I have two versions of the same video (let's say high and low quality) and I want to be able to switch between versions with very little delay. The issue is that when changing version, I need a fragment that starts with a key frame, and having key frames very often in the video is very bad for bandwidth.
I'm looking for a way to send a fragment that starts with a key frame when the user changes version, and a fragment with no key frames otherwise. (I'm aware of a bug in Chromium about fragments with no key frame, but it's about to be fixed, so let's ignore that for now.)
I thought of duplicating each version into two streams: one with a lot of key frames and another with none (except the first frame, obviously), and then only using the stream with key frames when switching video version. Something that would look like this:
// A mark on the line above a frame means it is a key frame; a frame with no mark is a normal frame.
// A fragment has 4 frames.

           *
Stream A.1 **** **** **** **** **** **** ****   // version A with no key frames
           *    *    *    *    *    *    *
Stream A.2 **** **** **** **** **** **** ****   // version A with key frames
                                                // at the beginning of each fragment
           .
Stream B.1 .... .... .... .... .... .... ....
           .    .    .    .    .    .    .
Stream B.2 .... .... .... .... .... .... ....

           *         .
A -> B     **** **** .... .... .... .... ....
from       A.1  A.2  B.1  B.2  B.2  B.2  B.2
So each frame is either a key frame, or a normal frame whose predecessor can be decoded successfully. And this would limit the number of key frames sent over the wire to a minimum.
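To make the selection rule concrete, here is a minimal sketch of it in TypeScript. The names `planFragments`, `keyed` and `plain` are made up for illustration (they are not part of MSE or any library): `keyed` stands for the copy of a fragment that starts with a key frame, `plain` for the copy that has none. The sketch only decides which copy to fetch for each fragment; it says nothing about whether the browser will accept the result.

```typescript
// Hypothetical labels for the two copies of every fragment:
// "keyed" starts with a key frame, "plain" contains no key frame.
type Variant = "keyed" | "plain";

interface FragmentChoice {
  index: number;    // fragment position in the timeline
  version: string;  // which version of the video (e.g. "A" or "B")
  variant: Variant; // which copy of that fragment to request
}

// versions[i] is the version wanted for fragment i.
// A fragment needs a leading key frame only when it is the first one
// or when its version differs from the previous fragment's version.
function planFragments(versions: string[]): FragmentChoice[] {
  return versions.map((version, index): FragmentChoice => {
    const switched = index === 0 || versions[index - 1] !== version;
    const variant: Variant = switched ? "keyed" : "plain";
    return { index, version, variant };
  });
}

// Same scenario as the A -> B row of the diagram: two fragments of A, then B.
const plan = planFragments(["A", "A", "B", "B", "B", "B", "B"]);
console.log(plan.map((c) => `${c.version}:${c.variant}`).join(" "));
```

With this rule only two of the seven fragments in the example carry a key frame, which is the bandwidth saving I'm after.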
But hey! Switching from A1 to A2 is understood by the browser as changing the video stream and doesn't work, since A2 doesn't start with a key frame.
Does anyone have a clever idea of how such a result could be achieved? I'm currently thinking of rewriting the moov and moof atoms in the client to trick the player into thinking everything is as it expects. But I don't know much about it...
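For what it's worth, any client-side rewriting would first have to locate those atoms in the downloaded bytes. Below is a rough sketch of that first step only, in TypeScript, assuming the standard ISO BMFF box header (a 32-bit big-endian size, a 4-character type, size 1 meaning a 64-bit size follows, size 0 meaning the box runs to the end of the data). `listTopLevelBoxes` and `makeBox` are hypothetical helper names, and the sample bytes are fabricated for the demo. It does not modify anything, and I don't know whether rewriting what it finds would actually satisfy the browser.

```typescript
interface Box {
  type: string;       // 4-character box type, e.g. "moof" or "mdat"
  start: number;      // byte offset of the box within the input
  size: number;       // total box size in bytes, header included
  headerSize: number; // 8, or 16 when a 64-bit size is used
}

// Walks the top-level boxes of an MP4 segment without descending into them.
function listTopLevelBoxes(bytes: Uint8Array): Box[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const boxes: Box[] = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    let size = view.getUint32(offset); // DataView reads big-endian by default
    const type = String.fromCharCode(
      view.getUint8(offset + 4),
      view.getUint8(offset + 5),
      view.getUint8(offset + 6),
      view.getUint8(offset + 7),
    );
    let headerSize = 8;
    if (size === 1) {
      // A 64-bit "largesize" follows the type field.
      if (offset + 16 > view.byteLength) break;
      size = view.getUint32(offset + 8) * 2 ** 32 + view.getUint32(offset + 12);
      headerSize = 16;
    } else if (size === 0) {
      // The box extends to the end of the data.
      size = view.byteLength - offset;
    }
    // Stop on a malformed or truncated box instead of looping forever.
    if (size < headerSize || offset + size > view.byteLength) break;
    boxes.push({ type, start: offset, size, headerSize });
    offset += size;
  }
  return boxes;
}

// Demo helper (hypothetical): builds a box with a zero-filled payload.
function makeBox(type: string, payloadLength: number): Uint8Array {
  const box = new Uint8Array(8 + payloadLength);
  new DataView(box.buffer).setUint32(0, box.length);
  for (let i = 0; i < 4; i++) box[4 + i] = type.charCodeAt(i);
  return box;
}

// Fabricated segment: a "moof" box followed by an "mdat" box.
const moof = makeBox("moof", 4);
const mdat = makeBox("mdat", 2);
const segment = new Uint8Array(moof.length + mdat.length);
segment.set(moof, 0);
segment.set(mdat, moof.length);
console.log(listTopLevelBoxes(segment));
```

The fields I would want to touch live inside moof, so a real attempt would also need to recurse into the container boxes, which this sketch leaves out.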
Motivation:
I'm working on a 360 player. 360 is hard because there's a big part of the video that is streamed but not shown, which means that with constrained bandwidth the part of the video that is shown is of much lower quality than what people are used to. There are tools and techniques out there to generate several versions of the video that are each centered on a different view direction, and then the player decides which version to stream at runtime.
Since the user can change view direction at any time, it is very important to be able to react quickly to such a change, much more than it would be for bitrate adaptation. And since the goal of this thing is to save bandwidth, it would be bad to start by adding lots and lots of keyframes!
Also, since iOS Safari doesn't support inline video, which is key to a 360 player, I'm fine relying on MSE, which iOS Safari doesn't support either. (Seriously, what are those guys doing?)
Every fragment needs to start with a keyframe so that switching can happen correctly. To make this work, your keyframe interval should divide the fragment duration evenly; for example, correct combinations are: