IDirectXVideoDecoder performance


I am trying to understand some of the nuances of IDirectXVideoDecoder. CAVEAT: The conclusions stated below are not based on the DirectX docs or any other official source, but are my own observations and understandings. That said...

In normal use, IDirectXVideoDecoder is easily fast enough to process frames at any sensible frame rate. However, if you aren't rendering the frames based on timecodes and instead are going "as fast as possible," then you eventually run into a bottleneck in the decoder and IDirectXVideoDecoder::BeginFrame starts returning E_PENDING.

Apparently at any given time, a system can only have X frames active in the decoder. Attempting to submit X + 1 gives you this error until one of the previous frames completes. On my (somewhat older) box, X == 4. On my newer box, X == 8.

Which brings us to my first question:

Q1: How do I find out how many simultaneous decoding operations a system supports? What property/attribute describes this?
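Absent a documented property, the limit can at least be measured. The following is a minimal sketch of such a probe, not a DirectX API: `Status`, `probeInFlightLimit`, and `FakeDecoder` are hypothetical names, and the `submit` callback stands in for whatever wraps BeginFrame and the rest of the per-frame calls for a given surface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Stand-ins for the HRESULT values in the post (S_OK / E_PENDING / other failure).
enum class Status { Ok, Pending, Failed };

// Counts how many back-to-back submissions are accepted before the first
// non-Ok result. `submit(i)` is a hypothetical callback for surface i;
// `maxProbe` caps how many surfaces are tried.
std::size_t probeInFlightLimit(const std::function<Status(std::size_t)>& submit,
                               std::size_t maxProbe) {
    assert(maxProbe > 0);
    std::size_t accepted = 0;
    for (std::size_t i = 0; i < maxProbe; ++i) {
        if (submit(i) != Status::Ok) break;
        ++accepted;
    }
    return accepted;
}

// Fake decoder used only to exercise the probe: it accepts `capacity`
// frames and then reports Pending, mimicking the behavior described above.
struct FakeDecoder {
    explicit FakeDecoder(std::size_t c) : capacity(c), busy(0) {}
    Status begin(std::size_t /*surfaceIndex*/) {
        if (busy >= capacity) return Status::Pending;
        ++busy;
        return Status::Ok;
    }
    std::size_t capacity;
    std::size_t busy;
};
```

Against real hardware the result is only an estimate: if earlier frames finish while the probe is still submitting, the count overshoots, and it may simply reach `maxProbe`.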

Then there's the question of what to do when you hit this error. I can think of 3 different approaches, but they all have drawbacks:

1) Just do a loop waiting for a decoder to free up:

do {
  hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
} while(hr == E_PENDING);

On the plus side, this approach gives the fastest throughput. On the minus side, this causes a massive amount of CPU time to get burned waiting for a decoder to free up (>93% of my execution time gets spent here).

2) Do a loop, and add a Sleep:

do {
  hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
  if (hr == E_PENDING)
     Sleep(1);
} while(hr == E_PENDING);

On the plus side, this significantly drops the CPU utilization. But on the minus side, it ends up slowing down the total throughput.

In trying to figure out why it's slowing things down, I made a few observations:

  • Normal time to process a frame on my system is ~4 milliseconds.
  • Sleep(1) can sleep for as much as 8 milliseconds, even when there are CPUs available to run on.
  • Frames sent to the decoder aren't added to a queue and decoded one at a time. The decoder actually performs X decodes at the same time.

The result of all this is that if you try to Sleep, one of the decoders frequently ends up sitting idle.
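A middle ground between approaches 1 and 2 is to spin briefly, then yield, and only then sleep. The sketch below is one way to structure that, using portable C++ so it can run without a decoder: `Status`, `BackoffResult`, `retryWithBackoff`, and the `tryBegin` callback (a stand-in for the BeginFrame call) are all hypothetical names, and the spin and yield limits are untuned assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Stand-ins for the HRESULT values in the post (S_OK / E_PENDING / other failure).
enum class Status { Ok, Pending, Failed };

struct BackoffResult {
    Status status;      // final non-Pending status
    unsigned attempts;  // total calls made to tryBegin
};

// Retries `tryBegin` while it reports Pending.
// Phase 1: the first `spinLimit` pending attempts retry immediately.
// Phase 2: the next `yieldLimit` pending attempts yield the time slice.
// Phase 3: every later pending attempt sleeps for about 1 ms.
BackoffResult retryWithBackoff(const std::function<Status()>& tryBegin,
                               unsigned spinLimit, unsigned yieldLimit) {
    assert(tryBegin);
    unsigned attempts = 0;
    for (;;) {
        Status s = tryBegin();
        ++attempts;
        if (s != Status::Pending) return BackoffResult{s, attempts};
        if (attempts <= spinLimit) continue;
        if (attempts <= spinLimit + yieldLimit) {
            std::this_thread::yield();
            continue;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```

With frames taking ~4 milliseconds, a short spin phase would catch slots that free up almost immediately, and the sleep phase bounds the CPU cost when they don't. On Windows, the granularity of Sleep follows the system timer resolution, which timeBeginPeriod can raise; whether that removes the 8 millisecond overshoot described above is untested here.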

3) Before submitting the next frame for decoding, wait for one of the previous frames to complete:

// LockRect doesn't return until the surface is ready.
D3DLOCKED_RECT lr;

// I don't think this matters.  It may always return the whole frame.
RECT r = {0, 0, 2, 2};
hr = pSurface9Video[old]->LockRect(&lr, &r, D3DLOCK_READONLY);
if (SUCCEEDED(hr))
    pSurface9Video[old]->UnlockRect();

This also drops the CPU usage, but it has a throughput penalty as well. That may be because the 'surface' stays in use longer than the 'decoder,' but more likely it is because of the time it takes to (pointlessly) transfer the frame back to memory.
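The bookkeeping behind approach 3 can be sketched as a sliding window: submit freely until X frames are in flight, then wait on the oldest one before each new submission. This is only an illustration under assumptions; `InFlightWindow` and `simulateWaits` are hypothetical names, and the point where real code would call LockRect/UnlockRect is marked with a comment rather than implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Tracks which surface indices have been submitted but not yet waited on.
class InFlightWindow {
public:
    explicit InFlightWindow(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool full() const { return inFlight_.size() >= capacity_; }
    std::size_t oldest() const { return inFlight_.front(); }
    void retireOldest() { inFlight_.pop_front(); }
    void submitted(std::size_t surfaceIndex) { inFlight_.push_back(surfaceIndex); }

private:
    std::size_t capacity_;
    std::deque<std::size_t> inFlight_;
};

// Simulates decoding `frames` frames round-robin over `surfaces` surfaces and
// returns, in order, the surface indices the caller would have to wait on.
std::vector<std::size_t> simulateWaits(std::size_t frames,
                                       std::size_t surfaces,
                                       std::size_t capacity) {
    InFlightWindow window(capacity);
    std::vector<std::size_t> waitedOn;
    for (std::size_t f = 0; f < frames; ++f) {
        if (window.full()) {
            // Real code would block here, e.g. LockRect/UnlockRect on this surface.
            waitedOn.push_back(window.oldest());
            window.retireOldest();
        }
        window.submitted(f % surfaces);  // real code: BeginFrame ... EndFrame
    }
    return waitedOn;
}
```

For example, with a capacity of 4 the first four frames are submitted without any wait, and from the fifth frame on each submission is preceded by a wait on the oldest outstanding surface.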

Which brings us to the second question:

Q2: Is there some way here to maximize throughput without pointlessly pounding on the CPU?

Final thoughts:

  • It appears that LockRect must be doing a WaitForSingleObject. If I had access to that handle, waiting on it (without also copying the frame back) seems like it would be the best solution. But I can't figure out where to get it. I've tried GetDC, GetPrivateData, even looking at the debug data members for IDirect3DSurface9. I'm not finding it.
  • IDirectXVideoDecoder::EndFrame outputs a handle in a parameter named pHandleComplete. This sounds like exactly what I need. Unfortunately it is marked as "reserved" and doesn't seem to work. Unless there is a trick?
  • I'm pretty new to DirectX, so maybe I've got this all wrong?

Update 1:

Re Q1: Turns out both my machines only support 4 decoders (oops). This will make it harder to determine which property I'm looking for. While very few properties (none actually) return 8 on one machine and 4 on the other, there are several that return 4.

Re Q2: Since the (4) decoders are (presumably) shared between apps, the idea of finding out if the decoding is complete by (somehow) querying to see if the decoder is idle is a non-starter.

The call to create surfaces doesn't create handles (handle count stays the same across the call). So the idea of waiting on the "surface's handle" doesn't seem like it's going to pan out either.

The only idea I have left is to see if the surface is available by making some other call (besides LockRect) using it. So far I've tried calling StretchRect and ColorFill on a surface that the decoder is "still using," but they complete without error instead of blocking like LockRect.

There may not be a better answer here. So far it appears that for best performance, I should use #1. If CPU utilization is an issue, #2 is better than #1. If I'm going to be reading the surfaces back to memory anyway, then #3 makes sense. Otherwise, stick with #1 or #2.
