Saturday, May 31, 2008

Video Scene Detection with DirectShow.NET

For some time I've been working on a video-related personal project. I'm using the fantastic DirectShow .NET library, which provides a nice C# interface to Microsoft's DirectShow C++ API. At one point some folks on the DS .NET forums asked about the scene detection algorithm I referenced in one of my forum posts. I promised to follow up with some sample code and explanations and--finally--here they are.

I've created a sample solution to demonstrate my scene detection algorithm. It's based on the DxScan sample available with the other DS .NET samples on the DS .NET download page. My algorithm is not yet production code, but it has proven very reliable in my own testing: it is 100% accurate against my test video library, which comprises 600 minutes of actual sports video containing 1,800 scene changes (including both night and daytime events), plus several short test videos created explicitly to stress the algorithm.

At a high level, scene detection involves the following steps (a simplified code sketch follows the list):

  1. Randomly select 2,000 of the RGB values composing a single video frame. The same sample positions are read from every subsequent frame; these are the values on which we'll perform a longitudinal (or cross-frame) analysis to detect scene changes for the entire duration of the video.
  2. Analyze the current frame:
    1. Calculate the average RGB value for the current frame. If the average is unusually high or low, we're detecting scenes shot in bright or dim light conditions and will need to raise or lower our scene detection thresholds accordingly.
    2. Perform an XOR diff between the RGB values in the previous and current frames. The XOR diff amplifies minor differences between frames (versus a simple integer difference), which improves detection of scene changes between similar-looking scenes as well as detection in low-light conditions, where we tend to be dealing with lower RGB values.
    3. Calculate the average RGB difference between the current and previous frames. In other words, add up the XOR diff values from step 2.2 and divide by the number of sampled values.
    4. Calculate the change in average RGB difference between the current and previous frames. This is a bit tricky to understand, but it's critical to achieving a high level of accuracy when differentiating between new scenes and random noise (such as high-motion close-ups or quick pans/zooms). If the previous frame's change in average RGB difference is above a defined positive threshold (normalized for the light conditions detected in step 2.1) and the current frame's change in average RGB difference is below a defined negative threshold, then the previous frame is flagged as a scene change. In simple terms, we're taking advantage of the fact that scene changes nearly always result in a two-frame spike/crash in frame-to-frame differences, while pans, zooms, and high-motion close-ups result in a gradual ramp-up/ramp-down in frame-to-frame differences.
    5. Advance to the next frame and repeat step 2.
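
Here is a minimal, self-contained sketch of the per-frame analysis described above. It leaves out all of the DirectShow plumbing and simply assumes each frame arrives as a tightly packed 24-bit RGB byte array; the class name, the thresholds, and the "128 = normal brightness" reference value are illustrative assumptions for this post, not the actual SceneDetector code.

```csharp
using System;

// Illustrative sketch only: the class name, thresholds, and structure are my
// assumptions for this post, not the actual SceneDetector implementation.
public class SceneChangeSketch
{
    private const int SampleCount = 2000;             // step 1: values sampled per frame
    private const double BaseSpikeThreshold = 12.0;   // illustrative positive threshold
    private const double BaseCrashThreshold = -12.0;  // illustrative negative threshold

    private int[] sampleOffsets;     // fixed byte offsets read from every frame
    private byte[] previousSamples;  // sampled values from the previous frame
    private double previousAvgDiff;  // average XOR diff of the previous frame pair
    private double previousDelta;    // change in average diff at the previous frame
    private int frameIndex;

    // Reports the index of the frame flagged as the start of a new scene.
    public event Action<int> SceneChangeDetected;

    // frameBytes is assumed to be a tightly packed 24-bit RGB buffer for one frame.
    public void AnalyzeFrame(byte[] frameBytes)
    {
        if (sampleOffsets == null)
        {
            // Step 1: randomly choose which RGB byte values to track across frames.
            var rng = new Random();
            sampleOffsets = new int[SampleCount];
            for (int i = 0; i < SampleCount; i++)
                sampleOffsets[i] = rng.Next(frameBytes.Length);
        }

        // Step 2.1: average value of the sampled bytes, used to normalize thresholds.
        var samples = new byte[SampleCount];
        long sum = 0;
        for (int i = 0; i < SampleCount; i++)
        {
            samples[i] = frameBytes[sampleOffsets[i]];
            sum += samples[i];
        }
        double avgBrightness = (double)sum / SampleCount;

        // Scale thresholds up for bright footage and down for dim footage;
        // 128 is treated as "normal" brightness purely for illustration.
        double lightFactor = avgBrightness / 128.0;
        double spikeThreshold = BaseSpikeThreshold * lightFactor;
        double crashThreshold = BaseCrashThreshold * lightFactor;

        if (previousSamples != null)
        {
            // Steps 2.2 and 2.3: XOR diff against the previous frame, averaged over the samples.
            long diffSum = 0;
            for (int i = 0; i < SampleCount; i++)
                diffSum += samples[i] ^ previousSamples[i];
            double avgDiff = (double)diffSum / SampleCount;

            // Step 2.4: change in the average diff between consecutive frame pairs.
            double delta = avgDiff - previousAvgDiff;

            // A spike at the previous frame followed immediately by a crash marks the
            // previous frame as a cut; gradual ramps (pans, zooms, high-motion
            // close-ups) fail one of the two tests.
            if (previousDelta > spikeThreshold && delta < crashThreshold &&
                SceneChangeDetected != null)
            {
                SceneChangeDetected(frameIndex - 1);
            }

            previousAvgDiff = avgDiff;
            previousDelta = delta;
        }

        previousSamples = samples;
        frameIndex++;  // step 2.5: advance to the next frame
    }
}
```

The two-part test in the sketch is the heart of step 2.4: a cut produces a large positive delta on the first frame of the new scene and a large negative delta immediately afterward, whereas motion noise produces a run of smaller deltas that rarely satisfies both tests back to back.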

I'll try to expand and clarify the above steps when I have time, but for now you'll have to read the code if you need to understand the algorithm in more detail. The only limitations in the current implementation (that I'm aware of) are the following:

  1. Dropped frames are interpreted as scene changes. This issue can be minimized in most applications by choosing a minimum scene duration and discarding new-scene events fired by the SceneDetector inside the minimum-duration window (see the sketch after this list).
  2. Scene transition effects (fades, dissolves, etc.) are not supported and scene changes involving such effects are not detected.
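
To illustrate the workaround for limitation 1, here is a small, hypothetical filter that discards scene-change events falling inside the minimum-duration window. The event shape (a timestamp per detected change) is an assumption about how you might consume the SceneDetector's output, not its actual API.

```csharp
using System;

// Hypothetical wrapper around the detector's output: suppresses scene-change
// events that arrive within the minimum-duration window of the last accepted one.
public class MinimumDurationFilter
{
    private readonly TimeSpan minimumSceneDuration;
    private TimeSpan lastAcceptedTime = TimeSpan.MinValue;

    public MinimumDurationFilter(TimeSpan minimumSceneDuration)
    {
        this.minimumSceneDuration = minimumSceneDuration;
    }

    // Returns true only if the event falls outside the minimum-duration window.
    public bool Accept(TimeSpan eventTime)
    {
        if (lastAcceptedTime != TimeSpan.MinValue &&
            eventTime - lastAcceptedTime < minimumSceneDuration)
        {
            return false;
        }

        lastAcceptedTime = eventTime;
        return true;
    }
}
```

Constructing the filter with, say, TimeSpan.FromSeconds(2) suppresses any change reported within two seconds of the last accepted one, which swallows most dropped-frame false positives at the cost of missing genuinely short scenes.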

If you encounter any other issues with the algorithm, I'd love the opportunity to see and analyze the video that broke it!

 