I'm developing a Windows application that would allow the user to fully interact with his computer using a Kinect sensor. The user should be able to teach the application his own gestures and assign each one of them some Windows event. After the learning process, the application should detect user's movements, and when it recognizes a known gesture, the assigned event should be fired.
The crucial part is the custom gesture recognizer. Since the gestures are user-defined, the problem cannot be solved by hard-coding all the gestures directly into the application. I've read many articles discussing this problem, but none of them has given me the answer to my question: which algorithm is the best for learning and recognizing user-defined gestures?
I'm looking for algorithm that is:
- Highly flexible (the gestures can vary from simple hand gestures to whole body movements)
- Fast & effective (the application might be used with video games so we can't consume all of the CPU capacity)
- Doesn't require more than 10 repetitions when learning new gesture (repeating gesture more than 10 times to teach application is in my opinion not very user friendly)
- Easy to implement (preferably, I want to avoid struggling with two-page equations or so)
Note that the outcome does not have to be perfect. If the algorithm recognizes wrong gesture from time to time, it is more acceptable than if the algorithm runs slow.
I'm currently deciding between 3 approaches:
- Hidden Markov Models - these seem to be very popular when comes to gesture recognition, but also seem pretty hard to understand and implement. Besides, I'm not sure if HMM are suitable for what I'm trying to accomplish.
- Dynamic Time Warping - came across this site offering gesture recognition using DTW, but many users are complaining about the performance.
- I was thinking about adapting the $1 recognizer to 3D space and using movement of each joint as a single stroke. Then I would simply compare the strokes and pick the most similar gesture from the set of known gestures. But, in this case, I'm not sure about the performance of this algorithm, since there are many joints to compare and the recognition has to run in real-time.
Which of these approaches do you think is most suitable for what I'm trying to do? Or are there any other solutions to this problem? I would appreciate any piece of advice that could move me forward. Thank you.
(I'm using Kinect SDK.)