I am working on a keystroke biometric authentication project. It is essentially a wrapper over traditional password-based authentication: if the password is correct, the system checks the "typing rhythm" and gives a positive output if it matches the user's profile; otherwise, a negative output is given. The "typing rhythm" is captured by timing properties extracted while the password is typed. There are five features, namely PP (press-press time), PR (press-release time), RP (release-press time), RR (release-release time) and total time. PP is the time between pressing two consecutive keys (characters). RR is the time between releasing two consecutive keys. PR is the time for which a key is held down (pressed and then released). RP is the time between releasing a key and pressing the next key. Total time is the time between pressing the first key of the password and releasing the last key.
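To make the definitions concrete, here is a small sketch of how these five features relate to raw press/release timestamps. The tuple-per-key layout is just for illustration, not the actual GREYC file format:

```python
# Illustrative only: each session is assumed to be a list of
# (press_time, release_time) tuples in seconds, one per key typed.
def extract_features(session):
    pp, pr, rp, rr = [], [], [], []
    for i, (press, release) in enumerate(session):
        pr.append(release - press)              # PR: hold time of key i
        if i > 0:
            prev_press, prev_release = session[i - 1]
            pp.append(press - prev_press)       # PP: press-to-press interval
            rp.append(press - prev_release)     # RP: release-to-press interval
            rr.append(release - prev_release)   # RR: release-to-release interval
    total = session[-1][1] - session[0][0]      # first press to last release
    return pp, pr, rp, rr, total

# A 3-key session gives 3 PR values but only 2 each of PP, RP and RR.
print(extract_features([(0.00, 0.08), (0.15, 0.26), (0.31, 0.40)]))
```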
I'm using an open database, GREYC Web-based keystroke dynamics, for the project. Each session of data collection contains the ASCII value of the key pressed and the timing values for PP, PR, RP, RR and the total time. It also records whether the password was typed by the genuine user or an impostor. During data collection, users were allowed to choose their own passwords, so naturally there are passwords of varying length. On top of that, a user might press extra keys (like Shift, Caps Lock, Backspace, Delete, etc.), so even for a single user, different sessions of typing the password can have different password lengths. Note that password length in this context is the total number of keys (characters) the user typed. For example, suppose the user's actual password is "abcd". In one session he types it properly and the password length is 4. In another session he types the keys a, l, BACKSPACE, b, c, d, and thus the password length is 6.
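As a toy illustration of the problem (made-up numbers, not the GREYC file format), the per-session feature rows end up ragged:

```python
import numpy as np

session1_pp = np.array([0.15, 0.16, 0.14])              # 4 keys -> 3 PP intervals
session2_pp = np.array([0.15, 0.20, 0.18, 0.17, 0.16])  # 6 keys -> 5 PP intervals
sessions = [session1_pp, session2_pp]  # ragged: can't stack into one rectangular array
```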
Here is some context on the proposed system. Its block diagram is as follows: the "Input Feature Space Partition" creates subsets of the actual database, which are fed to different classifiers, namely Gaussian, K-NN and OCSVM. The outputs of these classifiers are fed to a back-propagation neural network (BPNN), whose result is the final output. The BPNN is used to penalize the classifiers that give a wrong result and reward those that give a correct one.
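For reference, here is a rough scikit-learn sketch of that block diagram. It assumes each session has already been reduced to a fixed-length vector, which is exactly the part I am stuck on, and GaussianNB is only a stand-in for the Gaussian classifier:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import OneClassSVM
from sklearn.neural_network import MLPClassifier

X = np.random.rand(100, 5)        # placeholder fixed-length feature vectors
y = np.random.randint(0, 2, 100)  # 1 = genuine user, 0 = impostor

gauss = GaussianNB().fit(X, y)
knn = KNeighborsClassifier().fit(X, y)
ocsvm = OneClassSVM().fit(X[y == 1])  # one-class: trained on genuine samples only

# The three classifier outputs become the input features of the BPNN.
stacked = np.column_stack([gauss.predict(X),
                           knn.predict(X),
                           ocsvm.predict(X)])  # OCSVM outputs +1 / -1
bpnn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000).fit(stacked, y)
```

In a real setup the BPNN would be trained on held-out predictions of the base classifiers, not on the same data they were fitted to.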
My question is: how do I represent this varying-length data in a structured format so that it can be processed and used in scikit-learn?
I have looked into pandas and NumPy for pre-processing the data, but my problem precedes the pre-processing stage.
Thanks in advance!
An option would be a recurrent neural network (RNN). These are networks that effectively feed into themselves, creating a function of time, or in your case, of relative position in a word. In the standard diagram, the left part (before the arrow) shows the compact, folded structure of an RNN, and the right part shows it unrolled over timesteps. Values are passed not only between nodes in the network, but also between timesteps. This generalized structure allows the network to handle arbitrary time spans, or in your case, arbitrary word lengths.
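Concretely, in the simplest RNN the hidden state at step t is h_t = tanh(W_x·x_t + W_h·h_(t-1) + b), so what the network outputs after each keystroke depends on all the keystrokes that came before it.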
A common variant of RNNs that achieves even better results on some problems is the LSTM, or long short-term memory network.
To avoid an overly complex introductory answer, I will not go into too much detail. Basically, LSTMs have more complex "hidden units" that facilitate more nuanced decisions about what data is kept and what is "forgotten".
If you would like to implement these yourself, look into TensorFlow. If there is a library you are more comfortable with, feel free to research its implementation of RNNs and LSTMs, but if not, TensorFlow is a great place to start.
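As a minimal sketch in TensorFlow/Keras (shapes and hyperparameters here are illustrative, assuming each timestep is one keystroke with its four timing features), you can pad the variable-length sessions to a common length and let a Masking layer tell the LSTM to ignore the padding:

```python
import numpy as np
import tensorflow as tf

# Two made-up sessions: 4 and 6 keystrokes, each with 4 timing
# features (PP, PR, RP, RR) per keystroke. Lengths are deliberately ragged.
sessions = [np.random.rand(4, 4), np.random.rand(6, 4)]
labels = np.array([1, 0])  # 1 = genuine user, 0 = impostor

# Pad every session with zeros to the length of the longest one.
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sessions, padding="post", dtype="float32", value=0.0)

model = tf.keras.Sequential([
    # Masking makes the LSTM skip the zero-padded timesteps entirely.
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 4)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(padded, labels, epochs=5, verbose=0)
```

Because of the mask, the padded zeros never influence the learned rhythm profile, so sessions of length 4 and 6 can live in the same batch.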
Good luck in your research, and I hope this helps!