In the KDD99 data set, the 32nd and 33rd features have values greater than 100 for a huge number of connections.
I can't understand how features computed over a window of 100 connections can take a value greater than 100.
I have consulted a lot of material, but found nothing.
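One way to check this observation on a copy of the data is sketched below. It is only an illustration: it assumes the usual comma-separated KDD99 layout (41 features followed by a label), so the 32nd and 33rd features sit at zero-based indices 31 and 32. The function name `summarize_host_counts` and the `threshold` parameter are hypothetical, and the rows in the usage lines are synthetic, not taken from the data set.

```python
import csv

# Zero-based column positions, assuming 41 features followed by a label.
FEATURE_32 = 31  # "dst host count"
FEATURE_33 = 32  # "dst host srv count"


def summarize_host_counts(lines, threshold=100):
    """Report the value range of features 32 and 33 and how many rows exceed threshold.

    `lines` is any iterable of CSV text lines, for example an open file object.
    """
    rows = 0
    over_threshold = 0
    lowest = None
    highest = None
    for row in csv.reader(lines):
        if len(row) <= FEATURE_33:
            continue  # skip blank or truncated lines
        # int(float(...)) tolerates values written as "9" or "9.00".
        values = (int(float(row[FEATURE_32])), int(float(row[FEATURE_33])))
        rows += 1
        if max(values) > threshold:
            over_threshold += 1
        lowest = min(values) if lowest is None else min(lowest, *values)
        highest = max(values) if highest is None else max(highest, *values)
    return {"rows": rows, "over_threshold": over_threshold,
            "min": lowest, "max": highest}


def make_row(feature_32, feature_33):
    """Build a synthetic record with only features 32 and 33 filled in."""
    fields = ["0"] * 41 + ["normal."]
    fields[FEATURE_32] = str(feature_32)
    fields[FEATURE_33] = str(feature_33)
    return ",".join(fields)


sample = [make_row(9, 9), make_row(255, 120), make_row(50, 101)]
print(summarize_host_counts(sample))
```

With a real file, the same function could be called as `summarize_host_counts(open(path))`, where `path` points to whichever copy of the data set is being inspected.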
The dataset contains 41 features for each connection.
These features were obtained by preprocessing TCP dump files.
To do so, packet information in the TCP dump file was summarized into connections. Specifically (http://kdd.ics.uci.edu/databases/kddcup99/task.html):
Some of the features (the so-called Time-based Traffic Features) were calculated over a 2-second temporal window.
Other features (Host-based Traffic Features) were calculated using a historical window estimated over a number of connections (in this case 100).
Host-based features are useful for attacks which span intervals longer than 2 seconds.
The 2-second and 100-connection values are somewhat arbitrary.
The values of these two classes of features have no upper limit (e.g. the number of connections to the same host over the 2-second interval can be greater than 100).
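To illustrate why a count taken over a time window is not capped at 100, here is a minimal sketch of a sliding 2-second window. It is not the original KDD extraction code: the function name `same_host_counts`, the input format of `(timestamp, dst_host)` pairs sorted by time, and the choice to include the current connection in its own count are all assumptions made for the example.

```python
from collections import deque


def same_host_counts(connections, window_seconds=2.0):
    """For each connection, count connections to the same destination host
    seen within the preceding `window_seconds` (current connection included).

    `connections` is an iterable of (timestamp, dst_host) pairs sorted by timestamp.
    """
    recent = deque()   # connections still inside the time window
    per_host = {}      # dst_host -> number of its connections in the window
    counts = []
    for timestamp, dst_host in connections:
        recent.append((timestamp, dst_host))
        per_host[dst_host] = per_host.get(dst_host, 0) + 1
        # Evict connections that have fallen out of the time window.
        while timestamp - recent[0][0] > window_seconds:
            _, old_host = recent.popleft()
            per_host[old_host] -= 1
        counts.append(per_host[dst_host])
    return counts


# Synthetic burst: 150 connections to one host, 10 ms apart (about 1.5 s in total).
burst = [(i * 0.01, "10.0.0.1") for i in range(150)]
print(max(same_host_counts(burst)))  # 150
```

Because the window is bounded in time rather than in number of connections, a sufficiently fast burst pushes the count past 100.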
The same "should be" true for:
The problem is that there is no documentation explaining the details of the KDD feature extraction. The main reference is:
A Framework for Constructing Features and Models for Intrusion Detection Systems, Wenke Lee and Salvatore J. Stolfo
from which it's clear that the bro-ids tool was used:
and
but this is not enough.
Both dst host count and dst host srv count are in the [0,255] range.
The AI-IDS/kdd99_feature_extractor project on GitHub can extract the 32nd and 33rd features from raw data (take a look at the stats*.cpp files) but:
Related questions on Stack Overflow are: