The data-set looks like this:
Server IP Indicator Session_ID Time
2 1.20.54.221 A 00:00:01
2 1.20.54.221 A 00:01:00
1 1.20.54.221 Site A 00:02:00
1 1.20.54.221 B 00:05:00
2 1.20.54.221 Site B 00:08:00
2 1.20.54.221 C 00:10:00
2 1.20.54.221 C 00:15:00
1 1.20.54.221 F 01:00:00
1 1.20.54.221 F 01:05:00
2 1.20.54.221 Site F 01:08:00
The above data set is read from a log file. Session_id
will change when the server changes ( i.e. 1
-> 2
or 2
-> 1
). There are cases where the server changes right after a change occurred (i.e. 1
---->2
----->1
). Whenever a server is changed, the server would record the last Session_id
and return a new Session_id
at its second browse of the same server. ( e.g. 3rd observation : Session_id
still A
, and changed to B
for 4th observation) . If the server changes in the way (i.e. 1
---->2
----->1
---->1
), it will return A
-----> A
----->B
---> C
, in which B
is generated by server 2
and C
is generated by the second 1
My objective is to determine whether there is subgroup(s) among records. The rule is :
Given the same IP, If the time difference between the current record and last record does not exceed 30 minutes, then the record belongs to the same user.
I have an indicator variable site
to determine whether the site has changed. It is marked based on whether the change of server happens and the time difference is less than 30 mins.
Desired data set:
Server IP Indicator Session_ID Time Difference Last_site
2 1.20.54.221 A 00:00:01 . .
2 1.20.54.221 A 00:01:00 59s .
1 1.20.54.221 Site A 00:02:00 1 Min .
1 1.20.54.221 A 00:05:00 3 Min Site
2 1.20.54.221 Site A 00:08:00 3 Min Site
2 1.20.54.221 A 00:10:00 2 Min Site
2 1.20.54.221 A 00:15:00 5 Min Site
1 1.20.54.221 F 01:00:00 45min .
1 1.20.54.221 F 01:05:00 5 Min .
2 1.20.54.221 Site F 01:08:00 3 Min .
My implementation:
data log_file;
set log;
retain _Session_id Last_site;,
* Assign value to retain ;
if indicator = "Site" then _Session_id = Session_id;
* if last_site = Site, its value has to be changed;
last_site=lag(site);
* Record that should be in another group ;
if difference >30 then Last_Site = "";
* Replace;
if last_site not eq "" then session_id = _session_id
run;
And the problem is, the retained variable
will change from A
to B
at the fifth observation, while I would like to have it stick to the value A
until I found a record with time difference greater than 30
mins. ( the process would loop over hundred thousands of IP and therefore efficiency is also concerned.)
Is there a possible way to handle the problem elegantly? Thank you in advance.
[ Edited on 22/6]
I am thinking the possibility of using Hash Object to do the job. I wrote some codes (obviously they does not work and may incur syntax error ).
data test11a;
length mark $ 12 ip $ 32 Session_id $ 200 agent last_agent $ 200; <== the system said there is error with the type of variable and therefore I add this
Declare hash hid;
hid = _new_ hash( ordered : ' ascending');
hid.definekey('ip');
hid.definedata('Session_id');
hid.definedone();
call missing ( ip, Session_id);
do while ( not done);
set test11 end=done; <==== I have the original data-set called test11
by ip notsorted ; <==== hash object by group;
if not first.ip or not last.ip then do;
if mark = "Site" then rc= hid.add();
*if mark = "Next_Group" then hid.remove(key : ip); <=== Error
end;
if mark not eq "Site" or "Next_Group" then do;
rc=hid.find(); <==== Find matching iP and if matching ( rc=0)
* use variable _ session_id to test;
if rc = 0 then _session_id = Session_id;
end;
end;
run;
And the output data set has only two observations. Anyone could help to explain this?
This is a partial answer, as I could not figure out how you construct 'Last_site'. Seems to me what you want is to check whether the difference is over 30 mins/1800 sec, if 'no', the session_id stays the same, if 'yes', it adopts a new session_id. I may have oversimplified your problem, but the outcome seems close: