Using k-means clustering on web log data


I have a data set from a web access log file in which I'm interested in finding similar clusters. (I'm an absolute beginner at data mining.) So far I have read several research papers in the same problem domain:

An Efficient Approach for Clustering Web Access Patterns from Web Logs

Classifying the user intent of web queries using k-means clustering

I want to use k-means clustering to cluster web pages. Although these papers discuss the algorithm, they do not specify how to prepare the input data set. k-means calculates similarity between data points using Euclidean distance. So how do I normalize my dataset so that it can be mined with k-means, since URLs cannot be used directly? Any help or good references on this?

Example dataset (p1..pn are different web pages):

p1,p2,p3,p4
p1,p2
p1,p5,p6,p7
p1,p2,p3,p5

There is 1 answer

michal hlavac

If I understand correctly, you have a set of web pages and, for each of them, the set of users who have visited it. E.g. (0 = not visited, 1 = visited):

WebPage  | user1 | user2 | user3
webpage1 |     0 |     1 |     1
webpage2 |     1 |     0 |     0
webpage3 |     0 |     1 |     1

and you want to find the distances between these web pages.
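A table like the one above can be built from raw visit records. The following is only a minimal sketch: the `visits` list and the `build_matrix` helper are hypothetical names chosen for illustration, and it assumes the log has already been parsed into (user, page) pairs.

```python
# Hypothetical visit records: (user, page) pairs already extracted from the web log.
visits = [
    ("user2", "webpage1"), ("user3", "webpage1"),
    ("user1", "webpage2"),
    ("user2", "webpage3"), ("user3", "webpage3"),
]

def build_matrix(visits):
    """Return (pages, users, matrix); matrix[i][j] is 1 if users[j] visited pages[i]."""
    pages = sorted({page for _, page in visits})
    users = sorted({user for user, _ in visits})
    seen = set(visits)
    matrix = [[1 if (user, page) in seen else 0 for user in users] for page in pages]
    return pages, users, matrix

pages, users, matrix = build_matrix(visits)
for page, row in zip(pages, matrix):
    print(page, row)
```

Each row is then a fixed-length numeric vector for one web page, which is the kind of input k-means expects.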

Then the Euclidean distance between two web pages is:
d(wp1, wp2) = sqrt((u1,1 - u2,1)^2 + (u1,2 - u2,2)^2 + ... + (u1,n - u2,n)^2),
where ui,j = 1 if user j has visited webpage i, otherwise 0 (e.g. u2,1 = 1 because user1 has visited webpage2).

So d(webpage1, webpage2) = sqrt((0 - 1)^2 + (1 - 0)^2 + (1 - 0)^2) = sqrt(3)
and d(webpage1, webpage3) = sqrt(0 + 0 + 0) = 0, i.e. there is zero distance between them because exactly the same users visited both pages.
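The two distances above can be checked in a few lines of Python, and the same vectors can be passed to k-means. This is a rough sketch rather than a reference implementation: the `euclidean` and `kmeans` functions are written here for illustration, and seeding the centroids with the first k points is an arbitrary assumption (a real run would normally use random or k-means++ initialisation from a library).

```python
import math

# Rows of the visit table: one 0/1 vector per web page.
vectors = {
    "webpage1": [0, 1, 1],
    "webpage2": [1, 0, 0],
    "webpage3": [0, 1, 1],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids (arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

print(euclidean(vectors["webpage1"], vectors["webpage2"]))  # sqrt(3)
print(euclidean(vectors["webpage1"], vectors["webpage3"]))  # 0.0
print(dict(zip(vectors, kmeans(list(vectors.values()), k=2))))
```

With k = 2, webpage1 and webpage3 end up in the same cluster and webpage2 in the other, which matches the zero distance computed by hand.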

You can also take into account how many times each user visited a particular web page. Then, instead of 1 for a visited webpage, use a weighted value: numberOfParticularWebPageVisits/numberOFAllUsersWebpageVisits
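As an illustration of the weighting, the sketch below reads the ratio as "this user's visits to the page divided by this user's visits to all pages". That reading is an assumption, and the `counts` data and the `weighted_vector` helper are made up for the example.

```python
# Hypothetical visit counts: counts[user][page] = how many times the user opened the page.
counts = {
    "user1": {"webpage2": 4},
    "user2": {"webpage1": 1, "webpage3": 3},
    "user3": {"webpage1": 2, "webpage3": 2},
}
pages = ["webpage1", "webpage2", "webpage3"]
users = ["user1", "user2", "user3"]

def weighted_vector(page):
    """One weight per user: visits to this page / all page visits by that user."""
    vec = []
    for user in users:
        total = sum(counts[user].values())
        vec.append(counts[user].get(page, 0) / total if total else 0.0)
    return vec

for page in pages:
    print(page, weighted_vector(page))
```

With weights, webpage1 and webpage3 no longer coincide, because user2 visited webpage3 more often than webpage1; the Euclidean distance between the weighted vectors reflects that difference.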