I have a data set from a web access log file in which I'm interested in finding similar clusters. (I'm an absolute beginner in data mining.) So far I have read many research papers on the same problem domain:
An Efficient Approach for Clustering Web Access Patterns from Web Logs
Classifying the user intent of web queries using k-means clustering
I want to use k-means clustering to cluster web pages. Although these papers discuss the algorithm, they do not specify how to prepare the input data set. k-means
calculates similarity between data points using Euclidean distance, so how do I normalize my data set for mining with k-means,
since URLs cannot be used directly as k-means input? Any help/good reference on this?
Example data set (p1..pn are different web pages):
p1,p2,p3,p4
p1,p2
p1,p5,p6,p7
p1,p2,p3,p5
If I understand it correctly, you have a set of web pages and, for each of them, the set of users that have visited it, e.g. (0 - not visited, 1 - visited):

      user1  user2  user3
wp1     0      1      1
wp2     1      0      0
wp3     0      1      1

and you want to find distances between these web pages.
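A minimal sketch of how the raw log data could be turned into such vectors, treating each line of the example data set as one user's set of visited pages (the variable names here are just illustrative):

```python
# Each row of the example data set: the pages one user visited.
sessions = [
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2"],
    ["p1", "p5", "p6", "p7"],
    ["p1", "p2", "p3", "p5"],
]

# Collect all distinct pages, in a stable order.
pages = sorted({p for s in sessions for p in s})

# Binary matrix: rows = pages, columns = users; 1 if that user visited the page.
# Each row is then a numeric vector usable as a k-means data point.
matrix = [[1 if p in s else 0 for s in sessions] for p in pages]

for p, row in zip(pages, matrix):
    print(p, row)
```

Each page is now a vector of 0/1 values over users, which is exactly the representation the distance formula below operates on.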
Then the Euclidean distance between two web pages is:

d(wp1, wp2) = sqrt((u1,1 - u2,1)^2 + (u1,2 - u2,2)^2 + ... + (u1,n - u2,n)^2),

where ui,j = 1 if user j has visited web page i, and 0 otherwise.
So d(wp1, wp2) = sqrt((0 - 1)^2 + (1 - 0)^2 + (1 - 0)^2) = sqrt(3)
and d(wp1, wp3) = sqrt(0 + 0 + 0) = 0, i.e. zero distance between them.
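The arithmetic above can be checked with a few lines of Python, using the three-user visit vectors for each page:

```python
import math

# Visit vectors over three users (1 = visited, 0 = not visited).
wp1 = [0, 1, 1]
wp2 = [1, 0, 0]
wp3 = [0, 1, 1]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(wp1, wp2))  # sqrt(3) ≈ 1.732
print(euclidean(wp1, wp3))  # 0.0
```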
You can also take each user's number of visits to a particular web page into account; then, instead of 1 for a visited web page, use a weighted value: numberOfParticularWebPageVisits / numberOfAllWebPageVisitsByThatUser.
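A sketch of that weighting, with made-up visit counts purely for illustration:

```python
# Hypothetical per-user visit counts: user -> {page: number of visits}.
visit_counts = {
    "user1": {"p1": 4, "p2": 1},
    "user2": {"p1": 2, "p3": 2},
}

# Replace raw counts with each page's fraction of that user's total visits,
# so the vector entries become weights instead of plain 0/1 flags.
weights = {}
for user, counts in visit_counts.items():
    total = sum(counts.values())
    weights[user] = {page: n / total for page, n in counts.items()}

print(weights["user1"]["p1"])  # 4 / (4 + 1) = 0.8
print(weights["user2"]["p3"])  # 2 / (2 + 2) = 0.5
```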