Cluster URLs based on their pattern using Python


I am new to clustering techniques and would highly value any input on the problem below. Basically, I want to cluster URLs based on their structural patterns. For example:

  • cluster1 - simple URLs https://domain/path/file
  • cluster2 - shortened URLs
  • cluster3 - redirect URLs
  • ....
  • cluster k - new URL pattern

Given a URL dataset, I want to understand how many different URL pattern clusters exist and then visualize the differences between them.

The existing methods I have found cluster by domain (they group URLs of the same website together), which is not what I want. The same thing happens when I try NLP-based (word-based) similarity clustering, because URLs of the same website tend to share the same words with only small differences.

Instead, I want to focus on the URL structure and identify URL patterns. Removing all the special characters and creating a bag of words for each URL discards the URL structure. Can anyone help me identify a suitable clustering technique, as well as a vectorizing technique, for finding different URL pattern clusters?

Thanks in advance, Matheesha


There is 1 answer

ASH

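Here is an example of how to cluster text by edit distance. It builds a negated Levenshtein distance matrix and passes it to affinity propagation as a precomputed similarity.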

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # third-party package that provides distance.levenshtein

# Replace this list with your own strings
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")
words = np.asarray(words)  # so that indexing with a list will work

# Affinity propagation expects similarities, so negate the edit distances
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    # Each cluster is represented by an exemplar, which is one of the input strings
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Result:

 - *eating:* climbing, eating
 - *google:* google, squooshy
 - *feedback:* feedback
 - *face:* face, map
 - *impressed:* impressed
 - *extension:* extension
 - *key:* belly, best, key, kitten, merley
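
To apply the same idea to URL structure rather than words, one possible approach is to reduce each URL to a structural signature before computing the distances, so that two URLs from different websites with the same layout end up close together. The sketch below is only an illustration under my own assumptions: the function name url_signature and the class labels A (run of letters) and D (run of digits) are made up for this example, and the pure-Python levenshtein helper stands in for distance.levenshtein so that the snippet needs only the standard library.

```python
import re
from urllib.parse import urlsplit


def url_signature(url):
    """Describe the shape of a URL, ignoring the actual words in it.

    Hypothetical helper: runs of letters become "A", runs of digits
    become "D", and delimiters such as / . ? = & are kept as they are.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname excludes any port or credentials
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query

    def shape(text):
        text = re.sub(r"[A-Za-z]+", "A", text)  # letters first
        return re.sub(r"[0-9]+", "D", text)     # then digits

    return parts.scheme + "://" + shape(host) + shape(rest)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Made-up example URLs, one per pattern mentioned in the question
urls = [
    "https://example.com/path/file.html",
    "https://bit.ly/3xYz9",
    "https://example.org/redirect?url=http://other.com/page",
    "https://shop.example.com/item/42",
    "https://news.example.net/story/7",
]
signatures = [url_signature(u) for u in urls]

# Negated distances between signatures, in the same form as lev_similarity
similarity = [[-levenshtein(s1, s2) for s1 in signatures] for s2 in signatures]

for u, s in zip(urls, signatures):
    print(s, "<-", u)
```

The similarity matrix has the same shape and sign convention as lev_similarity above, so it could be wrapped in np.array and passed to AffinityPropagation(affinity="precomputed") in its place. The last two example URLs belong to different domains but get the same signature, which is the behaviour the question asks for. How coarse the signature should be (for example, whether to keep the number of path segments or the file extension) is a design choice that depends on the dataset.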