Content Classification from URL

Given n raw URLs, I'd like to classify each one as news, blog, photo, or video.

For example, if a link points to a photo, is it enough that the raw URL contains an image file extension to classify it as photo?

For video, blog and news, a fixed set of domains (like http://www.youtube.com) doesn't seem to be enough to classify the raw URLs.

Could classification be done by examining the web content? Or are there any open source tools for this?

1 answer

Answer by Mike Christian

The only URLs that can be classified even somewhat reliably are those that point to a distinct medium (e.g. http://foo.com/foo.jpg is almost certainly an image). Otherwise, you must analyze the content of the page.
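As a rough illustration of the extension check, here is a minimal Python sketch. The function name `classify_by_extension` and the two extension sets are my own assumptions, not a standard list, and a matching extension is only a hint: a server is free to return anything for a URL ending in .jpg.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension sets (an assumption, not a standard); extend as needed.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".mkv", ".flv"}


def classify_by_extension(url):
    """Return "photo", "video", or None when the URL path gives no clue."""
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in PHOTO_EXTENSIONS:
        return "photo"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    # No recognizable media extension: the page content has to be analyzed.
    return None
```

A URL such as http://www.youtube.com/watch?v=abc has no extension in its path, so this returns None for it, which is exactly the case where the content has to be examined.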

This can be tricky: a Flash object may contain a photo, a video, or neither, without exposing any searchable clue about its content. With enough effort this can be overcome (Google does it!), but I'm not aware of any open source resource that provides a library of media-related domains. Such data is the product of countless programmer-hours, an effort that typically seeks a return on investment (ROI). Case in point: ClueWeb09 is just a dataset of downloaded pages, used to test search algorithms, and is not really sorted or categorized.
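One cheap starting point for content analysis is metadata that a page declares about itself. The sketch below assumes the page carries an Open Graph `og:type` meta tag, which many pages omit, and uses only Python's standard `html.parser`. The names `OpenGraphTypeParser` and `classify_from_html` are hypothetical, and the mapping is deliberately coarse: `og:type` cannot tell news from blog, so both come back as "article".

```python
from html.parser import HTMLParser


class OpenGraphTypeParser(HTMLParser):
    """Collects the content of the first <meta property="og:type"> tag."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first og:type meta tag.
        if tag != "meta" or self.og_type is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:type":
            self.og_type = attributes.get("content")


def classify_from_html(html):
    """Return "video", "article", or None based on the declared og:type."""
    parser = OpenGraphTypeParser()
    parser.feed(html)
    og_type = (parser.og_type or "").lower()
    if og_type.startswith("video"):
        return "video"
    if og_type == "article":
        # Assumption: news and blog posts both declare "article".
        return "article"
    return None
```

Fetching the HTML is left out on purpose. Pages without the tag return None and would need heavier analysis, which is where the real effort goes.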

"Sometimes no help is the answer."