I'm building a big Yahoo Pipes project that takes DJ sets from various sources, filters them so the output only contains sets from the artists I selected, and presents them in an RSS feed.
Since many sets are posted on multiple websites at the same time with slight variations in their titles, my feed often contains duplicate items, despite using the Unique filter.
I noticed, however, that most of these titles start with the DJ name. Only the trailing part varies (sometimes a country name is added, or the date is displayed in a different format).
What I would like to do is base the Unique filter on the first two words only. So if these 2 sets are added:
Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015
Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015
The unique filter would filter one of them out based on the first 2 words.
If I filtered on the first 2 words alone, the Unique filter would of course also block all future sets by this DJ. To avoid this, I would like to add some kind of formula that takes the pub date into consideration as well. Let's say I only want 1 item per DJ per week.
I know this is rather complicated, but would it be possible?
Thanks!
I believe you could get pretty good results by comparing n-grams rather than words. Basically, instead of splitting each title into words, split it into overlapping sequences of n characters (3 is probably a good number, but it's worth testing).
So, "Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015" would become a list like this:
["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]
and "Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015" would give something like:
["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]
Once you have the n-grams for each title, you can easily count how many they have in common: the more they share, the likelier it is that they're the same title.
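
To make the idea concrete, here is a minimal Python sketch of the comparison. It is not something Yahoo Pipes runs directly, so treat it as an illustration of the logic rather than a drop-in module. The function names, the use of Jaccard similarity (shared n-grams divided by all distinct n-grams), the 0.5 threshold and the 7-day window are all my own assumptions and would need tuning against your real feed.

```python
from datetime import date


def ngrams(text, n=3):
    """Return the set of character n-grams in text, ignoring case."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity of the two titles' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def dedupe(items, threshold=0.5, max_days=7):
    """Drop an item if an already kept item has a similar title and a close date.

    items is a list of (title, pub_date) tuples. threshold and max_days
    are assumed starting values, not tested recommendations.
    """
    kept = []
    for title, pub_date in items:
        is_duplicate = any(
            similarity(title, kept_title) >= threshold
            and abs((pub_date - kept_date).days) <= max_days
            for kept_title, kept_date in kept
        )
        if not is_duplicate:
            kept.append((title, pub_date))
    return kept


feed = [
    ("Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015",
     date(2015, 1, 11)),
    ("Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015",
     date(2015, 1, 12)),
]
for title, pub_date in dedupe(feed):
    print(pub_date, title)
```

With the two titles from the question, the second one is dropped because it shares most of its trigrams with the first and was published a day later. A later set by the same DJ would still get through once it falls outside the date window, which is roughly the "1 item per DJ per week" behaviour you asked for.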