Yahoo pipes: Unique first word in title only

I'm building a large Yahoo Pipes project that takes DJ sets from various sources, filters them so the output only contains sets from the artists I've selected, and presents them in an RSS feed.

Since many sets are posted on multiple websites at the same time, with small variations in their titles, my feed often contains duplicate items despite the unique filter.

However, I noticed that most of these titles start with the DJ name. Only the trailing parts vary (sometimes a country name is added, or the date is displayed in a different format).

What I would like to do is base the unique filter on the first words of the title only (the DJ name). So if these 2 sets are added:

Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015

Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015

the unique filter would drop one of them based on the first 2 words.

If I filtered only on the first 2 words, the unique filter would of course also block all future sets by this DJ. To avoid this, I would like to add some kind of rule that takes the pub date into account as well. Let's say I only want 1 item per DJ per week.

I know this is rather complicated, but would it be possible?

Thanks!

There are 2 answers

Julien Genestoux On

I believe you could get pretty good results by comparing ngrams rather than words. Basically, instead of splitting the title into words, split it into overlapping sequences of n characters (3 is probably a good number, but it's worth testing).

So, "Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015" would become a list like this:

["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]

and "Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015" would give something like:

["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]

Once you have the ngrams for each title, you can easily count how many they have in common: the more they share, the likelier it is that they're the same title.
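
To make the comparison concrete, here is a minimal Python sketch of the idea, run outside of Pipes. The function names, the lowercasing step and the use of Jaccard similarity (shared ngrams divided by all distinct ngrams) are assumptions made for illustration, not something Pipes provides, and `title_c` is a made-up unrelated title. Any threshold for "same set" would have to be tuned on real feed data.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text.

    Assumption: case and repeated whitespace are ignored, so that
    "11-Jan-2015" and "12-JAN-2015" share more n-grams.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(title_a, title_b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a = char_ngrams(title_a, n)
    grams_b = char_ngrams(title_b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty titles are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


title_a = "Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015"
title_b = "Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015"
title_c = "Completely unrelated podcast episode"  # hypothetical title

print(ngram_similarity(title_a, title_b))  # the two postings of the same set
print(ngram_similarity(title_a, title_c))  # an unrelated item, for contrast
```

The two postings of the same set should score much higher than the unrelated title, so items whose score exceeds a chosen cut-off could be treated as duplicates.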

Ittaidv On

Is there a way to automate this in Pipes? I've got a growing list of over 1000 keywords to handle and a growing list of 500 feeds as input.

Ngrams look really nice, but it would be cool if there was some kind of tool that would let me break the titles of the links up into these ngrams so I can compare them :)
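
For reference, the "1 item per DJ per week" rule from the question could be sketched in Python as a post-processing step on the feed items, outside of Pipes. This is only an illustration under a few assumptions: the DJ name is taken to be everything before the first " – " separator, "per week" is read as a rolling 7-day window rather than a calendar week (the two example titles are dated 11 and 12 January 2015, which fall in different ISO weeks), and the third item is a hypothetical later set added to show that future sets are not blocked. The helper names are made up.

```python
from datetime import date, timedelta


def artist_of(title, separator=" – "):
    """Assumption: the DJ name is the text before the first separator."""
    return title.split(separator, 1)[0].strip().lower()


def one_per_artist_per_week(items, window=timedelta(days=7)):
    """Keep an item only if no kept item by the same artist is
    less than `window` older. `items` is a list of (title, pub_date)."""
    last_kept = {}  # artist -> pub date of the most recently kept item
    kept = []
    for title, published in sorted(items, key=lambda item: item[1]):
        artist = artist_of(title)
        previous = last_kept.get(artist)
        if previous is None or published - previous >= window:
            last_kept[artist] = published
            kept.append((title, published))
    return kept


items = [
    ("Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015",
     date(2015, 1, 11)),
    ("Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015",
     date(2015, 1, 12)),
    # Hypothetical later set by the same DJ, one week after the first one.
    ("Dave Clarke – White Noise 472", date(2015, 1, 18)),
]

for title, published in one_per_artist_per_week(items):
    print(published, title)
```

With these three items, the second posting of episode 471 is dropped while the later set is kept. Sorting by date first means the earliest posting of each set is the one that survives.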