Yahoo pipes: Unique first word in title only

Question

57 views Asked by Ittaidv At 12 January 2015 at 23:30

I'm building a large Yahoo Pipes project that takes DJ sets from various sources, filters them so the output contains only sets from the artists I filtered for, and presents them in an RSS feed.

Since many sets are posted on multiple websites at the same time, with slight variations in their titles, my feed often has duplicate items despite using the unique filter.

I noticed, however, that most of these titles start with the DJ name. Only the later parts vary (sometimes a country name is added, or the date is displayed in a different format).

What I would like to do is base the unique filter on the first words of the title only. So if these 2 sets are added:

Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015

Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015

The unique filter would filter one of them out based on the first 2 words.

If I filtered only on the first 2 words, the unique filter would of course also block all future sets by this DJ. To avoid this, I would like to add some kind of formula that also takes the pub date into consideration. Let's say I only want 1 item per DJ per week.
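Outside of Pipes, the rule described above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: items are plain `(title, pub_date)` tuples, the DJ name is always exactly the first two words, and "per week" means "at least 7 days since the last kept item for that DJ". A calendar-week bucket would not work for the two example titles, since 11 January 2015 is a Sunday and 12 January 2015 is a Monday. The names `dj_key` and `dedupe` and the third sample title are made up for the example.

```python
from datetime import datetime, timedelta


def dj_key(title, n_words=2):
    """Lower-cased first n_words of the title, used as a stand-in for the DJ name."""
    return " ".join(title.split()[:n_words]).lower()


def dedupe(items, window=timedelta(days=7), n_words=2):
    """Keep an item only if the last kept item with the same DJ key
    is at least `window` older.

    items: iterable of (title, pub_date) tuples (an assumed shape).
    """
    last_kept = {}  # DJ key -> pub date of the most recently kept item
    kept = []
    for title, pub_date in sorted(items, key=lambda item: item[1]):
        key = dj_key(title, n_words)
        previous = last_kept.get(key)
        if previous is None or pub_date - previous >= window:
            kept.append((title, pub_date))
            last_kept[key] = pub_date
    return kept


items = [
    ("Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015",
     datetime(2015, 1, 11)),
    ("Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015",
     datetime(2015, 1, 12)),
    # Hypothetical later set, only here to show that future sets still pass.
    ("Dave Clarke – White Noise 472 – 18-Jan-2015", datetime(2015, 1, 18)),
]

for title, pub_date in dedupe(items):
    print(pub_date.date(), title)
```

A fixed two-word key breaks for DJs whose names have one or three words, so in practice the key would more likely come from the existing list of artist keywords.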

I know this is rather complicated, but would it be possible?

Thanks!

Original Q&A

There are 2 answers

**Julien Genestoux** · Answer 1 · 2015-01-13T10:06:54+00:00

I believe you could get pretty good results by considering ngrams rather than words. Basically, instead of splitting the title into words, consider sequences of n characters (3 is probably a good number, but it's worth testing).

So, "Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015" would become a list like this:

["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]

and "Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015" would give something like:

["Dav", "ave", "ve ", "e C", " Cl", "Cla", ..., "-20", "201", "015"]

Once you have the ngrams for each title, you can easily count how many they have in common: the more they share, the likelier it is that they're the same title.
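As a rough sketch of that comparison, here is one way it could look in Python. Two choices here are assumptions rather than part of the suggestion above: titles are lower-cased first (so "Jan" and "JAN" match), and the shared count is normalized as a Jaccard ratio (shared ngrams divided by all distinct ngrams) so that long and short titles are comparable. The function names are made up for the example.

```python
def ngrams(text, n=3):
    """Set of all n-character sequences in the lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard ratio of the two ngram sets: 0.0 = nothing shared, 1.0 = identical sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two titles shorter than n characters: treat as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


first = "Dave Clarke – White Noise #471 – Best of 2014 (Electro Edition) – 11-Jan-2015"
second = "Dave Clarke – White Noise 471 (Best of 2014 Electro) – 12-JAN-2015"

print(round(similarity(first, second), 2))
```

Two titles would then count as duplicates when the score is above some threshold. The right threshold, like the value of n, is something to tune against real feed data.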

**Ittaidv** · Answer 2 · 2015-01-15T00:50:21+00:00

Is there a way to automate this in pipes? I've got a growing list of over 1000 keywords to handle and a growing list of 500 feeds as an input.

Ngrams look really nice, but it would be cool if there was some kind of tool that would let me break up the titles of the links into these ngrams so I can compare them :)
