Transforming less frequent values

54 views Asked by At

Suppose I have the following columns for a csv that I read through a 'File Reader' node:

id, name, city, income

After reading it, I notice that the column 'city' contains a huge number of unique values. I want to:

  1. Know which values are the 'k' most frequent for 'city'
  2. Modify those which are not the 'k' most frequent to hold something like 'other'

Example:

id, name, city, income
1, Person 1, New York, 100.000 
2, Person 2, Toronto, 90.000
3, Person 3, New York, 50.000
4, Person 4, Seattle, 60.000

Choosing k to be 1, I want to produce the following table:

id, name, city, income
1, Person 1, New York, 100.000 
2, Person 2, Other, 90.000
3, Person 3, New York, 50.000
4, Person 4, Other, 60.000

It happens because 'New York' is the '1' most frequent value for 'city' in the original table.

Do you know how I can do that using Knime?

Thanks a lot!

1

There are 1 answers

0
Gábor Bakos On BEST ANSWER

You can use the CSV Reader to read the data. With the Statistics and Row Filter nodes you can find the k most frequent values. From those, you can create a collection cell using GroupBy. With that collection value, you can use Rule Engine with a similar ruleset:

$city$ IN $most frequent cities$ => $city$
TRUE => "Other"