I have a UDF that accepts a bag as input and converts it to a map. The keys of the map are the distinct elements in the bag, and the values are their counts.
But it's failing the JUnit tests.
The output of 1 is correct: in your UDF you are counting the number of tuples that have the same value for the first field, but in the test you are adding only one tuple with two values.
If what you want is to count the number of tuples with the same value as "key" (where key is the first value in your tuple), then what you are doing is correct, but you would have to change your test:
    public void testExecWithSimpleMap() throws Exception {
        Tuple inputTuple = tupleFactory.newTuple(1);
        DataBag dataBag = bagFactory.newDefaultBag();

        Tuple nestedTuple = tupleFactory.newTuple(2);
        nestedTuple.set(0, "12345");
        nestedTuple.set(1, "another value");
        dataBag.add(nestedTuple);

        // Add a second tuple. Use a fresh Tuple rather than mutating
        // the one that was already added to the bag.
        Tuple secondTuple = tupleFactory.newTuple(2);
        secondTuple.set(0, "12345");
        secondTuple.set(1, "and another value");
        dataBag.add(secondTuple);

        inputTuple.set(0, dataBag);
        Map output = testClass.exec(inputTuple);

        // JUnit convention: expected value first, actual value second
        assertEquals(1, output.size());
        System.out.println(output.get("12345"));
        assertEquals(2, output.get("12345"));
    }
However, if what you wanted was to count how many times a value is repeated in the whole bag, whether it appears in different tuples or in the same tuple (this is not very clear from your question), then you need to change your UDF:
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class BagToMap extends EvalFunc<Map> {
        public Map exec(Tuple input) throws IOException {
            if (input == null) {
                return null;
            }
            DataBag values = (DataBag) input.get(0);
            Map<String, Integer> m = new HashMap<String, Integer>();
            for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
                Tuple t = it.next();
                // Iterate through the fields of the Tuple as well
                for (Iterator<Object> it2 = t.iterator(); it2.hasNext();) {
                    Object o = it2.next();
                    String key = o.toString();
                    if (m.containsKey(key)) {
                        m.put(key, m.get(key) + 1);
                    } else {
                        m.put(key, 1);
                    }
                }
            }
            return m;
        }
    }
In this case, your test should pass.
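To make the difference between the two counting strategies concrete, here is a minimal plain-Java sketch with no Pig dependency. It assumes a bag can be modeled as a `List` of tuples and a tuple as a `List<Object>`; the class name `BagCountSketch` and both method names are hypothetical and exist only for illustration, and the second tuple shape (the same value in both fields) is made up for the demo.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BagCountSketch {

    // Counts tuples by the value of their first field only
    // (the behaviour the answer attributes to the original UDF).
    public static Map<String, Integer> countByFirstField(List<List<Object>> bag) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<Object> tuple : bag) {
            if (tuple.isEmpty()) {
                continue; // nothing to use as a key
            }
            counts.merge(String.valueOf(tuple.get(0)), 1, Integer::sum);
        }
        return counts;
    }

    // Counts every field of every tuple in the bag.
    public static Map<String, Integer> countAllFields(List<List<Object>> bag) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<Object> tuple : bag) {
            for (Object field : tuple) {
                counts.merge(String.valueOf(field), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // One tuple with two fields (illustrative data)
        List<List<Object>> oneTuple = Arrays.asList(
                Arrays.<Object>asList("12345", "12345"));
        // Two tuples that share the same first field
        List<List<Object>> twoTuples = Arrays.asList(
                Arrays.<Object>asList("12345", "another value"),
                Arrays.<Object>asList("12345", "and another value"));

        System.out.println(countByFirstField(oneTuple));  // {12345=1}
        System.out.println(countByFirstField(twoTuples)); // {12345=2}
        System.out.println(countAllFields(oneTuple));     // {12345=2}
    }
}
```

With first-field counting, a single tuple can never contribute more than 1 to a key, no matter how many fields it has; only adding more tuples, or counting every field, raises the count.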
It seems you expect to use a bag of two tuples, but you are in fact creating a bag that contains a single tuple with two fields.
This code :
should be transformed to :
Or you may iterate on all your tuple's fields.