Duke deduplication engine: can't find exact records

I'm trying to create a configuration and processor for Duke to find exact matches in a record list. I created an ExactComparator-based processor, but deduplication does not report any exact matches. Here's the setup of the processor, configuration, and listener:

public void setup() {
  //setup
  List<Property> exactMatchProperties = new ArrayList<Property>();

  exactListener = new TestUtils.TestListener();

  ExactComparator exactComparator = new ExactComparator();

  //create properties (columns) with name, comparator, and low/high probabilities. The ID property has no comparator or probabilities.

  exactMatchProperties.add(new PropertyImpl("ID"));
  exactMatchProperties.add(new PropertyImpl("NAME", exactComparator, 0.0, 1.0));
  exactMatchProperties.add(new PropertyImpl("EMAIL", exactComparator, 0.0, 1.0));

  //create new configuration implementation
  exactMatchConfig = new ConfigurationImpl();

  //add properties to config
  exactMatchConfig.setProperties(exactMatchProperties);
  exactMatchConfig.setThreshold(1.0);
  exactMatchConfig.setMaybeThreshold(0.0);

  //initialize the processor and add match listener
  exactMatchProcessor = new Processor(exactMatchConfig, true);
  exactMatchProcessor.addMatchListener(exactListener);
}

And here's my function to test:

public void testExactMatch() {
  Collection<Record> records = new ArrayList<Record>();
  Record rec1 = TestUtils.makeRecord(new String[] { "ID", "1", "NAME", "Jon", "EMAIL", "[email protected]" });
  Record rec2 = TestUtils.makeRecord(new String[] { "ID", "1", "NAME", "Jon", "EMAIL", "[email protected]" });

  records.add(rec1);
  records.add(rec2);

  exactMatchProcessor.deduplicate(records);
  System.out.println(exactListener.getMatches().size());
}

I'm using the Java API. I've read the question on SO mentioned here, but that question refers to the XML configuration, while I'm doing the tests in Java.

Shouldn't getMatches() return a non-empty collection here? How can I get a list of the duplicates found, or the opposite (a list of unique records, without duplicates)? Thanks
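
To make the expected result concrete, the following is a minimal plain-Java sketch of what "exact match" on NAME and EMAIL would mean. It does not use Duke at all: the class ExactMatchBaseline, the Row type, and the sample values (including the distinct IDs and the example.com addresses) are hypothetical and only illustrate the output being asked for, not how Duke computes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical baseline, independent of Duke: exact matching on NAME and EMAIL.
class ExactMatchBaseline {

    // Hypothetical stand-in for a record with the three columns from the question.
    static final class Row {
        final String id;
        final String name;
        final String email;

        Row(String id, String name, String email) {
            this.id = id;
            this.name = name;
            this.email = email;
        }
    }

    // Groups rows whose NAME and EMAIL are exactly equal, keeping first-seen order.
    static Map<List<String>, List<Row>> groupByNameAndEmail(List<Row> rows) {
        Map<List<String>, List<Row>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            List<String> key = Arrays.asList(row.name, row.email);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Returns only the groups that contain more than one row (the duplicates).
    static List<List<Row>> duplicates(List<Row> rows) {
        List<List<Row>> result = new ArrayList<>();
        for (List<Row> group : groupByNameAndEmail(rows).values()) {
            if (group.size() > 1) {
                result.add(group);
            }
        }
        return result;
    }

    // Returns one representative row per group (the list without duplicates).
    static List<Row> uniques(List<Row> rows) {
        List<Row> result = new ArrayList<>();
        for (List<Row> group : groupByNameAndEmail(rows).values()) {
            result.add(group.get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("1", "Jon", "jon@example.com"),
            new Row("2", "Jon", "jon@example.com"),
            new Row("3", "Ann", "ann@example.com"));

        System.out.println(duplicates(rows).size()); // prints 1
        System.out.println(uniques(rows).size());    // prints 2
    }
}
```

In other words, for two records with identical NAME and EMAIL, the result I'm after is one duplicate group (or, for the opposite case, one surviving record).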
