How to Optimize the code after creating map

75 views Asked by At

I want to make the second part work thanks for the help:)

If arraylist is ['A','a','B','C','C'] then dups:['A','a','C','C'] nondups: ['B'] I tried :

Map<String, Long> counts = nums.parallelStream()
  .collect( Collectors.groupingBy( {k -> k.toLowerCase()}, Collectors.counting()) )

It gives counts: {a:2, b:1, c:2}

Now I am finding dups and nondups which is done by below code or alternate suggestion to solve this:

 Map<Boolean, List<String>> hasDup =
            counts.entrySet().parallelStream()
              .collect(Collectors.partitioningBy(
                 entry -> entry.getValue() > 1,
                 Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
List<String> dup = hasDup.get(true);
     List<String> nodup = hasDup.get(false);

dup:['a','c'] nondups:['b'] I need help in second part because the expected output is not the same as I want which is

dups:['A','a','C','C'] nondups: ['B']
2

There are 2 answers

3
Leo Tapia On

It seems you're mixing things, because the algorithm does not returns what you want even if it was a non-parallel stream. In the second part you are obtaining the key you've used to group due to duplication counting, which also is in lowercase. So, you won't retrieve the uppercase ones.

Then, what you really need is to "unroll" the dups as many times as the associated count.

List<Character> nums = List.of('A' , 'a' , 'B' , 'C' , 'C');
Map<String, Long> counts = nums.parallelStream()
    .map(String::valueOf)
    .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));

List<Character> dup = new ArrayList<>();
List<Character> nodup = new ArrayList<>();
nums.forEach(x -> (counts.get(x.toString().toLowerCase()) > 1 ? dup : nodup).add(x));

Then, since it iterates over nums to fill both dup and nodup lists, you have the original Characters.

0
Stephen C On

As Leo notes, the first part does not benefit from parallelism since the work nearly all happens in the collector. The same probably applies to the second part as well.

.... or alternate suggestion to solve this:

I think that the you will get significantly better performance if you don't use streams. Instead partition the input manually and give the partitions to N threads to populate a shared ConcurrentHashMap with key -> count entries.

Also, I think you need to rethink the way you are representing the outputs from this process. It strokes me that generating lists that contains multiple copies of the duplicates is unnecessary and inefficient. Even separating the duplicates from the non-duplicates seems unnecessary. Instead of generating lists, see if you can change the downstream code to stream data from the Map directly. Or push the data into concurrent queues for multiple downstream worker threads to consume.


Observation: in order to effectively exploit parallelism in a solution to a problem, it is often necessary to change the problem.