Python's `collections.Counter` with non-integer values


collections.Counter is extremely useful to gather counts and operate on counts as objects. You can do things like:

>>> Counter({'a': 2, 'b': 5}) + Counter({'a': 3, 'c': 7})
Counter({'c': 7, 'a': 5, 'b': 5})

This essentially groups items by their keys and sums the values of each group.

What is a minimum-code way to reuse the Counter functionality with non-integer values?

These values would already have addition defined as the "reducing" operation I want for each group: for example, strings and lists, which both define __add__ as concatenation.

As is, we get:

>>> Counter({'a': 'hello ', 'b': 'B'}) + Counter({'a': 'world', 'c': 'C'})
Traceback (most recent call last):
    ...
TypeError: '>' not supported between instances of 'str' and 'int'
>>> Counter({'a': [1, 2], 'b': [3, 4]}) + Counter({'a': [1, 1], 'c': [5, 6, 7]})
Traceback (most recent call last):
    ...
TypeError: '>' not supported between instances of 'list' and 'int'

The code of collections.Counter hard-codes the assumption that values are integers, so things like self.get(k, 0) and count > 0 are littered all over the place. Subclassing Counter therefore doesn't seem to be much less work than writing my own specialized (or general) custom class, probably based on collections.defaultdict.
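
The integer assumption can be observed directly. A minimal sketch, relying only on documented Counter behavior (a missing key reads as 0, and + keeps only results greater than zero); the variable names are just for illustration:

```python
from collections import Counter

c = Counter({'a': 2})

# Missing keys read as the integer 0, and the lookup does not insert the key.
missing = c['not there']

# Addition keeps only strictly positive results, so 'a' is dropped here.
summed = Counter({'a': 2}) + Counter({'a': -2, 'b': 1})

# With str values the concatenation itself succeeds; the `> 0` check
# applied to the result is what raises.
error = None
try:
    Counter({'a': 'hello '}) + Counter({'a': 'world'})
except TypeError as exc:
    error = exc

print(missing, summed, type(error).__name__)
```

So any value type used with Counter's + has to tolerate both being added to 0 and being compared with 0.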

Instead, wrapping the values (e.g. str and list) so that they can operate with 0 as if it were an empty element might be the more elegant approach.

There are 2 answers

thorwhalen On

I'd propose two solutions:

The first wraps the values themselves, though only the examples in the question are assured to be covered here; other Counter operations are not:

>>> from collections import Counter
>>> class ZeroAsEmptyMixin:
...     _empty_val = None
...
...     def __gt__(self, other):
...         if isinstance(other, int) and other == 0:
...             other = self._empty_val
...         return super().__gt__(other)
...
...     def __add__(self, other):
...         if isinstance(other, int) and other == 0:
...             other = self._empty_val
...         return self.__class__(super().__add__(other))
...
...
>>> class mystr(ZeroAsEmptyMixin, str):
...     _empty_val = str()
...
...
>>> class mylist(ZeroAsEmptyMixin, list):
...     _empty_val = list()
...
...
>>>
>>> Counter({'a': mystr('hello '), 'b': mystr('B')}) + Counter({'a': mystr('world'), 'c': mystr('C')})
Counter({'a': 'hello world', 'c': 'C', 'b': 'B'})
>>> Counter({'a': mylist([1, 2]), 'b': mylist([3, 4])}) + Counter({'a': mylist([1, 1]), 'c': mylist([5, 6, 7])})
Counter({'c': [5, 6, 7], 'b': [3, 4], 'a': [1, 2, 1, 1]})

The second is a "Counter-like" custom class that subclasses defaultdict and, again, implements only the __add__ method:

>>> from collections import defaultdict
>>> from functools import wraps
>>>
>>> class MapReducer(defaultdict):
...     @wraps(defaultdict.__init__)
...     def __init__(self, *args, **kwargs):
...         super().__init__(*args, **kwargs)
...         self._empty_val = self.default_factory()
...
...     def __add__(self, other):
...         if not isinstance(other, self.__class__):
...             return NotImplemented
...         result = self.__class__(self.default_factory)
...         for elem, val in self.items():
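...             # .get() avoids defaultdict's side effect of inserting missing keys into other
...             newval = val + other.get(elem, self._empty_val)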
...             if newval != self._empty_val:
...                 result[elem] = newval
...         for elem, val in other.items():
...             if elem not in self and val != self._empty_val:
...                 result[elem] = val
...         return result
...
...
>>>
>>> strmp = lambda x: MapReducer(str, x)
>>> strmp({'a': 'hello ', 'b': 'B'}) + strmp({'a': 'world', 'c': 'C'})
MapReducer(<class 'str'>, {'a': 'hello world', 'b': 'B', 'c': 'C'})
>>> listmp = lambda x: MapReducer(list, x)
>>> listmp({'a': [1, 2], 'b': [3, 4]}) + listmp({'a': [1, 1], 'c': [5, 6, 7]})
MapReducer(<class 'list'>, {'a': [1, 2, 1, 1], 'b': [3, 4], 'c': [5, 6, 7]})
ShadowRanger On

I wouldn't call "defining addition to integers for collection types" more elegant than just rewriting Counter to do the work you need. If you need behaviors unrelated to counting, you need a class that isn't focused on counting things.

Fundamentally, Counter is inappropriate for your use case; you don't have counts. What does elements mean when you lack a count to multiply each key by? most_common might work as written, but it would have nothing to do with frequencies.
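
To make that concrete, here is a small sketch of what those two methods do with string values. It assumes nothing beyond the standard library; the variable names are illustrative:

```python
from collections import Counter

c = Counter({'a': 'hello', 'b': 'B'})

# elements() repeats each key `count` times, so a str "count" is rejected
# once the iterator is consumed.
elements_error = None
try:
    list(c.elements())
except TypeError as exc:
    elements_error = exc

# most_common() only sorts items by value, so it runs, but the order is
# lexicographic here and says nothing about frequency.
ranked = c.most_common()

print(type(elements_error).__name__, ranked)
```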

In 95% of cases, I'd just use collections.defaultdict(list) (or whatever default is appropriate), and in the other 5%, I'd use Counter as a model and implement my own version (without count-specific behaviors).
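
As a rough sketch of the defaultdict route, the merge from the question can be written as a short helper. The name merge_concat and its factory parameter are hypothetical, chosen for this example only, and the helper assumes that factory() returns the identity element for + on the value type:

```python
from collections import defaultdict


def merge_concat(*mappings, factory=list):
    """Merge mappings, combining values that share a key with `+`."""
    merged = defaultdict(factory)
    for mapping in mappings:
        for key, value in mapping.items():
            # `+` builds a new object, so the input values are never mutated.
            merged[key] = merged[key] + value
    return dict(merged)


lists = merge_concat({'a': [1, 2], 'b': [3, 4]}, {'a': [1, 1], 'c': [5, 6, 7]})
strings = merge_concat({'a': 'hello ', 'b': 'B'}, {'a': 'world', 'c': 'C'}, factory=str)
print(lists)
print(strings)
```

Unlike Counter's +, this keeps keys whose merged value is empty; filtering them out would be one extra condition if that behavior is wanted.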