Identity mask for numpy ndarray


I would have expected True to preserve an ndarray when used as a mask. However, it adds a dimension, just like None.

import numpy as np

arr = np.arange(16).reshape(2, 4, 2)
np.all(arr[True] == arr)         # outputs: True

Close enough, however looking closer:

arr[True].shape                  # outputs: (1, 2, 4, 2)
arr[None].shape                  # outputs: (1, 2, 4, 2)

I found two ways to set an identity mask: using slice(None) or Ellipsis.

np.all(arr[slice(None)] == arr)  # outputs: True
arr[slice(None)].shape           # outputs: (2, 4, 2)

np.all(arr[Ellipsis] == arr)     # outputs: True
arr[Ellipsis].shape              # outputs: (2, 4, 2)

Nothing really surprising here as this is how slicing works in the first place. slice(None) is a tad ugly and Ellipsis seems a wee bit faster.
However, going through the documentation, I am not sure I fully understand this:

Deprecated since version 1.15.0: In order to remain backward compatible with a common usage in Numeric, basic slicing is also initiated if the selection object is any non-ndarray and non-tuple sequence (such as a list) containing slice objects, the Ellipsis object, or the newaxis object, but not for integer arrays or other embedded sequences.

I understand that the best way to preserve an array is not to mask it, but say I really want to set up a default value for a mask... ;-)

Question: Which is the preferred way to set up an identity mask? And if I may, is True adding a dimension the intended behavior?


There are 2 answers

1
user2357112 On

You keep saying "mask", but it doesn't sound like you really want a masking operation at all, even an "identity" mask. A mask array would typically be a boolean array of the same shape as the original array, and indexing with the mask would produce a 1D array with items selected by the mask. Even an all-true mask would produce a flattened copy of the array it was applied to. It wouldn't be an identity operation. It's possible to do weirder things with masks, but not an identity operation.
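
As a small illustration of the point above (a sketch reusing the question's example array, not part of the original answer), indexing with an all-True mask of the same shape yields a flattened copy rather than the original array:

```python
import numpy as np

arr = np.arange(16).reshape(2, 4, 2)

# A boolean mask with the same shape as arr, True everywhere.
mask = np.ones(arr.shape, dtype=bool)
selected = arr[mask]

print(selected.shape)                   # (16,)
print(np.shares_memory(selected, arr))  # False
```

The values are all there, but the shape is lost and the result no longer shares memory with arr.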

If you want an indexer that outputs an array equivalent to the original, the typical, most general way to do that is a literal ellipsis, written as three dots:

arr[...]

Unlike :, this also works for 0-dimensional arrays. Note that this produces a view, not a copy. There is no indexer that would produce a copy and work properly for all input dimensions.
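
A minimal sketch of these claims, assuming the question's example array. The select helper and its index parameter are hypothetical names, added only to show how Ellipsis could serve as the "default mask" the question asks about:

```python
import numpy as np

arr = np.arange(16).reshape(2, 4, 2)
scalar = np.array(5)  # 0-dimensional array

# arr[...] keeps the shape and is a view on the same memory.
full = arr[...]
print(full.shape)                   # (2, 4, 2)
print(np.shares_memory(full, arr))  # True

# The ellipsis also works on a 0-dimensional array.
print(scalar[...].shape)            # ()

# A bare colon does not: it needs at least one dimension to slice.
try:
    scalar[:]
    colon_ok = True
except IndexError:
    colon_ok = False
print(colon_ok)                     # False


def select(a, index=Ellipsis):
    """Hypothetical helper: index `a`, defaulting to the identity indexer."""
    return a[index]


print(select(arr).shape)            # (2, 4, 2)
print(select(arr, 0).shape)         # (4, 2)
```

Because the result is a view, writing through it modifies arr. Call .copy() on the result if an independent array is needed.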


arr[True] works like it does primarily out of a desire to have 0-dimensional arrays follow the same boolean indexing rules as positive-dimensional arrays. As mentioned above, if you index an n-dimensional array with an n-dimensional mask, the result is a 1-dimensional array. If you index a 0-dimensional array with a 0-dimensional mask, the result is again a 1-dimensional array:

In [1]: import numpy

In [2]: x = numpy.array([[1, 2], [3, 4]])

In [3]: x[x % 2 == 0]
Out[3]: array([2, 4])

In [4]: y = numpy.array([1, 2, 3, 4])

In [5]: y[y % 2 == 0]
Out[5]: array([2, 4])

In [6]: z = numpy.array(5) # 0-dimensional!

In [7]: z[z % 2 == 0]
Out[7]: array([], dtype=int64)

In [8]: z[z % 2 == 1]
Out[8]: array([5])

Indexing a 0-dimensional array with a 0-dimensional mask increases the dimensionality by 1. Generalized to higher dimensions, indexing an n-dimensional array with a 0-dimensional mask produces an (n+1)-dimensional array. If the mask is True, the extra dimension has length 1; if the mask is False, the extra dimension has length 0, and the output has no elements. This generalized behavior is rarely useful, but it's what fits best with the (rarely useful) rules for applying a positive-dimension mask to an array with mismatching dimensions.
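
The rule can be checked directly on the question's array and on a 0-dimensional one. This is a short sketch, with variable names chosen only for illustration:

```python
import numpy as np

arr = np.arange(16).reshape(2, 4, 2)  # 3-dimensional
z = np.array(5)                       # 0-dimensional

# A scalar True prepends a dimension of length 1.
print(arr[True].shape)   # (1, 2, 4, 2)
print(z[True].shape)     # (1,)

# A scalar False prepends a dimension of length 0, so nothing is selected.
print(arr[False].shape)  # (0, 2, 4, 2)
print(arr[False].size)   # 0
print(z[False].shape)    # (0,)
```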

2
hpaulj On

For a sample 2d array:

In [172]: x=np.array([[1,2],[4,3]])
In [173]: x.__array_interface__
Out[173]: 
{'data': (50806320, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (2, 2),
 'version': 3}

A view with ellipsis:

In [174]: x[...].__array_interface__
Out[174]: 
{'data': (50806320, False),          # same as for x
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (2, 2),
 'version': 3}

A view with an added dimension:

In [175]: x[None].__array_interface__
Out[175]: 
{'data': (50806320, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (1, 2, 2),
 'version': 3}

A copy with an added dimension. Note the changed data address: this is advanced indexing.

In [176]: x[True].__array_interface__
Out[176]: 
{'data': (50796640, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (1, 2, 2),
 'version': 3}

Another copy, this time with a size-0 dimension. The data address matches the previous result only because the freed memory is being reused.

In [177]: x[False].__array_interface__
Out[177]: 
{'data': (50796640, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (0, 2, 2),
 'version': 3}
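
As an alternative to comparing data addresses by eye, np.shares_memory gives a direct view-versus-copy check. The sketch below uses the same sample array and also shows the practical consequence for writes:

```python
import numpy as np

x = np.array([[1, 2], [4, 3]])

print(np.shares_memory(x, x[...]))   # True: basic indexing, a view
print(np.shares_memory(x, x[None]))  # True: basic indexing, a view
print(np.shares_memory(x, x[True]))  # False: advanced indexing, a copy

# Writing through the view changes x.
view = x[None]
view[0, 0, 0] = 99
print(x[0, 0])  # 99

# Writing to the copy leaves x untouched.
copied = x[True]
copied[0, 0, 0] = -1
print(x[0, 0])  # 99
```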

The only applicable reference in the indexing page that I can find is:

https://numpy.org/doc/stable/reference/arrays.indexing.html#detailed-notes

the nonzero equivalence for Boolean arrays does not hold for zero dimensional boolean arrays.

I wouldn't be surprised if this behavior were a leftover from some past implementation. Due to a history of merging several numeric packages, there are some rough edges. Some of those have been deprecated, or are in the process of being deprecated.

A scalar boolean index is a zero dimensional boolean array:

In [178]: np.array(True).shape
Out[178]: ()

We can add the new dimension elsewhere:

In [181]: x[:,True].shape
Out[181]: (2, 1, 2)
In [183]: x[...,False].shape
Out[183]: (2, 2, 0)
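
To tie the last two observations together, here is a brief sketch (same sample array) showing that an explicit 0-dimensional boolean array indexes the same way as the Python scalar, and that the position of the boolean decides where the new axis lands:

```python
import numpy as np

x = np.array([[1, 2], [4, 3]])

# A 0-dimensional boolean array behaves like the scalar True.
print(x[np.array(True)].shape)  # (1, 2, 2)
print(x[True].shape)            # (1, 2, 2)

# The new axis appears where the boolean sits in the index tuple.
print(x[:, True].shape)         # (2, 1, 2)
print(x[..., True].shape)       # (2, 2, 1)
print(x[..., False].shape)      # (2, 2, 0)
```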