Migrating python2 mixed-type np.array operations to python3


I'm migrating from python2 to python3 and I'm facing an issue which I have simplified to this:

import numpy as np
a = np.array([1, 2, None])
(a > 0).nonzero()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: '>' not supported between instances of 'NoneType' and 'int' 

In reality I'm processing NumPy arrays with millions of elements and really need to keep the NumPy operation for performance. In Python 2 this worked fine and returned what I expected, since Python 2 allows ordering comparisons between None and int. What is the best approach for migrating this?

There are 2 answers

CDJB (best answer)

One way to achieve the desired result is to use a lambda function with np.vectorize:

>>> a = np.array([1, 2, None, 4, -1])
>>> f = np.vectorize(lambda t: t and t>0)
>>> np.where(f(a))
(array([0, 1, 3], dtype=int64),)
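One thing to watch with this approach: unless otypes is given, np.vectorize infers the output dtype from the result for the first element, so an array that starts with None or 0 can end up with a different output dtype. A minimal sketch of a more explicit variant, assuming you only want strictly positive entries (the name is_positive is mine, not from the question):

```python
import numpy as np

a = np.array([None, 1, 2, None, 4, -1])  # object dtype, starts with None

# otypes=[bool] fixes the output dtype up front instead of inferring it
# from the first element; the explicit None check avoids relying on truthiness.
is_positive = np.vectorize(lambda t: t is not None and t > 0, otypes=[bool])

indices = np.where(is_positive(a))[0]
print(indices)
```

This is still a Python-level loop under the hood, so it does not change the performance picture.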

Of course, if the array doesn't contain negative integers, you could just use np.where(a), as both None and 0 would evaluate to False:

>>> a = np.array([1, 2, None, 4, 0])
>>> np.where(a)
(array([0, 1, 3], dtype=int64),)

Another way this can be solved is by first converting the array to use the float dtype, which has the effect of converting None to np.nan. Then np.where(a>0) can be used as normal.

>>> a = np.array([1, 2, None, 4, -1])
>>> np.where(a.astype(float) > 0)
(array([0, 1, 3], dtype=int64),)
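To make the intermediate step visible, here is a small sketch of what the conversion does, assuming the array holds only numbers and None. The names as_float, missing and positive are just for illustration. Note that float64 cannot represent every integer above 2**53 exactly, so this may not suit very large integer values.

```python
import numpy as np

a = np.array([1, 2, None, 4, -1])        # object dtype because of the None

as_float = a.astype(float)               # None becomes nan
missing = np.isnan(as_float)             # mask of the former None entries
positive = np.flatnonzero(as_float > 0)  # nan > 0 is False, so None drops out

print(as_float)
print(positive)
```

The missing mask is handy if you later need to tell real values apart from the ones that were None.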

Time comparison:

[Figure: perfplot runtime comparison of cdjb, cdjb2 and Bob against len(a)]

So Bob's approach, while not as easy on the eyes, is about twice as fast as the np.vectorize approach, and slightly slower than the float conversion approach.

Code to reproduce:

import perfplot
import numpy as np

f = np.vectorize(lambda t: t and t>0)

choices = list(range(-10,11)) + [None]

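# np.vectorize approach: one Python call per element, None is falsy
def cdjb(arr):
    return np.where(f(arr))

# float conversion: None becomes nan, and nan > 0 is False
def cdjb2(arr):
    return np.where(arr.astype(float) > 0)

# replace None with 0 on a copy, then compare
def Bob(arr):
    deep_copy = np.copy(arr)
    deep_copy[deep_copy == None] = 0
    return (deep_copy > 0).nonzero()[0]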

perfplot.show(
    setup=lambda n: np.random.choice(choices, size=n),
    n_range=[2**k for k in range(25)],
    kernels=[
        cdjb, cdjb2, Bob
        ],
    xlabel='len(a)',
    )
Bob

To conclude, with the help of @CDJB and @DeepSpace, the best solution I found is to replace the None values with a value suitable for the specific operation (0 here, since 0 > 0 is False). I also work on a copy of the array so the original data is not modified.

import numpy as np
a = np.array([1, None, 2, None])
deep_copy = np.copy(a)
deep_copy[deep_copy == None] = 0
result = (deep_copy > 0).nonzero()[0]
print(result)
[0 2]
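If this is needed in several places, the same idea can be wrapped in a small function. This is only a sketch: positive_indices and its fill parameter are hypothetical names, and it uses the three-argument form of np.where, which returns a new array, in place of the explicit copy.

```python
import numpy as np

def positive_indices(arr, fill=0):
    """Indices of entries > 0, treating None as `fill` (hypothetical helper)."""
    # `arr == None` compares elementwise on an object array; `is None` would not.
    # np.where(cond, x, y) builds a new array, so `arr` itself is not modified.
    filled = np.where(arr == None, fill, arr)  # noqa: E711
    return np.flatnonzero(filled > 0)

a = np.array([1, None, 2, None])
result = positive_indices(a)
print(result)
```

Pick fill so that it fails the comparison you are running, for example 0 for "> 0".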