How to count distance to the previous zero in pandas series?

I have the following pandas series (represented as a list):

[7,2,0,3,4,2,5,0,3,4]

I would like to define a new series that gives, for each element, the distance to the previous zero (elements before the first zero count from the start of the series). That is, I would like the following output:

[1,2,0,1,2,3,4,0,1,2]

How can I do this in pandas in the most efficient way?

There are 8 answers

0
Ami Tavory On

It's sometimes surprising to see how simple it is to get C-like speeds for this stuff using Cython. Assuming your column's .values gives a 1-D integer array arr, then:

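# arr is assumed to be a 1-D int64 array (e.g. s.values), so the views are 1-D
cdef long long[:] arr_view = arr
ret = np.zeros_like(arr)
cdef long long[:] ret_view = ret

cdef Py_ssize_t i
cdef long long zero_count = 0
for i in range(arr_view.shape[0]):
    # reset the counter at a zero, otherwise extend the current run
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count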

Note the use of typed memory views, which are extremely fast. You can speed it up further by putting this loop in a function decorated with @cython.boundscheck(False).
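For comparison, here is the same counting loop as a plain Python sketch. It is much slower than the Cython version, but it can serve as a reference when checking the faster variants. The function name distance_to_previous_zero is made up for this example.

```python
import numpy as np

def distance_to_previous_zero(values):
    """Count elements since the last zero; the start of the input acts as a zero."""
    out = np.empty(len(values), dtype=np.int64)
    count = 0
    for i, v in enumerate(values):
        # reset at a zero, otherwise extend the current run
        count = 0 if v == 0 else count + 1
        out[i] = count
    return out

result = distance_to_previous_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
print(result.tolist())
```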

2
Alex Riley On

A solution in Pandas is a little bit tricky, but could look like this (s is your Series):

>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64

For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
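To make the steps easier to follow, here is a sketch with the intermediate results named. The variable names are my own, and for the example data the mask y above is the same as s != 0, so the sketch starts from that mask directly. The astype(int) cast is only there so that the grouped cumsum adds integers rather than booleans.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

nonzero = s != 0                                  # True wherever the value is not zero
run_id = (nonzero != nonzero.shift()).cumsum()    # new id every time the mask flips
dist = nonzero.astype(int).groupby(run_id).cumsum()  # running count within each run

print(dist.tolist())
```

Each zero forms its own run and contributes 0, and each run of non-zero values counts up from 1.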

0
behzad.nouri On

The complexity is O(n); what slows it down is running the for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:

>>> izero = np.r_[-1, (ts == 0).to_numpy().nonzero()[0]]  # indices of zeros, with -1 as a sentinel before the start
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
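A different vectorized technique, not the searchsorted approach above, is a running maximum over the positions of the zeros. This is only a sketch and the names are my own: mark each zero with its index and everything else with -1, then let np.maximum.accumulate carry the most recent zero position forward.

```python
import numpy as np

a = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
idx = np.arange(len(a))

# index of each zero, -1 elsewhere (so the start of the array acts as a zero)
last_zero = np.where(a == 0, idx, -1)
# the running maximum carries the most recent zero position forward
last_zero = np.maximum.accumulate(last_zero)

dist = idx - last_zero
print(dist.tolist())
```

This does a single pass over the data without sorting or searching.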
0
dimid On

Another option

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]

def d0(a):
    return np.min(a[a>=0])
    
df.index.to_series().apply(lambda i: d0(i - zeros))

Or using pure numpy

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]

np.min(a, where=a>=0, axis=1, initial=len(df))
0
ali bakhtiari On

Pandas may not be the best tool for this, as the answer by @behzad.nouri suggests, but here is another variation:

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()

0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
Name: X, dtype: int64

Solution 2:

If you write the following code you will get almost everything you need, except that the first row starts from 0 and not 1:

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()

0    0
1    1
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64

This happens because cumcount numbers the rows of each group starting from 0, and the first group does not begin with a zero. To get the desired result, I added a 0 as the first row, calculated everything, and then dropped that extra row at the end to get:

x = pd.DataFrame({'X': [0]})  # sentinel zero, in the same column as the data
df = pd.concat([x, df], ignore_index=True)
df.eq(0).cumsum().groupby('X').cumcount().drop(0).reset_index(drop=True)

0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64
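If you would rather not prepend a row, one possible variation (a sketch with my own variable names) is to add 1 only to the rows that come before the first zero, which are exactly the rows whose group id is 0:

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

group = s.eq(0).cumsum()                # 0 before the first zero, then 1, 2, ...
dist = s.groupby(group).cumcount()      # 0-based position inside each group
dist = dist + group.eq(0).astype(int)   # rows before the first zero count from 1

print(dist.tolist())
```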
0
Bill On

Yet another way to do this is with NumPy's accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.

import numpy as np

# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1

# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)

# Apply recursively over series values
# (dtype=object because ufuncs built with frompyfunc work on object arrays)
x = npf.accumulate(np.r_[0, s.values], dtype=object)[1:]

print(repr(x))
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
0
rhug123 On

Here is a way without using groupby:

((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())

Output:

[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
0
Partha Mandal On

A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:


import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df

df['flag'] = np.where(df['X'] == 0, 0, 1)   # 1 for non-zero rows, 0 at zeros
df['cumsum'] = df['flag'].cumsum()          # running count of non-zero rows
# keep the running count only at the zeros, then carry it forward
df['offset'] = df['cumsum'].where(df['flag'] == 0).ffill().fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']   # distance to the previous zero

df