How to convert and calculate with dates stored as strings in an array?


I have the following dataset:

df = pd.DataFrame([
['B2', 'G2',[1291593600000000000, 1394755200000000000, 1397347200000000000,
 1506816000000000000, 1509494400000000000, None]],
['B10', 'G10',[1291593600000000000, 1394755200000000000, 1460505600000000000,
 1506816000000000000]], 
['B14', 'G14',[1291593600000000000, 1394755200000000000, 1460505600000000000,
 1506816000000000000]]], 
columns= ['Baum2', 'Baum7', 'value_pair'])

The values in value_pair are dates in unix time.

What I want to do: I want to check if the difference between two specific dates in each row (let's say the third minus the second entry in each array) is more than 70 days. If that is true I want to delete the row.

I want to apply the same operation to every row (which I grouped beforehand) of the column value_pair.

The Problem:

As far as I know, I can't do the arithmetic on the Unix timestamps first and then convert the result to my desired format with pd.to_datetime(). Subtracting works, but the conversion fails with: <class 'numpy.ndarray'> is not convertible to datetime

Second approach:

Before subtracting the dates from each other, I put them in my desired format beforehand:

#df['value_pair'] = pd.to_datetime(df['value_pair'])
#df['value_pair'] = df['value_pair'].dt.strftime('%Y-%m-%d')

The Problem:

The following line now raises this error: TypeError: unsupported operand type(s) for -: 'numpy.str_' and 'numpy.str_'

erg1 = df['value_pair'][0][2]-df['value_pair'][0][1]

That makes sense, because strings can't be subtracted from each other like that.

Right here, I am out of ideas. Does anyone know a different approach to this problem?

My code:

import pandas as pd

df = pd.DataFrame([
['B2', 'G2',[1291593600000000000, 1394755200000000000, 1397347200000000000,
 1506816000000000000, 1509494400000000000, None]],
['B10', 'G10',[1291593600000000000, 1394755200000000000, 1460505600000000000,
 1506816000000000000]], 
['B14', 'G14',[1291593600000000000, 1394755200000000000, 1460505600000000000,
 1506816000000000000]]], columns= ['Baum2', 'Baum7', 'value_pair'])

df['value_pair'] = pd.to_datetime(df['value_pair'])
df['value_pair'] = df['value_pair'].dt.strftime('%Y-%m-%d')

erg1 = df['value_pair'][0][2]-df['value_pair'][0][1]

print(df)

There are 3 answers

Suraj Shourie

First, see the question "Convert unix time to readable date in pandas dataframe". The solution used there is pd.to_datetime(df['date'], unit='s').
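As a small illustration of the unit parameter, the sketch below converts the same instant once from seconds and once from nanoseconds. It assumes the value_pair entries are nanoseconds since the epoch, which their 19-digit length suggests; the variable names are only illustrative.

```python
import pandas as pd

seconds = 1291593600                # Unix time in seconds
nanoseconds = 1291593600000000000   # first value_pair entry, assumed to be nanoseconds

# Tell pandas explicitly which unit each integer is expressed in.
ts_from_s = pd.to_datetime(seconds, unit="s")
ts_from_ns = pd.to_datetime(nanoseconds, unit="ns")

print(ts_from_s)                # 2010-12-06 00:00:00
print(ts_from_ns)               # 2010-12-06 00:00:00
print(ts_from_s == ts_from_ns)  # True
```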

You can do the same using datetime

import datetime
datetime.datetime.fromtimestamp(129159360)

Output: datetime.datetime(1974, 2, 3, 16, 36)

But this will not work directly for your value_pair entries, because they have many more trailing zeros than a timestamp in seconds.

For example, your first value_pair entry is 1291593600000000000. Depending on how many trailing zeros you keep, you get different dates:

print(datetime.datetime.fromtimestamp(12915936))
print(datetime.datetime.fromtimestamp(129159360))
print(datetime.datetime.fromtimestamp(1291593600))
print(datetime.datetime.fromtimestamp(12915936000))

Output (fromtimestamp returns local time, so the exact values depend on your time zone):

1970-05-30 07:45:36
1974-02-03 16:36:00
2010-12-05 19:00:00
2379-04-16 20:00:00

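The full 19-digit value raises an error in fromtimestamp, because the resulting year is out of range. The entries look like nanoseconds since the epoch rather than seconds, so rescale them first (for example, integer-divide by 10**9). Then you can convert the timestamps and compute the time difference.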
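One way to do that clean-up, assuming the entries are nanoseconds since the epoch, is to integer-divide by 10**9 before calling fromtimestamp. The sketch below uses only the standard library; from_ns and NS_PER_SECOND are hypothetical names, and tz=timezone.utc is passed so that the result does not depend on the local time zone.

```python
from datetime import datetime, timezone

NS_PER_SECOND = 10**9  # assumption: the stored integers are nanoseconds


def from_ns(ns):
    """Convert nanoseconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ns // NS_PER_SECOND, tz=timezone.utc)


# Second and third entries of the first row in the question's data.
second_entry = from_ns(1394755200000000000)
third_entry = from_ns(1397347200000000000)

delta = third_entry - second_entry  # a datetime.timedelta
print(second_entry.date(), third_entry.date(), delta.days)
# 2014-03-14 2014-04-13 30
```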

carraro

First, convert the Unix timestamps in value_pair to datetime format. Then calculate the difference between the third and the second entry of each list in the value_pair column. Finally, keep only the rows whose difference meets your condition (here, at most 70 days):

import pandas as pd

df = pd.DataFrame([
    ['B2', 'G2',[1291593600000000000, 1394755200000000000, 1397347200000000000,
      1506816000000000000, 1509494400000000000, None]],
    ['B10', 'G10',[1291593600000000000, 1394755200000000000, 1460505600000000000,
      1506816000000000000]], 
    ['B14', 'G14',[1291593600000000000, 1394755200000000000, 1460505600000000000,
      1506816000000000000]]], columns= ['Baum2', 'Baum7', 'value_pair'])

def convert_to_datetime(value_list):
    value_list = [pd.NaT if v is None else pd.to_datetime(v) for v in value_list]
    return value_list

df['value_pair'] = df['value_pair'].apply(convert_to_datetime)

def calc_diff(value_list):
    # third entry minus second entry; NaT if the list is too short
    try:
        return value_list[2] - value_list[1]
    except IndexError:
        return pd.NaT

df['diff'] = df['value_pair'].apply(calc_diff)

df = df[df['diff'].dt.days <= 70]

def format_datetime(value_list):
    return [v.strftime('%Y-%m-%d %H:%M:%S') if pd.notna(v) else None for v in value_list]

df['value_pair'] = df['value_pair'].apply(format_datetime)

print(df)

Edit: I added a function to format the datetimes.
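As an alternative sketch, the comparison can also be done on the raw integers, without converting anything to datetime. This assumes the entries are nanoseconds since the epoch and that rows with a missing or None entry should be dropped, as in the code above. within_gap and MAX_GAP_NS are hypothetical names, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["B2", "G2", [1291593600000000000, 1394755200000000000, 1397347200000000000,
                      1506816000000000000, 1509494400000000000, None]],
        ["B10", "G10", [1291593600000000000, 1394755200000000000, 1460505600000000000,
                        1506816000000000000]],
        ["B14", "G14", [1291593600000000000, 1394755200000000000, 1460505600000000000,
                        1506816000000000000]],
    ],
    columns=["Baum2", "Baum7", "value_pair"],
)

MAX_GAP_NS = 70 * 24 * 60 * 60 * 10**9  # 70 days expressed in nanoseconds


def within_gap(values, later=2, earlier=1, max_gap_ns=MAX_GAP_NS):
    """Return True if values[later] - values[earlier] is at most max_gap_ns."""
    if len(values) <= max(later, earlier):
        return False  # list too short to contain both entries
    a, b = values[later], values[earlier]
    if a is None or b is None:
        return False  # missing timestamp
    return a - b <= max_gap_ns


# One boolean per row; rows with a gap of more than 70 days are dropped.
mask = [within_gap(values) for values in df["value_pair"]]
out = df.loc[mask]
print(out[["Baum2", "Baum7"]])
```

Comparing in nanoseconds avoids the rounding that comes from looking only at whole days, at the cost of leaving the stored values unformatted.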

Timeless

A possible solution:

L, R = 3, 2 # 3rd and 2nd timestamps

def diff(lst, gap=70):
    try:
        # index inside the try so that lists that are too short are caught
        lts, rts = lst[L-1], lst[R-1]
        return (lts - rts).days < gap if (lts and rts) else False
    except IndexError:
        return False

# is the difference in days less than the gap (e.g. 70)?
m = [diff(list(map(lambda x: pd.to_datetime(x), lst))) for lst in df["value_pair"]]
# [True, False, False] with intermediates: [30, 761, 761]

out = df.loc[m]

Output:

Baum2 Baum7 value_pair
0 B2 G2 [1291593600000000000, 1394755200000000000, 1397347200000000000, 1506816000000000000, 1509494400000000000, None]