How to align two unequal sized timeseries numpy array?

6.3k views Asked by At

I have two numpy arrays containing timeseries (unix timestamps).
I want to find pairs of timestamps (1 from each array) whose difference is within a threshold.

For achieving this, I need to align two of the time series data into two arrays, such that each index has its closest pair. (In case of two timestamps in arrays equally close to another timestamp in another array, I don't mind choosing either one, as the count of pairs is more important than the actual values.)

So the aligned data set will have two arrays of same size, plus a smaller array being filled with empty data .

I was thinking of using timeseries package and the align function.
But am not sure how to use aligned for my data which is a timeseries.

Example consider two timeseries arrays:

ts1=np.array([ 1311242821.0, 1311242882.0, 1311244025.0, 1311244145.0, 1311251330.0, 
               1311282555.0, 1311282614.0])
ts2=np.array([ 1311226761.0, 1311227001.0, 1311257033.0, 1311257094.0, 1311281265.0])

Output sample:

Now for ts2[2] (1311257033.0), its closest pair should be ts1[4] (1311251330.0) because the difference is 5703.0, which is within the threshold, and it is the smallest. Now that ts2[2] and ts1[4] are already paired they should be left out of other calculations.

Such Pairs should be found, so the Output array might be longer than the actual arrays

abs(ts1[0]-ts2[0]) = 16060
abs(ts1[0]-ts2[1]) = 15820 //pair
abs(ts1[0]-ts2[2]) = 14212
abs(ts1[0]-ts2[3]) = 14273
abs(ts1[0]-ts2[4]) = 38444


abs(ts1[1]-ts2[0]) = 16121
abs(ts1[1]-ts2[1]) = 15881
abs(ts1[1]-ts2[2]) = 14151
abs(ts1[1]-ts2[3]) = 14212
abs(ts1[1]-ts2[4]) = 38383


abs(ts1[2]-ts2[0]) = 17264
abs(ts1[2]-ts2[1]) = 17024
abs(ts1[2]-ts2[2]) = 13008
abs(ts1[2]-ts2[3]) = 13069
abs(ts1[2]-ts2[4]) = 37240


abs(ts1[3]-ts2[0]) = 17384
abs(ts1[3]-ts2[1]) = 17144
abs(ts1[3]-ts2[2]) = 12888
abs(ts1[3]-ts2[3]) = 17144
abs(ts1[3]-ts2[4]) = 37120


abs(ts1[4]-ts2[0]) = 24569
abs(ts1[4]-ts2[1]) = 24329
abs(ts1[4]-ts2[2]) = 5703 //pair
abs(ts1[4]-ts2[3]) = 5764
abs(ts1[4]-ts2[4]) = 29935


abs(ts1[5]-ts2[0]) = 55794
abs(ts1[5]-ts2[1]) = 55554
abs(ts1[5]-ts2[2]) = 25522
abs(ts1[5]-ts2[3]) = 25461
abs(ts1[5]-ts2[4]) = 1290 //pair


abs(ts1[6]-ts2[0]) = 55853
abs(ts1[6]-ts2[1]) = 55613
abs(ts1[6]-ts2[2]) = 25581
abs(ts1[6]-ts2[3]) = 25520
abs(ts1[6]-ts2[4]) = 1349


So the pairs are: (ts1[0],ts2[1]), (ts1[4],ts2[2]), (ts1[5],ts2[4])
The rest of elements should have null as their pair
The final two arrays will be of size 9.

Please let me know if this question is clear.

6

There are 6 answers

0
Schuh On

I don't know what you mean with aligning timestamps. But you can use the time module to represent timestamps as floats or integers. In a first step you can convert any userformat to an array defined by time.struct_time. In a second step you can convert this to seconds form start of the epoche. Then you have integervalues to do calculations with the timestamps.

How to convert user format using time.strptime() is well explained in the docs:

    >>> import time
    >>> t = time.strptime("30 Nov 00", "%d %b %y")
    >>> t
    time.struct_time(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0,
             tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
    >>> time.mktime(t)
    975538800.0
1
stovfl On

Solution using numpy Mask arrays output aligned Timeseries(_ts1, _ts2).
The Result are 3 Pairs and only Pairs with Distance 1 can be used to align the Timeseries therfore Threshold=1.

def compute_diffs(threshold):
    dtype = [('diff', int), ('ts1', int), ('ts2', int), ('threshold', int)]
    diffs = np.empty((ts1.shape[0], ts2.shape[0]), dtype=dtype)
    pairs = np.ma.make_mask_none(diffs.shape)

    for i1, t1 in enumerate(ts1):
        for i2, t2 in enumerate(ts2):
            diffs[i1, i2] = (abs(t1 - t2), i1, i2, abs(i1-i2))

        d1 = diffs[i1][diffs[i1]['threshold'] == threshold]
        if d1.size == 1:
            (diff, y, x, t) = d1[0]
            pairs[y, x] = True
    return diffs, pairs

def align_timeseries(diffs):
    def _sync(ts, ts1, ts2, i1, i2, ii):
        while i1 < i2:
            ts1[ii] = ts[i1]; i1 +=1
            ts2[ii] = DTNULL
            ii += 1
        return ii, i1

    _ts1 = np.array([DTNULL]*9)
    _ts2 = np.copy(_ts1)
    ii = _i1 = _i2 = 0

    for n, (diff, i1, i2, t) in enumerate(np.sort(diffs, order='ts1')):
        ii, _i1 = _sync(ts1, _ts1, _ts2, _i1, i1, ii)
        ii, _i2 = _sync(ts2, _ts2, _ts1, _i2, i2, ii)

        if _i1 == i1:
            _ts1[ii] = ts1[i1]; _i1 += 1
            _ts2[ii] = ts2[i2]; _i2 += 1
            ii += 1

    ii, _i1 = _sync(ts1, _ts1, _ts2, _i1, ts1.size, ii)
    return _ts1, _ts2

main:

diffs, pairs = compute_diffs(threshold=1)
print('diffs[pairs]:{}'.format(diffs[pairs]))
_ts1, _ts2 = align_timeseries(diffs[pairs])
pprint(ts1, ts2, _ts1, _ts2)

Output:

diffs[pairs]:[(15820, 0, 1) ( 5703, 4, 2) ( 1290, 5, 4)]
           ts1                  ts2                    _ts1          diff          _ts2
0: 2011-07-21 12:07:01  2011-07-21 07:39:21     ---- -- -- -- -- --  ----   2011-07-21 07:39:21
1: 2011-07-21 12:08:02  2011-07-21 07:43:21     2011-07-21 12:07:01 15820   2011-07-21 07:43:21
2: 2011-07-21 12:27:05  2011-07-21 16:03:53     2011-07-21 12:08:02  ----   ---- -- -- -- -- --
3: 2011-07-21 12:29:05  2011-07-21 16:04:54     2011-07-21 12:27:05  ----   ---- -- -- -- -- --
4: 2011-07-21 14:28:50  2011-07-21 22:47:45     2011-07-21 12:29:05  ----   ---- -- -- -- -- --
5: 2011-07-21 23:09:15  ---- -- -- -- -- --     2011-07-21 14:28:50  5703   2011-07-21 16:03:53
6: 2011-07-21 23:10:14  ---- -- -- -- -- --     ---- -- -- -- -- --  ----   2011-07-21 16:04:54
7: ---- -- -- -- -- --  ---- -- -- -- -- --     2011-07-21 23:09:15  1290   2011-07-21 22:47:45
8: ---- -- -- -- -- --  ---- -- -- -- -- --     2011-07-21 23:10:14  ----   ---- -- -- -- -- --

Tested with Python: 3.4.2

0
chapelo On

You have two sorted lists of timestamps, and you need to merge them into one, keeping the elements of each list separate from each other, calculating the difference when there is a switch or change of list.

My first solution is without using numpy, and consists in 1) adding to each element the id of the list it belongs to, 2) sort by timestamp, 3) group by list id, 4) construct the new list separating each element and calculating the difference when needed:

import numpy as np
from itertools import groupby
from operator import itemgetter

ts1 = np.array([1311242821.0, 1311242882.0, 1311244025.0, 1311244145.0, 1311251330.0, 1311282555.0, 1311282614.0])               
ts2 = np.array([1311226761.0, 1311227001.0, 1311257033.0, 1311257094.0, 1311281265.0])

def without_numpy():
    # 1) Add the list id to each element
    all_ts = [(_, 0) for _ in ts1] + [(_, 1) for _ in ts2] 
    merged_ts = [[], [], []]
    # 2) Sort by timestamp and 3) Group by list id
    groups = groupby(sorted(all_ts), key=itemgetter(1))
    # 4) Construct the new list
    diff = False
    for key, g in groups:
        group = list(g)
        ### See Note    
        for ts, k in group:
            if diff:
                merged_ts[key] = merged_ts[key][:-1]
                merged_ts[2][-1] = abs(end - ts)
                diff = False
            else:
                merged_ts[not key].append(None)
                merged_ts[2].append(None)
            merged_ts[key].append(ts)
        end = ts
        diff = True
    return merged_ts

Using numpy, the procedure is a little different, and consists in 1) adding to each element the id of the list it belongs to and some helper indices, 2) sort by timestamp, 3) flag each switch or change of list, 4) do a sum scan of the previous flags, 5) calculate the proper index of each element in the merged list:

import numpy as np

ts1 = np.array([1311242821.0, 1311242882.0, 1311244025.0, 1311244145.0, 1311251330.0, 1311282555.0, 1311282614.0])
ts2 = np.array([1311226761.0, 1311227001.0, 1311257033.0, 1311257094.0, 1311281265.0])

    def with_numpy(): 
        dt = np.dtype([('ts', np.float), ('key', np.int), ('idx', np.int)])

        all_ts = np.sort(
                np.array(
                    [(_, 0, 1, 0) for _ in ts1] + [(_, 1, 1, 0) for _ in ts2], 
                    dtype=np.dtype([('ts', np.float), 
                                    ('key', np.int),    # list id
                                    ('index', np.int),  # index in result list
                                    ('single', np.int), # flag groups with only one element
                                   ])
                ), 
                order='ts'
            )


        #### See NOTE

        sh_dn = np.roll(all_ts, 1)

        all_ts['index'] = np.add.accumulate(all_ts['index']) - np.cumsum(
                            np.not_equal(all_ts['key'], sh_dn['key']))

        merged_ts = np.full(shape=(3, all_ts['index'][-1]+1), fill_value=np.nan)
        merged_ts[all_ts['key'], all_ts['index']] = all_ts['ts']
        merged_ts[2] = np.abs(merged_ts[0] - merged_ts[1])
        merged_ts = np.delete(merged_ts, -1, axis=1)
        merged_ts = np.transpose(merged_ts)

        return merged_ts

Both functions, with or without numpy produce the same result. The printing and formatting could be done as needed. Which function is better depends on the data you have at hand.

NOTE: In the case that there is a switch to the other list and after only one value there is a switch back to the previous list, the functions as they are above will only keep the last difference, perhaps losing a smaller difference. In this case you could insert the following sections in the place where the "#### See Note" is:

For the without_numpy function, insert:

if len(group) == 1:
    group.append(group[0])

For the with_numpy function, insert:

sh_dn = np.roll(all_ts, 1)
sh_up = np.roll(all_ts, -1)
all_ts['single'] = np.logical_and( np.not_equal(all_ts['key'], sh_dn['key']), 
                                           np.equal(sh_dn['key'], sh_up['key']))
singles = np.where(all_ts['single']==1)[0]
all_ts = np.insert(all_ts, singles, all_ts[singles])
1
stefan On

I'm not sure if I got your question right. If so and assuming that your data is already sorted you can do that in one pass when using iterators. simply adapt the example to your needs.


EDIT: @jointful found a bug - when catching the StopIteration one value may be "processed", it is neither output nor left in the iterators. Here is the fixed ode.

left = iter(range(15, 60, 3))
right = iter(range(0, 50, 5))

try:
    i = next(left)
    j = next(right)
    while True:
        if abs(i-j) < 1:
            print("pair", i, j)
            i = j = None
            i = next(left)
            j = next(right)
        elif i <= j:
            print("left", i, None)
            i = None
            i = next(left)
        else:
            print("right", None, j)
            j = None
            j = next(right)
except StopIteration:
    pass
# one iterator might have succeeded
if i is not None:
    print("left", i, None)
if j is not None:
    print("right", None, j)
# one of the iterators may have leftover elements
for i in left:
    print("left", i, None)
for j in right:
    print("right", None, j)

        

prints

right None 0
right None 5
right None 10
pair 15 15
left 18 None
right None 20
left 21 None
left 24 None
right None 25
left 27 None
pair 30 30
left 33 None
right None 35
left 36 None
left 39 None
right None 40
left 42 None
pair 45 45
left 48 None
left 51 None
left 54 None
left 57 None                                                                                                                                                                      
0
entrophy On

Apart from the small mistakes in the question, I could guess what the problem was really about.

What you are looking at is a classic example of the Assignment Problem. Scipy provides you with an implementation of Hungarian algorithm, check the doc here. It doesnt have to be timestamps, can be any number (integer or float).

The below snippet will work with 2 numpy arrays of different sizes along with a threshold to give you either the cost array (filtered by threshold) or the index pairs corresponding to the 2 numpy arrays (again, pairs who's cost is filtered by the threshold).

The comments will walk you through, using the given timestamp arrays as example.

import numpy as np
from scipy.optimize import linear_sum_assignment


def closest_pairs(inp1, inp2, threshold=np.inf):
    cost = np.zeros((inp1.shape[0], inp2.shape[0]), dtype=np.int64)

    for x in range(ts1.shape[0]):
        for y in range(ts2.shape[0]):
            cost[x][y] = abs(ts1[x] - ts2[y])

    print(cost)
    # cost for the above example:
    # [[16060 15820 14212 14273 38444]
    # [16121 15881 14151 14212 38383]
    # [17264 17024 13008 13069 37240]
    # [17384 17144 12888 12949 37120]
    # [24569 24329  5703  5764 29935]
    # [55794 55554 25522 25461  1290]
    # [55853 55613 25581 25520  1349]]

    # hungarian algorithm implementation provided by scipy
    row_ind, col_ind = linear_sum_assignment(cost)
    # row_ind = [0 1 3 4 5], col_ind = [1 0 3 2 4] 
    # where (ts1[5] - ts2[4]) = 1290

    # if you want the distances only
    out = [item 
           for item in cost[row_ind, col_ind] 
           if item < threshold]

    # if you want the pair of indices filtered by the threshold
    pairs = [(row, col) 
             for row, col in zip(row_ind, col_ind) 
             if cost[row, col] < threshold]

    return out, pairs


if __name__ == '__main__':
    out, pairs = closest_pairs(ts1, ts2, 6000)
    print(out, pairs)
    # out = [5703, 1290] 
    # pairs = [(4, 2), (5, 4)]

    out, pairs = closest_pairs(ts1, ts2)
    print(out, pairs)
    # out = [15820, 16121, 12949, 5703, 1290] 
    # pairs = [(0, 1), (1, 0), (3, 3), (4, 2), (5, 4)]
2
Nuageux On

In order to have your time series pairs, I'll advise you to first compute your pairs of indices (get_pairs). And then to compute your pair of time series (get_tspairs).

In get_pairs, I first compute a matrix m which represents the difference of each points between the two time series. So the shape's matrix is (len(ts1), len(ts2)). Then I choose the smallest distance among all. In order not to choose several times the same index, I set to np.inf the distance for the chosen indices. I continue this process till we can not choose more tuple of indices. If the minimal distance is above the threshold, the process is broken.

Once I got my pairs of indices, I call get_tspairs in order to generate pairs of time series. The first step here is to join the time series with the selected tuple of indices, then to add the indices that were not chosen and associate them with None (equivalent of NULL in Python).

Which gives:

import numpy as np
import operator

ts1=np.array([ 1311242821.0, 1311242882.0, 1311244025.0, 1311244145.0, 1311251330.0, 
               1311282555.0, 1311282614.0])
ts2=np.array([ 1311226761.0, 1311227001.0, 1311257033.0, 1311257094.0, 1311281265.0])

def get_pairs(ts1, ts2, threshold=np.inf):
    m = np.abs(np.subtract.outer(ts1, ts2))
    indices = []
    while np.ma.masked_invalid(m).sum() != 'masked':
        ind = np.unravel_index(np.argmin(m), m.shape)
        if m[ind] < threshold:
            indices.append(ind)
            m[:,ind[1]] = np.inf
            m[ind[0],:] = np.inf
        else:
            m= np.inf
    return indices

def get_tspairs(pairs, ts1, ts2):
    ts_pairs = [(ts1[p[0]], ts2[p[1]]) for p in pairs]
    
    # We separate the selected indices from ts1 and ts2, then sort them
    ind_ts1 = sorted(map(operator.itemgetter(0), pairs))
    ind_ts2 = sorted(map(operator.itemgetter(1), pairs))
    
    # We only keep the non-selected indices
    l1 = np.delete(np.arange(len(ts1), dtype=np.int64), ind_ts1)
    l2 = np.delete(np.arange(len(ts2), dtype=np.int64), ind_ts2)
    
    ts_pairs.extend([(ts1[i], None) for i in l1])
    ts_pairs.extend([(ts2[i], None) for i in l2])
    
    return ts_pairs
    
if __name__ == '__main__':
    pairs = get_pairs(ts1, ts2)
    print(pairs)
    # [(5, 4), (4, 2), (3, 3), (0, 1), (1, 0)]
    
    ts_pairs = get_tspairs(pairs, ts1, ts2)
    print(ts_pairs)
    # [(1311282555.0, 1311281265.0), (1311251330.0, 1311257033.0), (1311244145.0, 1311257094.0), (1311242821.0, 1311227001.0), (1311242882.0, 1311226761.0), (1311244025.0, None), (1311282614.0, None), (1311257033.0, None)]