Sort pairs to be more consecutive

111 views Asked by At

I've got a set of paired items: [(a0,b0),(a1,b1),(a2,b2),...(aN,bN)]. These pairs form a set of runs. For instance, it may be the case that b4=a1 and b1=a6, so 4,1,6 is a run. All as, and all bs, are unique, so runs are never ambiguous. Note that some runs will be of length 1.

I'd like to permute this set such that runs are consecutive. In the permuted set, for any j, it should be the case either that aj=b(j+1), or that aj is not equal to any b. I don't care about any other aspect of the permutation, and the runs can appear in any order relative to each other.

For instance: the set [(5,2),(0,3),(1,0),(7,4),(6,5),(3,8)] contains the runs 4,0 ((6,5) and (5,2)), 2,1,5 ((1,0),(0,3), and (3,8)), and 3. So a valid permutation would be [(6,5),(5,2),(1,0),(0,3),(3,8),(7,4)].

Clearly there's a straightforward O(n log n) time and O(n) space solution:

  1. Sort the array of pairs by b.
  2. For each pair's a, perform a binary search over the array for a matching b. Record all as which do not match any b; these will be the run beginnings.
  3. Sort the array of pairs by a.
  4. For each run-beginning a, walk over that run, first by binary searching for the pair matching the a, then by binary searching for that pair's b among the as, and so on, until the run ends.

Asymptotically this is probably as good as we can do without hashing. From a practical standpoint, though, I'm not entirely happy. It involves two different sorts and 2*N binary searches. It's pretty darn branchy. This code is in an extreme inner loop (yes I've profiled) and will have a major impact on the performance of the system as a whole.

Do any alternative approaches spring to mind?

3

There are 3 answers

2
David Eisenstat On BEST ANSWER

We can do away with the binary searches at least. Decorate each pair with a pointer to a linked list node (letters in this example) and sort by second number and separately by first number.

[(5,2,a),(0,3,b),(1,0,c),(7,4,d),(6,5,e),(3,8,f)]  # unsorted
[(1,0,c),(5,2,a),(0,3,b),(7,4,d),(6,5,e),(3,8,f)]  # second number
[(0,3,b),(1,0,c),(3,8,f),(5,2,a),(6,5,e),(7,4,d)]  # first number

Now do a modified sorted merge.

(1,0,c) (0,3,b): case =; set c->next = b
(5,2,a) (1,0,c): case >; c is a list head
(5,2,a) (3,8,f): case <
(0,3,b) (3,8,f): case =; set b->next = f
(7,4,d) (5,2,a): case <
(6,5,e) (5,2,a): case =; set e->next = a
(3,8,f) (6,5,e): case >; e is a list head
(3,8,f) (7,4,d): case >; d is a list head

The linked list structure looks like

c->b->f
d
e->a,

which we can staple together.

0
AudioBubble On

If the indexes are taken from a limited set, there is an O(N) solution.

Using all the pairs, fill an array using Ai as the index and storing the corresponding Bi.

Then scan the array and just follow the runs.

To make sure that you visit every chain once and find the starts of the runs, you will add two flags telling 1) that the entry is or not the end of a pair, and 2) that the entry has already been consumed or not.

In the given case, the array is filled with

0:*3
1: 0
2:*-
3:*8
4:*-
5:*2
6: 5
7: 4
8:*-

The asterisk indicates that this entry cannot start a run.

The following runs are found: 1-0-3-8, 6-5-2, 7-4.

If the index range is too large to allow for an array, then a hash table is a good alternative.

0
Zim-Zam O'Pootertoot On

You can improve the average runtime without using too much extra space by using a hash table: add each tuple to a hash table using b as the key, then iterate through each tuple and see if its a matches a b that has been indexed in the table. This is pretty much identical to your algorithm, except that indexing the bs runs in expected linear time (rather than worst case n log n for a sort), and the binary search is replaced by an expcted constant time lookup.

I'm assuming that the reason that you're not using hashing is that the hash function is potentially expensive and hashing collisions will hurt performance. An alternative is to use something like a radix tree instead of a hash table, which will give more consistent results. Given that you've got large keys and not very many of them, you can use a large radix (e.g. 32 bits) to keep the constant factor small.