How to sort a variable-length string array with radix sort?

2.7k views Asked by At

I know that radix sort can sort same-length string arrays, but is it possible to do so with variable-length strings. If it is, what is the C-family code or pseudo-code to implement this?

It might not a be fast algorithm for variable-length strings, but it is easy to implement radix sort, so it's useful if a sort needs to be coded quickly.

2

There are 2 answers

1
caustik On BEST ANSWER

I'm not quite sure what you mean by "variable-length strings" but you can perform a binary MSB radix sort in-place so the length of the string doesn't matter since there are no intermediate buckets.

#include <stdio.h>
#include <algorithm>

static void display(char *str, int *data, int size)
{
    printf("%s: ", str);

    for(int v=0;v<size;v++) {
        printf("%d ", data[v]);
    }

    printf("\n");
}

static void sort(int *data, int size, int bit)
{
    if (bit == 0)
        return;

    int b = 0;
    int e = size;

    if (size > 0) {
        while (b != e) {
            if (data[b] & (1 << bit)) {
                std::swap(data[b], data[--e]);
            }
            else {
                b++;
            }
        }

        sort(data, e, bit - 1);
        sort(data + b, size - b, bit - 1);
    }
}

int main()
{
    int data[] = { 13, 12, 22, 20, 3, 4, 14, 92, 11 };
    int size = sizeof(data) / sizeof(data[0]);

    display("Before", data, size);
    sort(data, size, sizeof(int)*8 - 1);
    display("After", data, size);
}

0
Mischa On

You can do a MSB-first radix sort on variable-length strings. There are a couple non-obvious details:

Pass #N will partition (scatter) strings from the input vector into 256 partitions, according to strvec[i][N]. It then will scan the partitions in order, and put (reinsert) strings back into the input vector.

Now the slightly complicated bit...

When you reach the end of a string, it is in its final position, and should never be touched again. That splits the strings before and after it into separate RANGES. The result of each pass is a set of ranges of yet-unsorted rows.

That means that pass #N, after the first, scans the strings in each range, and stores the source range id (index) along with the string, in the partition. In the "reinsert" step, it puts the string back into its source range; and again, it generates a new set of unsorted-row ranges.

You keep the stable-sort bonus of radix sort, if you forward-scan the input ranges and then backward-scan the partitions and reinsert starting at the back of each source range.

You can also use recursion (doing a complete sort from scratch on any subrange) but the above saves on setup and is faster.

There are more details ... quicksort falls through to doing an insertion sort for tiny ranges (e.g. up to 16); radix sort benefits from doing the same. Using multiple bytes as a partition index is possible. One approach for that is in: Radix Sort-Mischa Sandberg-2010 There are other approaches. Sorry I can't post code; it's now proprietary.