Python algorithm to sort large blocks of data


I've been looking online for a method to sort the kind of data I have (LDIF files), but I haven't found quite what I'm after. There are existing programs that accomplish this sorting, but they fail on extremely large data sets. For me, extremely large is around 2 GB worth of these blocks, which exhausts memory when using the ldifsort.pl script even though I have 6 GB of RAM available and several more GB of swap. So I'm hoping to write a program that will store the blocks of data on the hard drive, sort the keys in memory, and then reassemble the blocks in sorted order (a rough sketch of this idea follows the sample data below). I'd like to use Python 3, as I'm trying to learn that language, so if anyone has suggestions on basic strategy or on specific ways to do this with Python 3, I'd really appreciate the help.

I have large text files containing LDAP data, basically in the (much simplified) form of:

dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: [email protected];Subscriber=UniqueName1;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: [email protected];Subscriber=UniqueName2;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

Each subscriber has three more blocks associated with it (the example above shows only one other block per subscriber), and I would like to keep all four blocks together after the sorting is complete.

So if I read the dn's in this order (the data associated with the dn's is hidden for brevity):

dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node

I would want the output to be:

dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
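
To make the plan concrete, here is a rough sketch of the approach I have in mind (untested; it assumes blocks are separated by blank lines and that the dn is the first line of each block):

def build_index(path):
    # Scan once, recording (dn, offset, length) for each blank-line-separated block.
    index = []
    with open(path, 'rb') as f:
        block_start = f.tell()
        dn = None
        while True:
            line = f.readline()
            if dn is None and line.startswith(b'dn:'):
                dn = line.rstrip()
            if line in (b'\n', b''):          # blank line or EOF closes the block
                if dn is not None:
                    index.append((dn, block_start, f.tell() - block_start))
                dn = None
                block_start = f.tell()
                if line == b'':
                    break
    return index

def write_sorted(path, index, out_path):
    # Only the keys are sorted in memory; the blocks stay on disk until copied.
    index.sort()
    with open(path, 'rb') as f, open(out_path, 'wb') as out:
        for _dn, offset, length in index:
            f.seek(offset)
            out.write(f.read(length))

Sorting on the full dn happens to give the order shown above, since ";RestOfTree" sorts before ";Subscriber" for the same user, but real data might need a smarter key.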

One thought I had was to use sqlite3 to store the data as python reads it, then sort the keys in python, then use queries to extract the data again from sqlite and write the data to a file. But I'm afraid the time spent searching for the keys in sqlite will be excessive. Then I thought I could sort the data in sqlite while I'm inserting the data, but it seems that sqlite doesn't support this, and I don't know if there is another database system that does.
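
For reference, the sqlite3 route I was considering would look roughly like this (file names and schema are invented for illustration; sqlite hands the rows back sorted via ORDER BY, but how it behaves at 2 GB is exactly what I don't know):

import sqlite3

def parse_blocks(path):
    # Yield (dn, block_text) pairs; blocks are separated by blank lines.
    with open(path) as f:
        block = []
        for line in f:
            if line.strip():
                block.append(line)
            elif block:
                yield block[0].strip(), ''.join(block)
                block = []
        if block:
            yield block[0].strip(), ''.join(block)

conn = sqlite3.connect('blocks.db')
conn.execute('CREATE TABLE IF NOT EXISTS blocks (dn TEXT, body TEXT)')
conn.executemany('INSERT INTO blocks VALUES (?, ?)', parse_blocks('data.ldif'))
conn.commit()

with open('sorted.ldif', 'w') as out:
    for (body,) in conn.execute('SELECT body FROM blocks ORDER BY dn'):
        out.write(body + '\n')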

Any help or direction is appreciated.

Thanks to Zach for his suggestion of just using GNU sort instead of a database system. Here is the solution I developed with his help.

awk -f ldifformatter.awk LDAP-data-files*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.txt

where ldifformatter.awk replaces all newlines with "|" except those before the top-level dn's, which are used for sorting.

Thanks, Rusty


There are 3 answers

zwol (accepted answer)

The command-line sort utility can sort extremely large text files without reading them entirely into memory (at least, the GNU version can). However, to use it you will have to reformat the data so that each record (everything that should be kept together) appears on one line. If the records looked something like this:

dn: [email protected];RestOfTree=node1|groups: 1|permissions: 1|IsActive: FALSE|Barring: TRUE||dn: [email protected];Subscriber=UniqueName1;RestOfTree=node1|groups: 1|permissions: 1|ServiceProfile: Lemur

then sort -t \| -k1 will do the job.

You can write a program in Python that streams the data into a temporary file in the appropriate format, invokes sort using subprocess.check_call, and then restores the original format. Use tempfile.NamedTemporaryFile to create the temporary file.
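
A rough sketch of that pipeline (the record parsing is up to you; the "|" join matches the format shown above):

import os
import subprocess
import tempfile

def sort_records(records, out_path):
    # records: iterable of multi-line text blocks that must stay intact
    with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
        for block in records:
            tmp.write(block.replace('\n', '|') + '\n')   # one record per line
        tmp_name = tmp.name
    # GNU sort spills to disk on its own, so memory stays bounded
    subprocess.check_call(['sort', '-t', '|', '-k1', '-o', tmp_name, tmp_name])
    with open(tmp_name) as tmp, open(out_path, 'w') as out:
        for line in tmp:
            out.write(line.rstrip('\n').replace('|', '\n') + '\n')
    os.unlink(tmp_name)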

oefe

I wonder whether SQLite is really not up to the task. But in any case, you could use an external sorting algorithm, e.g. merge sort, to keep the memory usage low.

http://en.wikipedia.org/wiki/External_sorting
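
A bare-bones external merge sort in Python looks like this (a sketch assuming one record per line, e.g. after the reformatting described in the accepted answer):

import heapq
import tempfile

def write_run(lines):
    # Write one sorted chunk to a temp file and return its name.
    f = tempfile.NamedTemporaryFile('w', delete=False)
    f.writelines(sorted(lines))
    f.close()
    return f.name

def external_sort(infile, outfile, run_size=100000):
    runs, chunk = [], []
    for line in infile:
        chunk.append(line)
        if len(chunk) >= run_size:      # never hold more than run_size lines
            runs.append(write_run(chunk))
            chunk = []
    if chunk:
        runs.append(write_run(chunk))
    # heapq.merge streams the sorted runs, holding one line per run in memory
    outfile.writelines(heapq.merge(*(open(name) for name in runs)))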

Leonid Shvechikov

You should not try to sort all of your data in memory. You could use an external merge sort instead.

Guido van Rossum wrote an article about the same problem, "Sorting a million 32-bit integers in 2MB of RAM using Python", which includes code examples.