I've been looking online for a method to sort the kind of data I have (LDIF files), but I haven't found quite what I'm after. There are already programs out there to accomplish this sorting, but they fail with extremely large data sets. For me, "extremely large" is around 2 GB worth of these blocks, which exhausts memory when using the ldifsort.pl script even though I have 6 GB of RAM available and several more GB of swap. So I'm hoping to write a program that stores the blocks of data on the hard drive, sorts the keys in memory, and then reassembles the blocks in sorted order. I'd like to use Python 3, as I'm trying to learn that language, so if anyone has suggestions on basic strategy or on specific ways to do this with Python 3, I'd really appreciate the help.
I have large text files containing LDAP data, basically in the (much simplified) form of:
dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur
dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur
Each subscriber has three more blocks associated with it (my example data only shows one other block per subscriber), and I would like to keep all four blocks together after the sorting is complete.
So if I read the dn's in this order (the data associated with the dn's is hidden for brevity):
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node
I would want the output to be:
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
One thought I had was to use sqlite3 to store the data as python reads it, then sort the keys in python, then use queries to extract the data again from sqlite and write the data to a file. But I'm afraid the time spent searching for the keys in sqlite will be excessive. Then I thought I could sort the data in sqlite while I'm inserting the data, but it seems that sqlite doesn't support this, and I don't know if there is another database system that does.
Any help or direction is appreciated.
Thanks to Zach for his suggestion of just using GNU sort instead of a database system. Here is the solution I developed with his help.
awk -f ldifformatter.awk LDAP-data-files*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.txt
where ldifformatter.awk exchanges all newlines with "|" except for the top level dn's, which get used for sorting.
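The awk script itself isn't shown above, but the same reformatting can be sketched in Python. This is a sketch, not Rusty's actual script, and it assumes every line beginning with "dn:" starts a new record and that attribute lines never contain "|":

```python
import sys

def flatten(lines):
    """Yield one "|"-joined line per record, keyed by its leading dn line."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        # A "dn:" line starts a new record; flush the previous one first.
        if line.startswith("dn:") and record:
            yield "|".join(record)
            record = []
        if line:  # skip blank separator lines, if any
            record.append(line)
    if record:
        yield "|".join(record)

if __name__ == "__main__":
    for flat in flatten(sys.stdin):
        print(flat)
```

Piping the output through `sort -t \| -k1` then orders whole records by their dn, and `sed 's/|/\n/g'` restores the original line breaks.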
Thanks, Rusty
The command-line sort utility can sort extremely large text files without reading them entirely into memory (at least, the GNU version can). However, to use it you will have to reformat the data so that each record (everything that should be kept together) appears on one line. If the records looked something like this:

dn: [email protected];RestOfTree=node1|groups: 1|permissions: 1|IsActive: FALSE|Barring: TRUE

then

sort -t \| -k1

will do the job. You can write a program in Python that streams the data into a temporary file in the appropriate format, invokes sort using subprocess.check_call, and then restores the original format. Use tempfile.NamedTemporaryFile to create the temporary file.
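Putting those pieces together, a minimal end-to-end sketch might look like the following. The flattening rule (every "dn:" line starts a record, "|" never appears in the data) and the function name are assumptions, not part of the answer above:

```python
import os
import subprocess
import tempfile

def external_sort(infile, outfile):
    """Sort dn-keyed blocks of `infile` into `outfile` using GNU sort."""
    # Flatten each record onto one "|"-joined line in a temporary file.
    with tempfile.NamedTemporaryFile("w", delete=False) as flat, \
         open(infile) as src:
        record = []
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("dn:") and record:
                flat.write("|".join(record) + "\n")
                record = []
            if line:
                record.append(line)
        if record:
            flat.write("|".join(record) + "\n")
        tmpname = flat.name

    # Let sort do the heavy lifting on disk; the dn is the leading field,
    # so blocks belonging to one subscriber end up adjacent. -o sorts the
    # file in place.
    subprocess.check_call(["sort", "-t", "|", "-k1", "-o", tmpname, tmpname])

    # Restore the original multi-line format.
    with open(tmpname) as sorted_flat, open(outfile, "w") as dst:
        for line in sorted_flat:
            dst.write(line.rstrip("\n").replace("|", "\n") + "\n")
    os.unlink(tmpname)
```

Because sort itself spills to temporary files on disk, memory use stays bounded regardless of input size.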