I've been looking online for a method to sort the kind of data I have (LDIF files), but I haven't found quite what I'm after. There are already programs out there to accomplish this sorting, but they fail with extremely large data sets. For me, "extremely large" is around 2 GB worth of these blocks, which exhausts memory when using the ldifsort.pl script even though I have 6 GB of RAM available and several more GB of swap. So I'm hoping to write a program that stores the blocks of data on the hard drive, sorts the keys in memory, and then reassembles the blocks in sorted order. I'd like to use Python 3, as I'm trying to learn that language, so if anyone has suggestions on basic strategy or on specific ways to do this with Python 3, I'd really appreciate the help.
I have large text files containing LDAP data, basically in the (much simplified) form of:
dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur
dn: [email protected];RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur
Each subscriber has three more blocks associated with it (my example data only shows one other block per subscriber), and I would like to keep all four blocks together after the sorting is complete.
So if I read the dn's in this order (the data associated with the dn's is hidden for brevity):
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node
I would want the output to be:
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName1;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName2;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName3;RestOfTree=node
dn: [email protected];RestOfTree=node
dn: [email protected];Subscriber=UniqueName4;RestOfTree=node
One thought I had was to use sqlite3 to store the data as python reads it, then sort the keys in python, then use queries to extract the data again from sqlite and write the data to a file. But I'm afraid the time spent searching for the keys in sqlite will be excessive. Then I thought I could sort the data in sqlite while I'm inserting the data, but it seems that sqlite doesn't support this, and I don't know if there is another database system that does.
Any help or direction is appreciated.
Thanks to Zach for his suggestion of just using GNU sort instead of a database system. Here is the solution I developed with his help.
awk -f ldifformatter.awk LDAP-data-files*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.txt
where ldifformatter.awk exchanges all newlines with "|" except for the top level dn's, which get used for sorting.
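The awk script itself isn't shown above, but the same reformatting can be sketched in Python. This is a sketch, not Rusty's actual script, and it assumes every line beginning with "dn:" starts a new record and that attribute lines never contain "|":

```python
import sys

def flatten(lines):
    """Yield one "|"-joined line per record, keyed by its leading dn line."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        # A "dn:" line starts a new record; flush the previous one first.
        if line.startswith("dn:") and record:
            yield "|".join(record)
            record = []
        if line:  # skip blank separator lines, if any
            record.append(line)
    if record:
        yield "|".join(record)

if __name__ == "__main__":
    for flat in flatten(sys.stdin):
        print(flat)
```

Piping the output through `sort -t \| -k1` then orders whole records by their dn, and `sed 's/|/\n/g'` restores the original line breaks.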
Thanks, Rusty
The command-line sort utility can sort extremely large text files without reading them entirely into memory (at least, the GNU version can). However, to use it you will have to reformat the data so that each record (everything that should be kept together) appears on one line. If the records looked something like this:

dn: [email protected];RestOfTree=node1|groups: 1|permissions: 1|IsActive: FALSE|Barring: TRUE

then

sort -t \| -k1

will do the job. You can write a program in Python that streams the data into a temporary file in the appropriate format, invokes sort using subprocess.check_call, and then restores the original format. Use tempfile.NamedTemporaryFile to create the temporary file.
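Putting those pieces together, a minimal end-to-end sketch might look like the following. The flattening rule (every "dn:" line starts a record, "|" never appears in the data) and the function name are assumptions, not part of the answer above:

```python
import os
import subprocess
import tempfile

def external_sort(infile, outfile):
    """Sort dn-keyed blocks of `infile` into `outfile` using GNU sort."""
    # Flatten each record onto one "|"-joined line in a temporary file.
    with tempfile.NamedTemporaryFile("w", delete=False) as flat, \
         open(infile) as src:
        record = []
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("dn:") and record:
                flat.write("|".join(record) + "\n")
                record = []
            if line:
                record.append(line)
        if record:
            flat.write("|".join(record) + "\n")
        tmpname = flat.name

    # Let sort do the heavy lifting on disk; the dn is the leading field,
    # so blocks belonging to one subscriber end up adjacent. -o sorts the
    # file in place.
    subprocess.check_call(["sort", "-t", "|", "-k1", "-o", tmpname, tmpname])

    # Restore the original multi-line format.
    with open(tmpname) as sorted_flat, open(outfile, "w") as dst:
        for line in sorted_flat:
            dst.write(line.rstrip("\n").replace("|", "\n") + "\n")
    os.unlink(tmpname)
```

Because sort itself spills to temporary files on disk, memory use stays bounded regardless of input size.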