Splitting large files in half

tl;dr: I need a way to split 5 GB / ~11M-row files roughly in half (or thirds) without breaking any lines, while keeping track of exactly which files I create, so I can process the pieces at the same time.

I have a set of 300 very large JSON-like files that I need to parse with a PHP script periodically. Each file is about 5 GB decompressed. I've optimized the hell out of the parsing script and it has reached its speed limit. But it's still a single-threaded script that runs for about 20 hours on a 16-core server.

I'd like to split each file approximately in half and have two parsing scripts run at once, to "fake" multi-threading and speed up the run. I can store global runtime information and "messages" between threads in my SQL database. With one thread downloading the files, another decompressing them, and two more loading them into SQL in parallel, that should cut the total runtime in half.

That part is actually pretty straightforward. Where I'm stuck is splitting up the file to be parsed. I know there is a split tool that can break files into chunks by size or by line count. The problem is that this doesn't quite work for me. I need to split these files in half (or thirds or quarters) cleanly, without any excess data going into an extra file. I need to know exactly which files the split command has created, so I can note each file in my SQL table and the parsing script knows which files are ready to be parsed. If possible, I'd even like to avoid running wc -l in this process. That may not be possible, but it takes about 7 seconds per file, which over 300 files means 35 extra minutes of runtime.

Despite what I just said, I guess I could run wc -l file on my file, divide that by n, round the result up, and use split to break the file into chunks of that many lines. That should always give me exactly n files. Then I know that I'll have filea, fileb and so on.
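The wc -l approach described above can be sketched roughly as follows. This is only an illustration: the temporary directory, the sample.json name, the sample.part. prefix and the 11-line test input are all made up, and -a 1 is used just to get the one-letter suffixes (a, b, ...) mentioned in the question.

```shell
#!/bin/sh
# Sketch: split one file into n line-aligned pieces using wc -l and split -l.
# Paths, prefix and sample data are hypothetical, for illustration only.
set -eu

workdir=$(mktemp -d)
file="$workdir/sample.json"
seq 1 11 > "$file"                 # stand-in for a real input file (11 lines)

n=2                                # number of pieces wanted
lines=$(wc -l < "$file")           # total line count
per=$(( (lines + n - 1) / n ))     # lines / n, rounded up

# -a 1 gives one-letter suffixes: sample.part.a, sample.part.b, ...
split -a 1 -l "$per" "$file" "$workdir/sample.part."

ls "$workdir"
```

For inputs with millions of lines and a small n this yields n pieces. For very small inputs the rounding can produce fewer than n files (for example, 9 lines with n=4 gives three files of 3 lines), so a script that records the file names should probably list what was actually created rather than assume it.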

I guess the question ultimately is: is there a better way to deal with this problem? Maybe there's another utility that will split in a way that's more compatible with what I'm doing. Or maybe there's another approach entirely that I'm overlooking.

1 Answer

user2203703

I had the same problem and it wasn't easy to find a solution.

First, use jq to convert your JSON to line-oriented text, so that split can cut it at line boundaries.

Then use the GNU version of split. It has an extra --filter option that pipes each chunk to a shell command instead of writing it to a file, so the chunks take no extra disk space and no temporary files are created:

split --filter='shell_command'

Your filter command should read from stdin:

jq -c '.' file.json | split -l 10000 --filter='php process.php'

-l 10000 tells split to work on 10000 lines at a time; the filter command is run once for each chunk.
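To see how the chunks reach the filter, here is a small sketch that assumes GNU split is installed. The 25 sample lines and the line-counting filter are stand-ins for the real data and for php process.php. GNU split also sets $FILE for the filter to the name the chunk file would have had.

```shell
#!/bin/sh
# Sketch: GNU split pipes each chunk to the filter's stdin instead of writing chunk files.
# The filter here only counts lines; a real filter would be the parsing script.
set -eu

# 25 sample lines in chunks of 10 lines, so the filter runs three times.
# The filter is single-quoted so that the shell started by split, not this one,
# expands $FILE (the name the chunk file would have had: xaa, xab, ...).
result=$(seq 1 25 | split -l 10 --filter='echo "$FILE $(wc -l)"')

echo "$result"
```

Each output line pairs a chunk name with the number of lines that chunk's filter run received on stdin.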

In process.php, you just need to read from stdin and do whatever processing you want.
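If the goal is still a fixed number of real files, as in the question, GNU split also has a -n l/N mode that might be worth trying: it cuts the input into N chunks without splitting lines, balancing the chunks by size in bytes rather than by line count, so no prior wc -l is needed. A minimal sketch, again assuming GNU split and using made-up paths and sample data:

```shell
#!/bin/sh
# Sketch: split a file into exactly 2 line-aligned chunks with GNU split -n l/N.
# Paths, prefix and sample data are hypothetical, for illustration only.
set -eu

workdir=$(mktemp -d)
file="$workdir/sample.json"
seq 1 11 > "$file"                 # stand-in for a real input file

# l/2 = two chunks, cut only at line boundaries; chunk sizes are balanced by bytes.
split -a 1 -n l/2 "$file" "$workdir/sample.part."

ls "$workdir"

# Sanity check: the pieces, concatenated in order, reproduce the original file.
cat "$workdir"/sample.part.? | cmp - "$file"
```

Because the number of chunks is given up front, the output names (sample.part.a, sample.part.b here) are known before split runs, which should make it easier to record them in a table.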