tl;dr: I need a way of splitting 5 GB / ~11M-row files roughly in half (or in thirds) while keeping track of exactly which files I create, without breaking any lines, so I can process the pieces at once.
I have a set of 300 very large json-like files I need to parse with a PHP script periodically. Each file is about 5 GB decompressed. I've optimized the hell out of the parsing script and it has reached its speed limit, but it's still a single-threaded script running for about 20 hours on a 16-core server.
I'd like to split each file approximately in half and run two parsing scripts at once, to "fake" multithreading and speed up the run time. I can store global runtime information and "messages" between processes in my SQL database. That should cut the total runtime roughly in half: one process downloading the files, another decompressing them, and two more loading them into SQL in parallel.
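The parallel part of that plan can be sketched with plain shell job control. Here `worker` is a hypothetical stand-in for the real `php parse.php <chunk-file>` invocation; the point is just backgrounding two jobs and waiting for both:

```shell
#!/bin/sh
# Hypothetical sketch: launch two workers in the background, wait for both.
# "worker" stands in for the real `php parse.php <chunk-file>` call.
worker() {
    echo "parsed $1" > "result_$1.txt"   # pretend to parse a chunk
}

worker aa &    # parser for the first half
worker ab &    # parser for the second half
wait           # blocks until both background jobs have finished
cat result_aa.txt result_ab.txt
```

Each worker could instead record its progress row in the SQL table, which is all the cross-process "messaging" this plan needs.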
That part is actually pretty straightforward; where I'm stuck is splitting up the file to be parsed. I know there is a `split` tool that can break files into chunks by size or line count. The problem is that doesn't quite work for me: I need to split these files in half (or thirds, or quarters) cleanly, without any excess data going into an extra file. I also need to know exactly which files the `split` command has created, so I can note each file in my SQL table and the parsing script can tell which files are ready to be parsed. If possible, I'd even like to avoid running `wc -l` in this process. That may not be possible, but it takes about 7 seconds per file, and across 300 files that means about 35 extra minutes of runtime.
Despite what I just said, I guess I could run `wc -l file` on each file, divide the result by n, round up, and use `split` to break the file into chunks of that many lines. That should always give me exactly n files, and I'd know in advance that they'll be named `filea`, `fileb`, and so on.
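That approach can be sketched like this. The file name and `seq` input are illustrative stand-ins for the real 5 GB files; the ceiling division makes the last chunk the short one, so no unwanted extra file appears:

```shell
#!/bin/sh
seq 1 11 > bigfile                # stand-in input: 11 lines instead of ~11 million

n=2                               # number of chunks wanted
total=$(wc -l < bigfile)          # count lines once (the ~7 s step on the real files)
per=$(( (total + n - 1) / n ))    # ceiling division: lines per chunk

split -l "$per" bigfile bigfile.part_
ls bigfile.part_*                 # bigfile.part_aa, bigfile.part_ab
```

With 11 lines and n=2, `per` is 6, so `bigfile.part_aa` gets 6 lines and `bigfile.part_ab` gets the remaining 5; the predictable `aa`, `ab`, ... suffixes are what would go into the SQL table.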
I guess the question ultimately is: is there a better way to deal with this problem? Maybe there's another utility that splits in a way that's more compatible with what I'm doing, or maybe there's another approach entirely that I'm overlooking.