I have a very large text file (ca. 1.8TB) that I need to split at a certain entry. I know which line this entry is on, but I can also identify it via a grep command. I only care about the part of the file from this entry on.
I saw that certain Unix commands like csplit would do just that. However, the file also has an important header (30 lines long), and it is important that the newly created file(s) would also contain this header. As there's no way to prepend to files, I'm kind of stumped how to do this. Csplit and split don't seem to have the option to append their output to an existing file, and I think the file is too large for me to edit it with a text editor.
I would appreciate any advice!
I tested these commands on a file with 10 million lines and I hope that you will find them useful.
Extract the header (the first 30 lines of your file) into a separate file, `header.txt`.
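A command for this step might look like the following sketch (the input file name `huge.txt` is an assumption, substitute your own):

```shell
# Write the first 30 lines (the header) into header.txt.
# "huge.txt" is a placeholder for your actual 1.8TB file.
head -n 30 huge.txt > header.txt
```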
Now you can edit the file `header.txt` to add an empty line or two at its end, as a visual separator between it and the rest of the file.

Now copy your huge file from the 5-millionth line up to the end of the file into the new file `0.9TB.txt`.
Instead of the number 5000000, enter the number of the line you want to start copying from, as you say that you know it. Be patient, it can take a while. You can run the `top` command to see what's going on, and you can also track the growing file with `tail -f 0.9TB.txt`.
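The copy itself can be done with `tail` (a sketch; `huge.txt` stands in for your file and 5000000 for your starting line):

```shell
# "-n +K" makes tail print from line K (inclusive) to the end of the file.
tail -n +5000000 huge.txt > 0.9TB.txt
```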
Now merge `header.txt` and `0.9TB.txt`.
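The merge is a plain concatenation (the output name `final.txt` is an assumption):

```shell
# Write the header followed by the body into a new file.
cat header.txt 0.9TB.txt > final.txt
```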
Let me know if this solution worked for you.
Edit: Steps 2 and 3 (the copy and the merge) can be combined into one:
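One way to combine the copy and the merge is a single pipeline (the names `huge.txt` and `final.txt` are assumptions):

```shell
# tail streams everything from line 5000000 on; cat prepends
# header.txt ("-" makes cat read the pipeline's input there).
tail -n +5000000 huge.txt | cat header.txt - > final.txt
```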
Edit 26.05.21: I tested this solution with `split`, and it was orders of magnitude faster. If you don't have `perl`, use `head` to extract the header.
to extract the header:(Note the file with the extention *.
txtab
, created bysplit
)