Splitting large file in two while keeping header

I have a very large text file (ca. 1.8TB) that I need to split at a certain entry. I know which line this entry is on, but I can also identify it via a grep command. I only care about the part of the file from this entry on.

I saw that certain Unix commands like csplit would do just that. However, the file also has an important header (30 lines long), and it is important that the newly created file(s) would also contain this header. As there's no way to prepend to files, I'm kind of stumped how to do this. Csplit and split don't seem to have the option to append their output to an existing file, and I think the file is too large for me to edit it with a text editor.

I would appreciate any advice!

There are 2 answers

Answer by Aditya (best answer)

I tested these commands on a file with 10 million lines and I hope that you will find them useful.

Extract the header (the first 30 lines of your file) into a separate file, header.txt:

perl -ne 'print; exit if $. == 30' 1.8TB.txt > header.txt

Now you can edit the file header.txt in order to add an empty line or two at its end, as a visual separator between it and the rest of the file.

Now copy everything from line 5,000,000 to the end of your huge file into a new file, 0.9TB.txt. Replace 5000000 with the line number you want to start copying from, which you say you already know:

perl -ne 'print if $. >= 5000000' 1.8TB.txt > 0.9TB.txt

Be patient, this can take a while. You can run top to see what is going on, or follow the growing file with tail -f 0.9TB.txt.

Now merge the header.txt and 0.9TB.txt:

perl -ne 'print' header.txt 0.9TB.txt > header_and_0.9TB.txt

Let me know if this solution worked for you.

Edit: Steps 2 and 3 can be combined by appending directly to header.txt and then renaming it:

perl -ne 'print if $. >= 5000000' 1.8TB.txt >> header.txt
mv header.txt 0.9TB.txt
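
Since the question mentions that the entry can also be found with grep, the line number does not have to be typed in by hand. The following is a minimal sketch of that idea, not part of the tested steps above: it uses a tiny stand-in file, and the names workdir, in, out, header_lines and pattern are hypothetical. It assumes the pattern does not also match a line before the entry (for example inside the header).

```shell
set -eu

# Hypothetical names, chosen for this sketch only.
workdir=$(mktemp -d)
in="$workdir/big.txt"
out="$workdir/trimmed.txt"
header_lines=3          # would be 30 for the file in the question
pattern='^ENTRY_7$'     # stand-in for whatever identifies the entry

# Build a small stand-in input: 3 header lines followed by 10 entries.
{
  printf 'h%d\n' 1 2 3
  printf 'ENTRY_%d\n' 1 2 3 4 5 6 7 8 9 10
} > "$in"

# Line number of the first match; -m 1 makes grep stop at the first hit.
start=$(grep -n -m 1 -- "$pattern" "$in" | cut -d: -f1)

# Header first, then everything from the matching line to the end.
{
  head -n "$header_lines" "$in"
  tail -n "+$start" "$in"
} > "$out"

cat "$out"
```

Writing both parts through a single redirection means no intermediate header.txt is needed, and the large file is only read sequentially.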

Edit 26.05.21: I tested this solution with split and it was orders of magnitude faster.

If you don't have perl, use head to extract the header:

head -n30 1.8TB.txt > header.txt

split -l 5000030 1.8TB.txt 0.9TB.txt

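(split -l 5000030 cuts the input into consecutive chunks of 5,000,030 lines each, named 0.9TB.txtaa, 0.9TB.txtab, and so on. On my 10-million-line test file that gives exactly two chunks, and the second one, 0.9TB.txtab, starts at line 5000031. If your file has more than twice as many lines as the number you pass to -l, there will be further chunks, 0.9TB.txtac and so on, which also have to be appended in order.)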

cat 0.9TB.txtab >> header.txt

mv header.txt header_and_0.9TB.txt
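
To see how split -l names and sizes its chunks before running it on the real file, a small experiment can help. This is only an illustrative sketch on a 10-line stand-in file; workdir, sample.txt and the part. prefix are hypothetical names.

```shell
set -eu

workdir=$(mktemp -d)
seq 1 10 > "$workdir/sample.txt"

# Chunks of 4 lines each: a 10-line input yields three chunks (aa, ab, ac).
split -l 4 "$workdir/sample.txt" "$workdir/part."
ls "$workdir"

# Concatenate every chunk except the first, in name order, to get "the rest".
first="$workdir/part.aa"
for chunk in "$workdir"/part.*; do
  [ "$chunk" = "$first" ] || cat "$chunk"
done > "$workdir/rest.txt"

cat "$workdir/rest.txt"   # lines 5 through 10 of the sample
```

The first chunk holds lines 1 to 4, so "the rest" starts at line 5, one past the number given to -l.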

Answer by dcc310

split does indeed have a way to append to files (at least in the versions I have). You likely want the --filter option, a GNU coreutils feature, which enables fairly complicated things.

Suppose I have this file foo.csv:

header
data1
data2
data3
data4
data5

This code splits the file into pieces with at most 2 data lines each, and puts the header at the top of every piece.

# export matters: the filter runs in a child shell, and because the filter is
# single-quoted, $CSV is expanded there rather than here.
# Double quotes would make this shell expand $FILE, which split sets for the filter.
export CSV=foo.csv
N_PER_FILE=2

# Pass 1: create each output file containing only the header.
# tail -n +2 skips the header, so the chunking matches pass 2 exactly.
# The filter never reads its stdin, so the chunk data is discarded here; this
# pass only exists so that split works out the output file names for us.
split --verbose -l "$N_PER_FILE" --filter 'head -n 1 "$CSV" > "$FILE"' <(tail -n +2 "$CSV") "$CSV"

# Pass 2: append the non-header lines to each file
split --verbose -l "$N_PER_FILE" --filter 'cat - >> "$FILE"' <(tail -n +2 "$CSV") "$CSV"

# Recreate the original file from the pieces; tail -q suppresses the file-name banners
cat <(head -n 1 "$CSV") <(tail -q -n +2 "${CSV}"??) > reconstructed.csv

# Confirm that the split and the reconstruction look right
head *.csv?? reconstructed.csv

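For your 30-line header, the two header-related numbers would change: head -n 1 becomes head -n 30, and tail -n +2 becomes tail -n +31. N_PER_FILE is the number of data lines per output file, not the header length. I didn't write this for your case specifically, though.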
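
As a rough sketch of how that adaptation might look, the variant below takes the header length from a variable and writes header and data in a single pass, by letting the filter emit the header before copying its stdin. This is a different arrangement from the two-pass code above, it requires GNU split for --filter, and the names SRC, HEADER_LINES, rows_per_file and the part. prefix are hypothetical.

```shell
set -eu

workdir=$(mktemp -d)
# Exported so the filter's child shell can see them.
export SRC="$workdir/foo.csv"
export HEADER_LINES=2      # would be 30 for the file in the question
rows_per_file=2

# Stand-in input: a 2-line header followed by 5 data lines.
printf '%s\n' h1 h2 d1 d2 d3 d4 d5 > "$SRC"

# Skip the header, split the data, and have each filter run write the
# header first and then the chunk that split feeds it on stdin.
tail -n "+$((HEADER_LINES + 1))" "$SRC" |
  split -l "$rows_per_file" \
    --filter='{ head -n "$HEADER_LINES" "$SRC"; cat; } > "$FILE"' \
    - "$workdir/part."

head "$workdir"/part.*
```

The filter is single-quoted so that $FILE, $SRC and $HEADER_LINES are expanded by the shell that split starts for each chunk, not by the calling shell.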

That Bash may be a little advanced for some people, but the main idea for appending to files is --filter 'cat - >> $FILE', which simply appends STDIN to the file that split would normally create. Other fun possibilities are things like --filter 'gzip > $FILE.gz' to get gzipped parts right away.
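
The gzip variant can be tried out on a toy file first. The sketch below assumes GNU split and uses hypothetical names (workdir, data.txt, the part. prefix); it compresses each chunk as it is produced and then checks that decompressing the parts in name order gives back the original.

```shell
set -eu

workdir=$(mktemp -d)
seq 1 5 > "$workdir/data.txt"

# Each 2-line chunk is piped through gzip instead of being written as-is.
split -l 2 --filter='gzip > "$FILE.gz"' "$workdir/data.txt" "$workdir/part."
ls "$workdir"

# Decompress all parts in name order and compare with the input.
gzip -dc "$workdir"/part.*.gz > "$workdir/roundtrip.txt"
cmp "$workdir/data.txt" "$workdir/roundtrip.txt" && echo "round trip OK"
```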

(If the <(stuff) syntax is new to you, it is called "process substitution", in case you need to search for it!)