Efficient way to get n middle lines from a very big file


I have a big file around 60GB.

I need to get n middle lines of the file. I am using a command with head and tail like

tail -m file |head -n >output.txt

where m,n are numbers

The file is a set of records with comma-separated columns, structured as shown below. Lines can differ in length (say, up to 5000 characters).

col1,col2,col3,col4...col10

Is there a faster way to extract n middle lines? The current command takes a long time to run.

There are 6 answers

1
MarcoS On

The only solution I can think of to speed up the search is to build an index of your lines (line number and byte offset), something like:

 0 00000000
 1 00000013
 2 00000045
   ...
 N 48579344

Then, knowing the index length, you can jump quickly into the middle of your data file (or wherever you like...). Of course, you have to keep the index updated when the file changes...
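A minimal sketch of such an index in Python, assuming one fixed-width 8-byte offset per line so that the entry for line k sits at byte 8*k of the index file. The function names `build_line_index` and `read_lines` and the binary index layout are illustrative choices, not something specified in this answer. Building the index still needs one full pass over the data; only later lookups are fast.

```python
from itertools import islice


def build_line_index(data_path, index_path):
    """Store the byte offset of each line start as an 8-byte big-endian record."""
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            index.write(offset.to_bytes(8, "big"))
            offset += len(line)  # len() of a bytes line is its size in bytes


def read_lines(data_path, index_path, first_line, count):
    """Return up to `count` lines starting at 0-based line number `first_line`."""
    with open(index_path, "rb") as index:
        index.seek(first_line * 8)  # fixed-width records allow a direct jump
        record = index.read(8)
    if len(record) < 8:
        return []  # requested line is past the end of the file
    with open(data_path, "rb") as data:
        data.seek(int.from_bytes(record, "big"))
        return list(islice(data, count))
```

The number of lines is the index file size divided by 8, so the middle line number can be computed without touching the data file.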

Obviously, the canonical solution for such a problem would be to keep the data in a DB (see for example SQLite), and not in a plain file... :-)

1
perreal On

With sed you can at least remove the pipeline:

sed -n '600000,700000p' file > output.txt

will print lines 600000 through 700000.
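For comparison, a single-pass extraction can be sketched in Python with `itertools.islice`. This is not the sed command itself but a swapped-in equivalent, and the function name `extract_line_range` is hypothetical. Unlike the sed command as written, it stops reading once the last requested line has been copied.

```python
from itertools import islice


def extract_line_range(src_path, dst_path, first, last):
    """Copy lines first..last (1-based, inclusive) from src_path to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # islice skips the first (first - 1) lines and stops after line `last`,
        # so the remainder of the file is never read.
        dst.writelines(islice(src, first - 1, last))
```

Calling `extract_line_range("file", "output.txt", 600000, 700000)` would cover the same range as the command above.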

1
Rajish On

It might be more efficient to use the split utility, because with tail and head in a pipe you scan some parts of the file twice.

Example

split -l <k> <file> <prefix>

Where k is the number of lines you want to have in each file, and the (optional) prefix is added to each output file name.

0
bobah On

Open the file in binary random-access mode, seek to the middle, and read forward until you reach a line terminator (\n, or \r\n for Windows line endings). Starting from the following character, dump N lines to your result file (one \n, one line). Job done.

If your file is sorted and you need the data between two keys, you can combine the method described above with bisection.
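The seek-and-resynchronise step can be sketched in Python as follows. The function name `lines_from_middle` and its `fraction` parameter are assumptions made for illustration. Note that "middle" here means the middle by byte position: without an index, the line numbers of the returned lines are unknown.

```python
import os
from itertools import islice


def lines_from_middle(path, count, fraction=0.5):
    """Return up to `count` whole lines starting near `fraction` of the file size."""
    size = os.path.getsize(path)
    start = int(size * fraction)
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte so that landing exactly on a line start
            # (previous byte is \n) does not skip that line.
            f.seek(start - 1)
            f.readline()  # discard the rest of the line we landed in
        return list(islice(f, count))
```

Reading in binary mode keeps \r\n endings intact, since every line still ends at the \n byte.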

0
Anitha Mani On

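awk -v n=600000 -v m=700000 'FNR>=n && FNR<=m'

followed by the name of the file, where n and m are the first and last line numbers to print (the values shown are examples).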

0
Mihamina Rakotomandimby On

Having the same problem (mine is an Asterisk Master.csv file), I am afraid there is no trivial solution: to access the 10,000,000th line of a file (a file, not a database record or an in-memory representation of the file), something has to count from 0 to 10,000,000... :-(