I'm trying to randomly sample lines from a (large) file while always retaining a set of "header lines". Header lines are always at the top of the file and, unlike all other lines, begin with a #.
The actual file format I'm dealing with is VCF, but I've kept the question general.
Requirements:
- Output all header lines (identified by a # at line start)
- The command / script should (have the option to) read from STDIN
- The command / script should output to STDOUT
For example, consider the following sample file (file.in):
#blah de blah
1
2
3
4
5
6
7
8
9
10
An example output (file.out) would be:
#blah de blah
10
2
5
3
4
I have a working solution (in this case selecting 5 non-header lines at random) using bash. It can read from STDIN (I can cat the contents of file.in into the rest of the command); however, it writes to a named file rather than to STDOUT:
cat file.in | tee >(awk '$1 ~ /^#/' > file.out) | awk '$1 !~ /^#/' | shuf -n 5 >> file.out
By using process substitution (thanks Tom Fenech), the output of each command is seen as a file.
Then, using cat, we can concatenate these "files" together and output to STDOUT.
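As a minimal sketch of that idea, assuming bash (for process substitution) and GNU shuf: the function name sample_with_header and its two arguments are hypothetical, chosen only for illustration, and grep stands in for the awk filters from the question.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original post):
#   $1 = number of non-header lines to sample
#   $2 = path to the input file
sample_with_header() {
  local n="$1" file="$2"
  # Each <( ... ) is presented to cat as a file. cat reads them in order,
  # so every header line is printed before any sampled line.
  cat <(grep '^#' "$file") \
      <(grep -v '^#' "$file" | shuf -n "$n")
}
```

Called as sample_with_header 5 file.in, this prints the header followed by 5 randomly chosen non-header lines on STDOUT. Note that the input is read twice, so this form needs a named file rather than STDIN.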
Input
Output