Randomly sample lines retaining commented header lines

I'm attempting to randomly sample lines from a (large) file while always retaining a set of "header lines". Header lines are always at the top of the file and, unlike any other lines, begin with a #.

The actual file format I'm dealing with is VCF, but I've kept the question general.

Requirements:

  • Output all header lines (identified by a # at line start)
  • The command / script should (have the option to) read from STDIN
  • The command / script should output to STDOUT

For example, consider the following sample file (file.in):

#blah de blah
1
2
3
4
5
6
7
8
9
10

An example output (file.out) would be:

#blah de blah
10
2
5
3
4

I have a working solution (in this case selecting 5 non-header lines at random) using bash. It can read from STDIN (I can cat the contents of file.in into the rest of the command), but it writes to a named file rather than STDOUT:

cat file.in | tee >(awk '$1 ~ /^#/' > file.out) | awk '$1 !~ /^#/' | shuf -n 5 >> file.out
There is 1 answer


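By using process substitution (thanks Tom Fenech), the output of each command can be treated as a file.
cat then concatenates these "files" in order, headers first, and writes the result to STDOUT.
Note that this reads the input twice, so it needs a named file rather than STDIN.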

cat <(awk '/^#/' file) <(awk '!/^#/' file | shuf -n 10)

Input

#blah de blah
1
2
3
4
5
6
7
8
9
10

Output

#blah de blah
1
9
8
4
7
2
3
10
6
5
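
If the input has to come from STDIN, as the question requires, a single pass is needed because a pipe cannot be read twice. The following is only a sketch under some assumptions: `sample_keep_header` is a hypothetical function name, `shuf` is assumed to be available, and all header lines are assumed to come before any data line. awk prints the header lines straight to STDOUT and pipes everything else to `shuf`, which only writes its sample once the input ends.

```shell
# Hypothetical helper: keep every line starting with '#', sample N of the rest.
# Usage: sample_keep_header N < file.in
sample_keep_header() {
  awk -v n="$1" '
    # Header line: print immediately and flush so it reaches STDOUT
    # before shuf writes anything.
    /^#/ { print; fflush(); next }
    # Any other line: send it to a single shuf process.
    { print | ("shuf -n " n) }
  '
}
```

For example, `cat file.in | sample_keep_header 5` should print the header followed by 5 randomly chosen data lines.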