I have a set of data consisting of seismic wave travel times and their corresponding information (i.e. the source that produced the wave and the time at which that wave arrived at each geophone along the spread). I am trying to reformat the data to fit my tomography code, but I'm still relatively new to awk. I am now at the point where I need to insert the number of receivers for each shot/source into that shot/source's line of information, but the count varies each time. Is there a way to have awk count the number of rows and insert that number into the proper field?

My data is formatted like the following.

Each line that documents a source/shot:

s 0.01 0 0 -1 0

Every other line that follows the source/shot information:

r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01

I can use the "s" as a flag for the shot lines, and I would like to count the number of "r" lines for each source/shot and insert that number into the corresponding "s" line.

The number of "r" lines for each "s" line varies greatly.

Given this sample input:

s 0.01 0 0 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 0 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01

The expected output is:

s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01

Note the 3 as $4 in the first s line and the 5 as $4 in the second one.

The counted number of receiver rows should go in the 4th column of each "s" line.

My experience with awk is limited to rearranging/indexing columns, so I don't really know where to begin with this. I've tried googling for help with awk, but it's difficult to find answered awk questions that actually pertain to my specific situation (which is why I've decided to ask it myself).

I'm also new to using Stack Overflow, so if I need to include more example data, please let me know. My data consists of approximately 4000 lines.

EDIT: The reason the desired result shows slightly different data from my real data is that there are hundreds of lines for each "s" line, and including all of that in the question seemed excessive. I have cut out the majority of the data for ease of reading.

There are 6 answers

Answer by jhnc (accepted)

A simple method is to read the file backwards.

  • whenever you see an r line, increment a counter
  • whenever you see an s line, substitute the counter and reset it

and then reverse the result:

tac input |
awk '
    /^r/ { n++ }
    /^s/ { $4=n; n=0 }
         { print }
' |
tac > output

You can read the file forwards but that involves maintaining state:

awk '
    /^s/ {
        # this prints the *previous* group of lines
        if (NR>1)
            print c1,c2,c3, n, c5,c6, r

        # save s columns, initialise n counter and r string
        c1=$1; c2=$2; c3=$3; n=0; c5=$5; c6=$6; r=""
    }
    /^r/ {
        n++
        r = r RS $0
    }
    END {
        # print final group
        print c1,c2,c3, n, c5,c6, r
    }
' input >output

Answer by dawg

Here is a Ruby solution based on a multi-line regex:

ruby -e 'puts $<.read.scan(/(^s.*\R)((?:^r.*\R?)+)/).
    map{|s,r| n=r.split(/\R/).length; a=s.split; a[3]=n; "#{a.join(" ")}\n#{r}"}' file

Or, reverse the lines in memory and print at the end:

ruby -lane 'BEGIN{lines=[]}
lines<<$F
END{
    n=0
    puts lines.reverse.
         map{|l| if l[0]=="s" then l[3]=n; n=0 else n+=1 end; l.join(" ")}.
         reverse.join("\n")
}
' file

Or, parse the input into rolling blocks. The advantage here is that only the relevant block has to be in memory:

ruby -lane 'BEGIN{
    lines=[]
    def print_block(block) = puts block.map{|l| l.join(" ")}.join"\n"
}
if $F[0]=="s" then
    print_block(lines) if lines.length>0
    $F[3]=$F[3].to_i
    lines=[$F]
else
    lines[0][3]+=1
    lines<<$F
end
END{print_block(lines)}
' file

Or you can use this GNU awk:

gawk '@include "join"
function p(){
    for(i=1;i<=length(lines); i++)
        print join(lines[i],1,length(lines[i])," ")
}

/^s/{
    if (lc>1) p()
    delete lines
    lc=1
    for (i=1;i<=NF;i++) lines[lc][i]=$i
}
/^r/{
    lc++
    for (i=1;i<=NF;i++) lines[lc][i]=$i
    lines[1][4]++
}
END{p()}' file 

Any of these prints:

s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01
  

Answer by markp-fuso

Undoing the desired updates from OP's expected output gives me the following input:

$ cat input.dat
s 0.01 0 0 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 0 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01

One awk idea:

awk '

function print_block(i) {
    if (s_line) {                       # if s_line not empty then ...
       sub(/COUNT/,cnt,s_line)          # replace "COUNT" with actual count and ...
       print s_line                     # print s line and ..
       for (i=1; i<=cnt; i++)
           print r_lines[i]             # r lines to stdout
    }
    cnt    = 0
    s_line = ""
    delete r_lines
}

$1 == "s" { print_block()               # print previous block of s/r lines
            $4     = "COUNT"            # replace 4th field with placeholder "COUNT"
            s_line = $0                 # save current s line
          }
$1 == "r" { r_lines[++cnt] = $0 }       # save r lines
END       { print_block() }             # flush last s/r block to stdout
' input.dat

This generates:

s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01

Answer by Ed Morton

Using any awk:

$ awk '
    /^s/ {
        if (NR>1) {
            prt()
        }
        cnt = 0
        shot = $0
        rs = ""
    }
    /^r/ {
        cnt++
        rs = rs $0 ORS
    }
    END { prt() }

    function prt(    orig) {
        orig = $0
        $0 = shot
        $4 = cnt+0
        print $0
        printf "%s", rs
        $0 = orig
    }
' file
s 0.01 0 3 -1 0
r 0.1 0 0 1.218 0.01
r 0.15 0 0 1.214 0.01
r 0.2 0 0 1.213 0.01
s 1.01 0 5 -1 0
r 0.05 0 0 1.159 0.01
r 0.1 0 0 1.127 0.01
r 0.15 0 0 1.106 0.01
r 0.2 0 0 1.115 0.01
r 0.25 0 0 1.107 0.01

Answer by sudocracy

If you are using gawk (GNU awk), you can use the gensub function as follows, leveraging awk's ability to use arbitrary strings as field and record separators (demo):

awk -v RS='s' -v FS='r' -v ORS='' -v OFS='r' \
    '{ $1 = gensub(/[0-9.-]+/, NF - 1, 3, $1) } NR != 1 { print RS $0 }' \
    input.dat

If the gensub function is not available, it is a little more involved to split the s line and make the change, but it is still doable (see below).

The approach works as follows:

  • We set the field separator (FS) and record separator (RS) to r and s respectively. This means awk reads each s line together with its r lines as one record, and the number of occurrences of r is the number of fields (NF) minus 1.
  • We rely on the gensub function to replace the n-th occurrence of a regex match on the first field (i.e., the s row) and return the modified value. In our case, we want to replace the third number with the count, i.e., NF - 1 (a short standalone illustration of gensub follows this list).
  • The print statement outputs every record except the first: awk sees an empty record in front of the first s, which we want to ignore, hence the NR != 1 guard.
  • We need to set the output record and field separators (ORS and OFS) to match the input so that awk does not rebuild records with its defaults (a newline and spaces). Normally we would set ORS to s, but that would produce a trailing s, so we set ORS to empty and emit the leading s ourselves via the print RS $0 bit.
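
As an aside, a minimal standalone illustration of gensub replacing only the n-th match; the literal string below is simply made up to mirror the body of the first s line:

gawk 'BEGIN { print gensub(/[0-9.-]+/, "COUNT", 3, "0.01 0 0 -1 0") }'
# prints: 0.01 0 COUNT -1 0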

A variation of this answer that would work even when the gensub function is not available is as follows (demo):

awk -v RS='s' -v FS='r' -v ORS='' -v OFS='r' \
    'NR != 1 {
        string = ""
        n = split($1, numbers, " ")

        numbers[3] = NF - 1
        # rebuild the s-line body by index ("for (i in ...)" does not guarantee order)
        for (i = 1; i <= n; i++) string = string " " numbers[i]

        $1 = string "\n"
        print "s" $0
    }' input.dat

The logic remains the same, except that since we cannot use a handy regex function to do the replacement, we split the string and replace the part we need.

Answer by contr.error

When solving problems for myself I often resort to dirty techniques, such as modifying the input. Here, I'm adding a line starting with "s" at the end of the file to avoid creating an END block. If my code were simpler, having an END block would of course be much preferred. Apparently, grokking awk's function syntax would also have helped me simplify (a rough sketch of that simpler shape follows the code).

sed '$a\
s
' input |
awk '
BEGIN {
        delete lines[0]
}
{
        line=$0
}
($1 == "s") {
        ln=NR
        if (length(lines) > 0) {
                $0=lines[0]                  # recall the buffered s line
                delete lines[0]
                cnt=length(lines)            # number of buffered r lines
                $4=cnt
                print
                # iterate by index: "for (i in lines)" does not guarantee order
                for (i=1; i<=cnt; i++) {
                        $0 = lines[i]
                        delete lines[i]
                        print
                }
        }
        lines[0]=line
}
($1 == "r") {
        lines[NR-ln]=line
}
'
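
For what it's worth, a rough sketch of the simpler shape alluded to above, buffering each block and flushing it from an END block instead of appending a sentinel line (similar in spirit to the other buffering answers on this page; only checked against the sample data):

awk '
    # flush(): print the buffered s line (count in field 4) followed by its r lines
    function flush(   i, n, f, out) {
        if (sline != "") {
            n = split(sline, f, " ")
            f[4] = cnt
            out = f[1]
            for (i = 2; i <= n; i++) out = out " " f[i]
            print out
            for (i = 1; i <= cnt; i++) print rlines[i]
        }
        cnt = 0
        sline = ""
        delete rlines
    }
    $1 == "s" { flush(); sline = $0 }   # flush the previous block, start a new one
    $1 == "r" { rlines[++cnt] = $0 }    # buffer receiver lines
    END       { flush() }               # flush the final block
' input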