Bash/Nawk whitespace problems


I have 100 datafiles, each with 1000 rows, and they all look something like this:

0       0   0   0
1       0   1   0
2       0   1   -1
3       0   1   -2
4       1   1   -2
5       1   1   -3
6       1   0   -3
7       2   0   -3
8       2   0   -4
9       3   0   -4
10      4   0   -4
.
.
.
999     1   47  -21
1000        2   47  -21

I have developed a script which is supposed to square each value in columns 2, 3 and 4, sum the squares, and take the square root of the sum, like so:

temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)

It then calculates the square of that value, and averages these numbers over every data-file to output the average "calc" for each row and average "fluc" for each row.

The meaning of these numbers is this: the first number is the step number, and the next three are coordinates on the x, y and z axes respectively. I am trying to find the distance the "steps" have taken me from the origin, which is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4 or f = (r^2)^2. These must be averaged over the 100 data files, which leads me to:

r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s

and similarly for f, where s is the number of data files read, which I determine using sum=$(ls -l *.data | wc -l). Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop using the final values.

The script I created is:

#!/bin/bash

sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
    for(i=0;i<=s-1;i++)
    {
        t1 = 2+(i*4)
        t2 = 3+(i*4)
        t3 = 4+(i*4)
        temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
        calc = $calc + sqrt ($temp)
        fluc = $fluc + ($calc*$calc)
    }
    stddev = sqrt(($calc^2) - ($fluc))
    print $1" "calc/s" "fluc/s" "stddev
    temp=0
    calc=0
    stddev=0
}'

Unfortunately, part way through I receive an error:

nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1

I am not experienced enough with awk to figure out exactly where I am going wrong. Could someone point me in the right direction or give me a better script?

The expected output is one file with:

0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.

There are 2 answers

5
Marcus Rickert On BEST ANSWER

The following script should do what you want. The only thing that might need adjusting is the choice of delimiter: your original script seems to use tabs, while my solution sets FS to a single space. With that setting (which is also awk's default), fields are split on any run of spaces or tabs, so this should not be a problem.

It simply pipes all files sequentially into nawk without counting the files first since, as far as I understand, the count is not required. Instead of trying to keep track of positions in the file, it uses arrays to store separate statistical data for each step. At the end it iterates over all step indexes found and outputs them. Since that iteration is not sorted, the output is piped into a Unix sort call, which takes care of the ordering.

#!/bin/bash
# pipe the data of all files into the nawk processor
cat *.data | nawk ' 
BEGIN { 
  FS=" "                         # set the delimiter for the columns
} 
{
  step = $1                      # step is in column 1
  temp = $2*$2 + $3*$3 + $4*$4

  # use arrays indexed by step to store data
  calc[step] = calc[step] + sqrt (temp)
  fluc[step] = fluc[step] + calc[step]*calc[step]
  count[step] = count[step] + 1   # count the number of samples seen for a step
}
END {
  # iterate over all existing steps (this is not sorted!)
  for (i in count) {
    stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
    print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
  }
}' | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"
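As for the error in the original script: in awk a leading $ does not dereference a variable the way it does in the shell. $calc means "the field whose number is stored in calc", so once a data value such as -1 ends up in such a variable, awk is asked for field -1, which would explain the message. A minimal sketch of the difference, using a made-up sample line and plain awk (the function names are only for illustration):

```shell
#!/bin/bash
# Sample input line (made up for illustration): step x y z
line='5 1 1 -3'

# Plain names are awk variables; $2, $3, $4 are the input fields.
distance_of_line() {
  echo "$1" | awk '{
    temp = $2*$2 + $3*$3 + $4*$4     # 1 + 1 + 9 = 11
    printf "%d %.4f\n", $1, sqrt(temp)
  }'
}

# $n is "field number n", not "the value of n".
field_vs_variable() {
  echo "$1" | awk '{
    n = 4
    print n, $n                      # the number 4, then the content of field 4
  }'
}

distance_of_line "$line"     # prints: 5 3.3166
field_vs_variable "$line"    # prints: 4 -3
```

Inside the awk program, the $ is therefore only needed for the input columns; temp, calc and fluc are written without it.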

EDIT

As suggested by @edmorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as arguments to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the statistics are still gathered for all steps. It is a little difficult to suppress this while reading the data, since at that point we don't yet know what the last step will be. Although this can be done with some extra effort, you would probably lose a lot of the robustness of your data handling, since right now the script does not make any assumptions about:

  • the number of files provided,
  • the order of the files processed,
  • the number of steps in each file,
  • the order of the steps in a file,
  • the completeness of steps as a range without "holes".

Enhanced script:

#!/bin/bash
nawk ' 
BEGIN { 
  FS=" "   # set the delimiter for the columns (not really required for space which is the default)
  maxstep = -1
} 
{
  step = $1                      # step is in column 1
  temp = $2*$2 + $3*$3 + $4*$4

  # remember maximum step for selected output
  if (step > maxstep)
    maxstep = step

  # use arrays indexed by step to store data
  calc[step] = calc[step] + sqrt (temp)
  fluc[step] = fluc[step] + calc[step]*calc[step]
  count[step] = count[step] + 1   # count the number of samples seen for a step
}
END {
  # iterate over all existing steps (this is not sorted!)
  for (i in count) {
    stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
    if (i == maxstep)
      # handle the last step in a special way
      print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
    else
      # this is the normal handling
      print i" "calc[i]/count[i]
  }
}' *.data | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"
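The scripts above accumulate fluc from the running sum in calc[step], following the original script. If what is actually wanted is the usual spread of r per step, sqrt(mean(r^2) - mean(r)^2), a variant could accumulate r and r^2 per sample instead. The following is only a sketch under that assumption; step_stats and its variable names are hypothetical, and it uses plain awk rather than nawk:

```shell
#!/bin/bash
# Sketch: per-step mean of r, mean of r^2, and sqrt(mean(r^2) - mean(r)^2).
# Assumes whitespace-separated columns: step x y z (as in the question).
step_stats() {
  awk '
  {
    r2 = $2*$2 + $3*$3 + $4*$4      # squared distance for this sample
    sum_r[$1]  += sqrt(r2)          # running sum of r per step
    sum_r2[$1] += r2                # running sum of r^2 per step
    n[$1]++                         # number of samples seen per step
  }
  END {
    for (s in n) {                  # iteration order is unspecified, hence the sort
      mean_r  = sum_r[s] / n[s]
      mean_r2 = sum_r2[s] / n[s]
      var = mean_r2 - mean_r * mean_r
      if (var < 0) var = 0          # guard against tiny negative rounding errors
      printf "%d %.4f %.4f %.4f\n", s, mean_r, mean_r2, sqrt(var)
    }
  }' "$@" | sort -n -k 1
}

# Usage (not run here): step_stats *.data > stats.out
```

Like the scripts above, this makes no assumption about the number of files, their order, or gaps in the step range, because everything is keyed by the step value in column 1.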
5
Håkon Hægland On

You could also use:

awk -f c.awk *.data

where c.awk is

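{
    j=FNR                              # row number within the current file
    temp=$2*$2+$3*$3+$4*$4             # squared distance from the origin
    calc[j]=calc[j]+sqrt(temp)
    fluc[j]=fluc[j]+calc[j]*calc[j]
}

END {
    N=ARGIND                           # gawk extension: index of the last file processed
    for (i=1; i<=FNR; i++) {           # in gawk, FNR still holds the row count of the last file
        stdev=sqrt(fluc[i]-calc[i]*calc[i])
        print i-1,calc[i]/N,fluc[i]/N,stdev
    }
}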
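ARGIND is specific to gawk, so with nawk or another awk the file count N would be empty. A portable way to count the input files is to increment a counter whenever FNR is 1. The sketch below shows only that idea, averaging r per row; row_stats and its variable names are hypothetical, and an empty file would not be counted:

```shell
#!/bin/bash
# Sketch: average r per row across files, counting files without ARGIND.
# Assumes every file lists the same steps in the same row order.
row_stats() {
  awk '
  FNR == 1 { nfiles++ }              # first row of each non-empty input file
  {
    sum_r[FNR] += sqrt($2*$2 + $3*$3 + $4*$4)
    if (FNR > rows) rows = FNR       # remember the longest file seen
  }
  END {
    for (i = 1; i <= rows; i++)
      printf "%d %.4f\n", i - 1, sum_r[i] / nfiles
  }' "$@"
}

# Usage (not run here): row_stats *.data
```

Because the rows are visited with a counting loop rather than for (i in ...), the output is already in row order and no sort is needed.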