I have 100 datafiles, each with 1000 rows, and they all look something like this:
0 0 0 0
1 0 1 0
2 0 1 -1
3 0 1 -2
4 1 1 -2
5 1 1 -3
6 1 0 -3
7 2 0 -3
8 2 0 -4
9 3 0 -4
10 4 0 -4
.
.
.
999 1 47 -21
1000 2 47 -21
I have developed a script which is supposed to take the square of each value in columns 2, 3, and 4, sum them, and then take the square root. Like so:
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
It then squares that value, and averages these numbers over every data file to output the average "calc" for each row and the average "fluc" for each row.
The meaning of these numbers is this:
The first number is the step number, and the next three are coordinates on the x, y and z axes respectively. I am trying to find the distance the "steps" have taken me from the origin, which is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4, or f = (r^2)^2.
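As a quick sanity check, one row can be pushed through awk by hand. This is only an illustration of the two formulas above, with the column layout assumed to be step x y z as in the sample data:

```shell
# Compute r = sqrt(x^2 + y^2 + z^2) and f = r^4 for a single row.
# Column layout assumed: step x y z, as in the sample data above.
echo "5 1 1 -3" | awk '{
    r = sqrt($2*$2 + $3*$3 + $4*$4)   # distance from the origin
    f = (r*r) * (r*r)                 # fluctuation f = (r^2)^2 = r^4
    printf "%d %g %g\n", $1, r, f
}'
# prints: 5 3.31662 121
```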
These must be averaged over the 100 data files, which leads me to:
r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s
and similarly for f, where s is the number of data files read, which I figure out using sum=$(ls -l *.data | wc -l).
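That accumulation can be sketched with three tiny stand-in files in place of the real 100. Note the field index is written as $(2 + i*4), an awk expression rather than a shell variable; the file names and values here are made up for the demonstration:

```shell
# Average r over several files for each step.  paste puts the files
# side by side, 4 columns per file; stand-in files replace the real 100.
cd "$(mktemp -d)"
printf '0 0 0 0\n1 1 0 0\n' > a.data
printf '0 0 0 0\n1 0 2 0\n' > b.data
printf '0 0 0 0\n1 0 0 2\n' > c.data
s=3                                      # number of data files
paste -d' ' *.data | awk -v s="$s" '{
    r = 0
    for (i = 0; i < s; i++) {
        x = $(2 + i*4); y = $(3 + i*4); z = $(4 + i*4)
        r += sqrt(x*x + y*y + z*z)       # running sum over the files
    }
    print $1, r / s                      # average r for this step
}'
```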
Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop, using the final values.
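Spelled out once with plain numbers, that last step is a single expression. The avg_r and avg_fluc values below are made-up final averages for illustration, not values from the data:

```shell
# Deviation computed once, outside any loop, from the final averages,
# following the formula stddev = sqrt(fluc - (r^2)^2).
awk 'BEGIN {
    avg_r    = 2.0    # hypothetical average r over all files
    avg_fluc = 20.0   # hypothetical average r^4 over all files
    print sqrt(avg_fluc - (avg_r^2)^2)
}'
# prints: 2
```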
The script I created is:
#!/bin/bash
sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
for (i = 0; i <= s-1; i++)
{
    t1 = 2+(i*4)
    t2 = 3+(i*4)
    t3 = 4+(i*4)
    temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
    calc = $calc + sqrt ($temp)
    fluc = $fluc + ($calc*$calc)
}
stddev = sqrt(($calc^2) - ($fluc))
print $1" "calc/s" "fluc/s" "stddev
temp = 0
calc = 0
stddev = 0
}'
Unfortunately, partway through I receive an error:
nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1
I am not experienced enough with awk to be able to figure out exactly where I am going wrong. Could someone point me in the right direction or give me a better script?
The expected output is one file with:
0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.
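To help pin down the error before looking at a full rewrite: in awk, a leading $ is not a variable sigil as it is in the shell; $calc means "the field whose number is the current value of calc". The minimal example below (my own, not from the script above) shows the intended arithmetic without the stray $ prefixes. With them, calc can pick up a negative coordinate such as the -1 in row 3, and the next $calc then asks for field -1, which would match the fatal error:

```shell
# In awk, $x is a field lookup: $t1 is the field whose number is
# stored in t1, while plain t1 is the variable itself.  Dropping the
# stray $ from temp and calc makes the arithmetic behave:
echo "2 0 1 -1" | awk '{
    t1 = 2; t2 = 3; t3 = 4             # field indexes, so $t1 etc. are fine
    temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)   # 0 + 1 + 1 = 2
    calc = calc + sqrt(temp)           # no $: temp and calc are variables
    print calc
}'
# prints: 1.41421
```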
The following script should do what you want. The only thing that might not work yet is the choice of delimiters: in your original script you seem to have tabs, while my solution assumes spaces, but changing that should not be a problem.
It simply pipes all files sequentially into nawk without counting the files first; as I understand it, that count is not required. Instead of trying to keep track of positions in the file, it uses arrays to store separate statistical data for each step. In the end it iterates over all step indexes found and outputs them. Since that iteration is not sorted, there is another pipe into a Unix sort call which handles this.
EDIT
As suggested by @edmorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as a parameter to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the statistics are still gathered for all steps; it is a little difficult to suppress that while the data is being read, since at that point we don't yet know which step will be the last. Although this could be done with some extra effort, you would probably lose a lot of the robustness of your data handling, since right now the script does not make any assumptions about:
Enhanced script:
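The enhanced script itself is not reproduced above, but the approach it describes (awk reading the *.data files directly, per-step arrays, and an external sort) might look roughly like this sketch. It is my own reconstruction under those assumptions, with two stand-in files instead of the real 100, not the answerer's exact code:

```shell
# Sketch of the array-based approach: awk reads every *.data file
# itself, per-step arrays collect the sums, and sort orders the output.
# Two stand-in files replace the real 100.
cd "$(mktemp -d)"
printf '0 0 0 0\n1 1 0 0\n' > a.data
printf '0 0 0 0\n1 0 2 0\n' > b.data
awk '{
    step = $1
    r = sqrt($2*$2 + $3*$3 + $4*$4)
    rsum[step] += r                    # running sum of r per step
    fsum[step] += (r*r) * (r*r)        # running sum of r^4 per step
    n[step]    += 1                    # files contributing to this step
}
END {
    for (step in n) {
        ar = rsum[step] / n[step]      # average r
        af = fsum[step] / n[step]      # average fluctuation
        print step, ar, af, sqrt(af - (ar^2)^2)
    }
}' *.data | sort -n
```

Because the for-in iteration order over the array is unspecified in awk, the final sort -n restores step order, as described in the answer.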