Bash: Replace values of a column while retaining line order.

104 views Asked by At

I have a file, FILE1, of some 19.000 lines that has the following format:

PAAXXXX PAAXXXX 0 0 1 -9
PAAXXXY PAAXXXY 0 0 1 -9
PAAXXYX PAAXXYX 0 0 2 -9
PAAXYXX PAAXYXX 0 0 2 -9
PAAYXXX PAAYXXX 0 0 1 -9
PAAYYXX PAAYYXX 0 0 1 -9
PAAYYYX PAAYYYX 0 0 2 -9
PAAAAAB PAAAAAB 0 0 1 -9

The IDs in columns 1 and 2 are the same in each case, col 3 and 4 are always 0, col 5 is either 1 or 2 and col 6 is always -9.

I have two more files, FILE2 and FILE3, which contain only IDs and do not overlap. For example:

head -3 FILE2
    PAAXXYX 
    PAAXYXX
    PAAYXXX

head -2 FILE3
    PAAYYYX
    PAAAAAB

I would like to change the value of col6 in FILE1 based on overlapping of col1 or col2 with the IDs in files 2 and 3. Specifically, I would like to assign 1 to column 6 for all IDs found in FILE2 and 2 for all IDs found in FILE3. My final output might therefore look something like this:

PAAXXXX PAAXXXX 0 0 1 -9
PAAXXXY PAAXXXY 0 0 1 -9
PAAXXYX PAAXXYX 0 0 2 1
PAAXYXX PAAXYXX 0 0 2 1
PAAYXXX PAAYXXX 0 0 1 1
PAAYYXX PAAYYXX 0 0 1 -9
PAAYYYX PAAYYYX 0 0 2 2
PAAAAAB PAAAAAB 0 0 1 2

Note that if some IDs in FILE1 are found neither in FILE2 or FILE3, I would like to retain the original value. Furthermore it is crucial that the order of the output is not changed. It would be fine to solve the problem in two steps, first comparing FILE1 to FILE2 and then FILE1 to FILE3.

I feel that my problem might potentially be solved by running sed in a for-loop but so far I have been unsuccessful. Thank you very much for your help.

1

There are 1 answers

11
123 On BEST ANSWER

In awk, using ARGIND

awk 'ARGIND~"1|2"{a[$1]=ARGIND;next}a[$1]{$NF=a[$1]}1' FILE2 FILE3 FILE1

PAAXXXX PAAXXXX 0 0 1 -9
PAAXXXY PAAXXXY 0 0 1 -9
PAAXXYX PAAXXYX 0 0 2 1
PAAXYXX PAAXYXX 0 0 2 1
PAAYXXX PAAYXXX 0 0 1 1
PAAYYXX PAAYYXX 0 0 1 -9
PAAYYYX PAAYYYX 0 0 2 2
PAAAAAB PAAAAAB 0 0 1 2