Good morning everyone, I have a data.ped file made up of thousands of columns and hundreds of lines. The first 6 columns and the first 4 lines of the file look like this:
186 A_Han-4.DG 0 0 1 1
187 A_Mbuti-5.DG 0 0 1 1
188 A_Karitiana-4.DG 0 0 1 1
191 A_French-4.DG 0 0 1 1
And I have an ids.txt file that looks like this:
186 Ignore_Han(discovery).DG
187 Ignore_Mbuti(discovery).DG
188 Ignore_Karitiana(discovery).DG
189 Ignore_Yoruba(discovery).DG
190 Ignore_Sardinian(discovery).DG
191 Ignore_French(discovery).DG
192 Dinka.DG
193 Dai.DG
What I need is to replace (in Unix) the value in the first column of the data.ped file with the value in the second column of the ids.txt line whose first column holds the same value. For example, I want to replace the "186" in the first column of data.ped with "Ignore_Han(discovery).DG" from the second column of ids.txt, because the first column of that ids.txt line is "186". So the output.ped file must look like this:
Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1
The values in the first column of the data.ped file are a subset of the values in the first column of the ids.txt file, so there is always a match.
Edit:
I've tried with this:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped
but when I check the result with:
cut -f 1-6 -d " " output.ped
I get this strange output:
A_Han-4.DG 0 0 1 1y).DG
A_Mbuti-5.DG 0 0 1 1y).DG
A_Karitiana-4.DG 0 0 1 1y).DG
A_French-4.DG 0 0 1 1y).DG
while if I use this command:
cut -f 1-6 -d " " output.ped | less
I get this:
Ignore_Han(discovery).DG^M A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG^M A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG^M A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG^M A_French-4.DG 0 0 1 1
and I can't figure out why there is that ^M in every line.
output:
This is a classic awk task, with variations depending on your requirements. The usual approach is to replace the first field of data.ped only if its value is found in ids.txt, and otherwise print the line unchanged. If you would like to remove lines that don't match, print only inside the matching action, as the command in your edit already does. There is no need for the input files to be sorted, and the order of the second file is preserved.
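As a minimal sketch of both variants, assuming whitespace-separated files with Unix line endings: the sample files below reuse rows from the question, except for the 999 row, which is a made-up ID added only to show what happens to a line with no match. The name output_matched.ped is likewise just illustrative.

```shell
# Sample inputs based on the question (first 6 columns only).
cat > ids.txt <<'EOF'
186 Ignore_Han(discovery).DG
187 Ignore_Mbuti(discovery).DG
188 Ignore_Karitiana(discovery).DG
191 Ignore_French(discovery).DG
EOF

# The 999 row is hypothetical: it has no entry in ids.txt.
cat > data.ped <<'EOF'
186 A_Han-4.DG 0 0 1 1
187 A_Mbuti-5.DG 0 0 1 1
999 A_Unknown-1.DG 0 0 1 1
191 A_French-4.DG 0 0 1 1
EOF

# While reading ids.txt (NR==FNR), build the lookup table: ID -> new label.
# While reading data.ped, replace $1 if it is a key of the table.
# The trailing 1 prints every line, so unmatched lines pass through unchanged.
awk 'NR==FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } 1' ids.txt data.ped > output.ped

# Variant: print only inside the match action, which drops unmatched lines.
awk 'NR==FNR { a[$1] = $2; next } $1 in a { $1 = a[$1]; print }' ids.txt data.ped > output_matched.ped
```

Note that assigning to $1 makes awk rebuild the line with single spaces between fields, which is fine for a space-delimited .ped file but would change a tab-delimited one unless OFS is set.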
UPDATE:
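If you have Ctrl-M characters in your inputs, remove them first from any file you use. Here the carriage return at the end of each ids.txt line becomes part of the replacement value, which is why ^M shows up right after the new first field. In general, I suggest running dos2unix on any text files that could contain characters like ^M or \r, which usually come from DOS/Windows editing.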
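One possible way to strip the carriage returns when dos2unix is not installed is tr, sketched below. The file names (ids_crlf.txt, ids_clean.txt, data_unix.ped, output_clean.ped) are placeholders for illustration, and the CRLF file is generated here only to reproduce the problem.

```shell
# Simulate an ids file saved with Windows (CRLF) line endings.
printf '186 Ignore_Han(discovery).DG\r\n187 Ignore_Mbuti(discovery).DG\r\n' > ids_crlf.txt
printf '186 A_Han-4.DG 0 0 1 1\n187 A_Mbuti-5.DG 0 0 1 1\n' > data_unix.ped

# Delete every carriage return. tr reads stdin and writes stdout,
# so the cleaned text goes to a new file rather than overwriting the input.
tr -d '\r' < ids_crlf.txt > ids_clean.txt

# With clean input, the replaced first field no longer ends in ^M.
awk 'NR==FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } 1' ids_clean.txt data_unix.ped > output_clean.ped
```

To check whether a file still contains carriage returns, `cat -v file | head` displays them as ^M, the same way less did in the question.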