Replace a value if this value is present in a txt file

Good morning everyone. I have a data.ped file made up of thousands of columns and hundreds of lines. The first 6 columns and the first 4 lines of the file look like this:

186 A_Han-4.DG 0 0 1 1
187 A_Mbuti-5.DG 0 0 1 1
188 A_Karitiana-4.DG 0 0 1 1
191 A_French-4.DG 0 0 1 1

And I have a ids.txt file that looks like this:

186 Ignore_Han(discovery).DG
187 Ignore_Mbuti(discovery).DG
188 Ignore_Karitiana(discovery).DG
189 Ignore_Yoruba(discovery).DG
190 Ignore_Sardinian(discovery).DG
191 Ignore_French(discovery).DG
192 Dinka.DG
193 Dai.DG

What I need (in Unix) is to replace each value in the first column of data.ped with the value from the second column of ids.txt, taken from the line whose first column matches. For example, "186" in the first column of data.ped should become "Ignore_Han(discovery).DG", because that is the value next to "186" in ids.txt. So the output.ped file must look like this:

Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1

The values in the first column of data.ped are a subset of the values in the first column of ids.txt, so there is always a match.


Edit:

I've tried with this:

awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped

but when I check the result with:

cut -f 1-6 -d " " output.ped

I get this strange output:

A_Han-4.DG 0 0 1 1y).DG
A_Mbuti-5.DG 0 0 1 1y).DG
A_Karitiana-4.DG 0 0 1 1y).DG
A_French-4.DG 0 0 1 1y).DG

while if I use this command:

cut -f 1-6 -d " " output.ped | less

I get this:

Ignore_Han(discovery).DG^M A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG^M A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG^M A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG^M A_French-4.DG 0 0 1 1

and I can't figure out why there is that ^M in every line.

There are 2 answers

thanasisp (best answer)
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]} 1' ids.txt data.ped

output:

Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1

This is a classic awk task, and the command can be adapted to your requirements. Here the first field of data.ped is replaced only if its value was found in ids.txt; otherwise the line is printed unchanged. If you would rather drop the lines that don't match:

awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped

There is no need for the input files to be sorted and the order of the second file is preserved.
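For readers new to the NR==FNR idiom, here is the same one-liner spread over several lines with comments. It is only a sketch: the file names ids_demo.txt, data_demo.ped and output_demo.ped are made up for the demo, and the sample rows are a shortened copy of the data in the question, deliberately given in a different order in the two files.

```shell
# Hypothetical demo files (shortened copies of the data in the question).
printf '%s\n' '186 Ignore_Han(discovery).DG' '187 Ignore_Mbuti(discovery).DG' '192 Dinka.DG' > ids_demo.txt
printf '%s\n' '187 A_Mbuti-5.DG 0 0 1 1' '186 A_Han-4.DG 0 0 1 1' > data_demo.ped

awk '
    # NR==FNR holds only while the first file (the ids file) is being read:
    # remember the mapping from ID to label, then skip to the next line.
    NR == FNR { a[$1] = $2; next }
    # Second file: replace field 1 when its value is a known ID.
    $1 in a   { $1 = a[$1] }
    # A bare true pattern prints every line, changed or not.
    1
' ids_demo.txt data_demo.ped > output_demo.ped

cat output_demo.ped
```

The rows come out in the order of data_demo.ped (Mbuti first, then Han), which illustrates that neither file has to be sorted.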


UPDATE:

If you have Ctrl-M characters in your inputs (^M is how less displays a carriage return, \r, not a literal caret followed by M), remove them first with

tr -d '\r' < file > file.tmp && mv file.tmp file

for every file you use. In general, I suggest running dos2unix on any text file that could contain carriage returns, which usually come from DOS/Windows editing.
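If you would rather not modify the input files, another option is to strip the carriage return inside awk itself. This is a sketch, not part of the original answer: the file names ids_crlf.txt, data_crlf.ped and output_clean.ped are invented for the demo, and the inputs are written with Windows line endings on purpose.

```shell
# Hypothetical demo files with CRLF line endings.
printf '186 Ignore_Han(discovery).DG\r\n187 Ignore_Mbuti(discovery).DG\r\n' > ids_crlf.txt
printf '186 A_Han-4.DG 0 0 1 1\r\n187 A_Mbuti-5.DG 0 0 1 1\r\n' > data_crlf.ped

# sub(/\r$/, "") removes a trailing carriage return from each input line
# before the usual lookup-and-replace logic runs.
awk '{ sub(/\r$/, "") } NR==FNR { a[$1]=$2; next } $1 in a { $1=a[$1] } 1' \
    ids_crlf.txt data_crlf.ped > output_clean.ped

cat output_clean.ped
```

The input files keep their original line endings; only the output is free of carriage returns.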

Zoro

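Use the join command to join the two files on their first column (join expects both inputs to be sorted on that column):

join ids.txt data.ped > temp

Each line of temp now starts with the ID, followed by the new label and the rest of the data.ped line. Then use the cut command to remove the first column: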

cut -d " " -f 2- temp > output.ped
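If the files are not already sorted on the first column, a variant that sorts them first could look like the sketch below. The file names (ids_demo.txt, data_demo.ped, the *_sorted files and output_join.ped) are made up for the demo. Unlike the awk approach, the output follows the sorted order rather than the original order of the data file.

```shell
# Hypothetical demo files; the data file is intentionally out of order.
printf '%s\n' '186 Ignore_Han(discovery).DG' '187 Ignore_Mbuti(discovery).DG' '192 Dinka.DG' > ids_demo.txt
printf '%s\n' '187 A_Mbuti-5.DG 0 0 1 1' '186 A_Han-4.DG 0 0 1 1' > data_demo.ped

# join needs both inputs sorted on the join field (column 1 here).
sort -k1,1 ids_demo.txt > ids_sorted.txt
sort -k1,1 data_demo.ped > data_sorted.ped

# join prints: ID, label, remaining data columns; cut then drops the ID.
join ids_sorted.txt data_sorted.ped | cut -d ' ' -f 2- > output_join.ped

cat output_join.ped
```

IDs that appear only in the ids file (192 in this demo) are simply left out, because join prints only lines that pair up.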