unable to convert tab delimited .txt file to csv

1k views Asked by At

I am using tr to convert tab delimited .txt file to comma delimited csv file. Here is my file output

arron@arron-Ideapad-Z570 ~/Phd $ cat test_pph_s.txt | cut -f 1,2,3
#o_acc                   o_pos  o_aa1
ENSG00000145888            455      H
ENSG00000145888            450      R
ENSG00000145888            440      M
ENSG00000145888            428      R
ENSG00000145888            428      R

here is full file link https://drive.google.com/file/d/0B0iDswLYaZ0zV3ktekhyeGxwTlk/view?usp=sharing

and trying to swap whitespace for commas using tr:

arron@arron-Ideapad-Z570 ~/Phd $ cat test_pph_s.csv | cut -f 1,2,3 | tr "\\t" ","
#o_acc               , o_pos,o_aa1
ENSG00000145888     ,   455,    H
ENSG00000145888     ,   450,    R
ENSG00000145888     ,   440,    M
ENSG00000145888     ,   428,    R
ENSG00000145888     ,   428,    R

which appears to just add a comma in and not remove the whitespace.

However I notice if I open the .txt file in libreoffice and save as .csv, it gets converted to utf8 format and then running

arron@arron-Ideapad-Z570 ~/Phd $ cat test_pph_s.csv | cut -f 1,2,3 | tr "\\t" ","

gives me

#o_acc,o_pos,o_aa1
    ENSG00000145888,455,H
    ENSG00000145888,450,R
    ENSG00000145888,440,M
    ENSG00000145888,428,R
    ENSG00000145888,428,R

which is what I want.

what am I doing wrong in my first attempt? It seems to me that using tr is not the "correct" way to convert to a .csv file.

thanks.

3

There are 3 answers

1
Leon On BEST ANSWER

It seems you have mix of tabs and spaces

cut -f 1,2,3 < input.txt | tr -s [:blank:] ','

Here tr will collapse all white space to a single character and then replace it with comma. You also do not need cat, but you can use it if you prefer it that way :)

0
David C. Rankin On

A simple bash script with a while read loop can handle it simply. Here you simply read in all of the fields of data from the file, and then output the ones you are interested in, separated by a comma. The first argument is the filename to read (just defaulting to your test file). You can also include a second argument passing the output file name if you like and redirect the output to that filename. The script below, just used the field names as variable to hold the values of each field on read, then outputs the ones requested (you can use any variable name you like, I just found it easier to retain the field names):

#!/bin/bash

fn="${1:-dat/test_pph_s.txt}"

while read o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 prediction pph2_prob pph2_FPR pph2_TPR ||
[ -n "$pph2_TPR" ]; do
    printf "%s,%s,%s\n" "$o_acc" "$o_pos" "$o_aa1"
done <"$fn"

exit 0

output:

$ bash tab2csv.sh
#o_acc,o_pos,o_aa1
ENSG00000145888,455,H
ENSG00000145888,450,R
ENSG00000145888,440,M
ENSG00000145888,428,R
ENSG00000145888,428,R
ENSG00000145888,413,R
ENSG00000145888,412,M
ENSG00000145888,406,S
ENSG00000145888,402,P
1
JNevill On

I'm unable to replicate using iconv to convert between ascii and utf8. I'm not sure what libreoffice is doing that you are getting extra characters at the front of each record. Try using awk to accomplish the same thing and see if it has better success:

awk -F'\t' 'BEGIN {OFS=","} {print $1, $2, $3}' test_pph_s.csv

This is taking in a tab delimited file -F'\t' and outputting comma delimited OFS=','