Separating the last number in each line from the letters


I have a long file with provisional SNP IDs and alleles, which looks like this:

14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG

I would like to separate the last number in each line from the letters (separate SNP ID from alleles). So to look like this:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

I tried both awk and sed, but the underscore keeps causing problems. For example:

sed 's/^[0-9][0-9]*/& /' File1 > File2

gave me

14 _611646T,C
14 _881226CT,C
14 _861416.1GGC,GGCGCGCGCG

Can anyone help me?

There are 6 answers

0
Gilles Quénot On BEST ANSWER

Think about the simplest way to achieve this.

It's better to avoid a regex that matches the whole line; instead, match only the portion that needs to change.

Using sed with -E (Extended Regular Expressions):

sed -E 's/^[0-9_.]+/& /' file

Yields:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

The regular expression matches as follows:

Node        Explanation
^           the beginning-of-string anchor
[0-9_.]+    any character of: '0' to '9', '_', '.' (1 or more times, matching as many as possible)

In the right part of sed's substitution, & is what matched in the left part.
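
As a quick check of the explanation above, here is a minimal sketch that feeds the question's three sample lines through the same substitution. The function name split_id is hypothetical and only wraps the command for reuse; -E is assumed to be available (it is in GNU and BSD sed).

```shell
#!/bin/sh
# Hypothetical helper: append a space to the leading run of digits, '_' and '.'.
# '&' in the replacement stands for the text matched by ^[0-9_.]+
split_id() {
    sed -E 's/^[0-9_.]+/& /'
}

# Sample lines taken from the question.
printf '%s\n' '14_611646T,C' '14_881226CT,C' '14_861416.1GGC,GGCGCGCGCG' | split_id
```

Because the pattern is anchored with ^ and stops at the first character that is not a digit, underscore, or dot, the rest of the line is never touched.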

Bonus

sed 's/[[:upper:]]/ &/' file

[[:upper:]] is a POSIX character class that matches any uppercase letter. Without the g flag, sed inserts the space only before the first one on each line.

0
0stone0 On

Use sed 's/[[:alpha:]]/ &/' to insert a space before the first letter:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
0
Wiktor Stribiżew On

To insert a space between the last digit on a line and the next non-digit char, you can use sed like this:

sed 's/\(.*[0-9]\)\([^0-9]\)/\1 \2/' file # BRE 
sed -E 's/(.*[0-9])([^0-9])/\1 \2/'  file # ERE

Details:

  • \(.*[0-9]\) (BRE) / (.*[0-9]) (ERE) - Group 1 (\1 in the replacement pattern refers to the value captured into this group): any text and then a digit (last occurrence on a line)
  • \([^0-9]\) (BRE) / ([^0-9]) (ERE) - Group 2 (\2 in the replacement pattern refers to the value captured into this group): a non-digit char.

See the Bash demo online:

#!/bin/bash
s='14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG'

sed 's/\(.*[0-9]\)\([^0-9]\)/\1 \2/' <<< "$s"
sed -E 's/(.*[0-9])([^0-9])/\1 \2/' <<< "$s"

Output:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
0
Ed Morton On

Using any sed to add a blank after the last digit:

$ sed 's/.*[0-9]/& /' file
14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
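
To see why this works, it can help to make the match visible. The sketch below replaces the trailing space with brackets around & so the matched text stands out; mark_match is a hypothetical name used only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: wrap whatever .*[0-9] matches in brackets.
# .* is greedy, so the match runs from the start of the line to the LAST digit.
mark_match() {
    sed 's/.*[0-9]/[&]/'
}

printf '%s\n' '14_861416.1GGC,GGCGCGCGCG' | mark_match
# prints: [14_861416.1]GGC,GGCGCGCGCG
```

Note that this relies on the alleles containing no digits, as in the sample data; a digit later in the line would move the split point to that digit.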
0
Daweo On

I would use GNU AWK for this task in the following way. Let the content of file.txt be

14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG

then

awk 'BEGIN{FPAT="[0-9_.]*|[ACGT,]*"}{$1=$1;print}' file.txt

gives output

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

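Explanation: I tell GNU AWK that a field consists of either zero or more digits, underscores, and dots, or zero or more of A, C, G, T, and commas. Then I use $1=$1 to trigger a rebuild of the record (the fields are rejoined with the default output separator, a single space) and print the resulting line.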

(tested in GNU Awk 5.1.0)
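
FPAT is a GNU extension, so the command above will not work in other awk implementations. As a rough alternative sketch, assuming only the standard match() and substr() functions, the same split can be written as follows; split_id_awk is a hypothetical wrapper name.

```shell
#!/bin/sh
# Hypothetical helper: split each line after its leading run of digits, '_' and '.'.
# match() sets RLENGTH to the length of the matched prefix; lines without
# such a prefix are printed unchanged.
split_id_awk() {
    awk 'match($0, /^[0-9_.]+/) {
             print substr($0, 1, RLENGTH) " " substr($0, RLENGTH + 1)
             next
         }
         { print }'
}

printf '%s\n' '14_611646T,C' '14_881226CT,C' '14_861416.1GGC,GGCGCGCGCG' | split_id_awk
```

Unlike the FPAT version, this does not restrict the allele part to A, C, G, T, and commas; everything after the ID prefix is passed through as-is.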

0
potong On

This might work for you (GNU sed):

sed -E 's/^([^[:alpha:],]*[[:digit:]])([[:alpha:]])/\1 \2/' file

A pedantic solution.

N.B. This solution assumes that fields are separated by commas.