Separating the last number in each line from the letters


73 views Asked by narm At 06 March 2024 at 11:54

I have a long file with provisional SNP IDs and alleles, which looks like this:

14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG

I would like to separate the last number in each line from the letters (separate SNP ID from alleles). So to look like this:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

I tried both awk and sed, but the underscore keeps causing problems. For example:

sed 's/^[0-9][0-9]*/& /' File1 > File2

gave me

14 _611646T,C
14 _881226CT,C
14 _861416.1GGC,GGCGCGCGCG

Can anyone help me?


There are 6 answers

0stone0 On 06 March 2024 at 11:58

Use sed 's/[[:alpha:]]/ &/' to insert a space before the first letter:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

Wiktor Stribiżew On 06 March 2024 at 12:10

To insert a space between the last digit on a line and the next non-digit char, you can use sed like this:

sed 's/\(.*[0-9]\)\([^0-9]\)/\1 \2/' file # BRE 
sed -E 's/(.*[0-9])([^0-9])/\1 \2/'  file # ERE

Details:

\(.*[0-9]\) (BRE) / (.*[0-9]) (ERE) - Group 1 (\1 in the replacement pattern refers to the value captured into this group): any text and then a digit (last occurrence on a line)
\([^0-9]\) (BRE) / ([^0-9]) (ERE) - Group 2 (\2 in the replacement pattern refers to the value captured into this group): a non-digit char.

See the Bash demo online:

#!/bin/bash
s='14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG'

sed 's/\(.*[0-9]\)\([^0-9]\)/\1 \2/' <<< "$s"
sed -E 's/(.*[0-9])([^0-9])/\1 \2/' <<< "$s"

Output:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

Ed Morton On 06 March 2024 at 17:54

Using any sed to add a blank after the last digit:

$ sed 's/.*[0-9]/& /' file
14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
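
As a rough illustration of why this works where the attempt in the question did not: `.*` is greedy, so `.*[0-9]` stretches to the last digit on the line, whereas `^[0-9][0-9]*` stops at the first non-digit (the underscore). The sketch below uses the sample lines from the question; the variable names are only for illustration.

```shell
# Sample data copied from the question.
input='14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG'

# Greedy match: .* consumes as much as it can, so [0-9] lands on the LAST digit.
last_digit=$(printf '%s\n' "$input" | sed 's/.*[0-9]/& /')

# The question's attempt: a leading run of digits only, which ends at the underscore.
leading_digits=$(printf '%s\n' "$input" | sed 's/^[0-9][0-9]*/& /')

printf 'last digit:\n%s\n' "$last_digit"
printf 'leading digits:\n%s\n' "$leading_digits"
```

Note that this relies on the alleles never containing a digit, which holds for the sample shown.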

Daweo On 06 March 2024 at 20:20

I would use GNU AWK for this task in the following way. Let the content of file.txt be

14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG

then

awk 'BEGIN{FPAT="[0-9_.]*|[ACGT,]*"}{$1=$1;print}' file.txt

gives output

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

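Explanation: FPAT tells GNU AWK that a field consists of either zero or more digits, underscores, and dots, or zero or more of the characters A, C, G, T, and comma. Assigning $1=$1 forces the record to be rebuilt with the default output field separator (a single space), and print then outputs the rebuilt line.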

(tested in GNU Awk 5.1.0)
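
FPAT is a GNU AWK extension. If gawk is not available, a similar result might be obtained with the standard `match()` function, which sets RSTART and RLENGTH. The following is only a sketch under the assumption that every ID consists of digits, underscores, and dots at the start of the line; the function name `split_id` is made up for this example.

```shell
# Sample data copied from the question.
input='14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG'

# split_id (hypothetical helper): print the leading ID, a space, then the rest.
# Lines that do not start with an ID are printed unchanged.
split_id() {
  awk 'match($0, /^[0-9_.]+/) {
         print substr($0, 1, RLENGTH) " " substr($0, RLENGTH + 1)
         next
       }
       { print }'
}

result=$(printf '%s\n' "$input" | split_id)
printf '%s\n' "$result"
```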

potong On 07 March 2024 at 08:21

This might work for you (GNU sed):

sed -E 's/^([^[:alpha:],]*[[:digit:]])([[:alpha:]])/\1 \2/' file

A pedantic solution.

N.B. This solution assumes that fields are separated by commas.

**Gilles Quénot** · Accepted Answer · 2024-03-06T11:56:33+00:00

Try to find the smartest way to achieve this.

It's better to avoid a regex that matches the whole line; instead, match only the portion that needs to change.

Using `sed` with `-E` (Extended Regular Expressions):

sed -E 's/^[0-9_.]+/& /' file

Yields:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

The regular expression matches as follows:

Node	Explanation
`^`	the beginning of the string anchor
`[0-9_.]+`	any character of: '0' to '9', '_', '.' (1 or more times (matching the most amount possible))

In the right part of sed's substitution, & is what matched in the left part.
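
For a long file it may be worth checking that nothing except the inserted space changed. One possible sanity check, sketched below, removes the first space from each output line and compares the result with the input. The temporary files stand in for the question's File1 and File2 and are assumptions of this example.

```shell
# Temporary files standing in for File1 (input) and File2 (output).
File1=$(mktemp)
File2=$(mktemp)
printf '%s\n' '14_611646T,C' '14_881226CT,C' '14_861416.1GGC,GGCGCGCGCG' > "$File1"

# The substitution from this answer.
sed -E 's/^[0-9_.]+/& /' "$File1" > "$File2"

# Round trip: deleting the first space on each line should reproduce the input.
if sed 's/ //' "$File2" | cmp -s - "$File1"; then
  roundtrip=ok
else
  roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

The check assumes the input lines contain no spaces of their own, as in the sample data.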

Bonus

sed 's/[[:upper:]]/ &/' file

[[:upper:]] is a POSIX regex class meant for all upper case letters.
