Extract Regex Capture Group in Script


I am writing a CSH script and attempting to extract text from a source string given a key.

#!/bin/csh -f
set source = "Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n"
set toFind = "Star Trek"
set regex = "$toFind[\s]*?(.*?)[\s]*?"
set match = `expr $source : $regex`
echo $match

The above code does not work, so I am missing something. I also tried placing "Star Trek" directly inside the regex rather than using a variable. I should see Renegades as the answer. Had I put "Star Wars" instead of "Star Trek", I should have seen The Empire Strikes Back.

Google search showed a possible solution using grep, such as

match = `grep -Po '<something>' <<< $source`

I did not know what to put for <something>, nor am I an expert in grep.

In the real code, I am reading text from a file. I just simplified things here.

Thoughts?

There are 4 answers

2
Sarah Weinberger On BEST ANSWER

The real solution uses a file for the source, so is:

set valueCapture=`cat /mypath/filename | grep -A1 "${tofind}" | grep -v "${tofind}" | xargs`

The code to find a capture value from a string should be (did not test it):

set valueCapture=`printf '%b' "$source" | grep -A1 "${tofind}" | grep -v "${tofind}" | xargs`

In both cases, what I wish to find is:

set tofind='asdf1@wxyz2'

The xargs part trims off whitespace.
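
As a minimal sketch, the same pipeline can be tried against the sample data from the question. The helper names next_title and sample_text are hypothetical, and grep -A1 is assumed to be available (it is a GNU/BSD extension, not POSIX). Note that xargs also interprets quotes, so a title containing an apostrophe would need a different trimming step.

```shell
# Hypothetical helper: print the line that follows the line containing the key.
# $1 = key to look for; the text is read from standard input.
next_title() {
    # grep -A1 keeps the matching line plus one line of trailing context,
    # grep -v drops the matching line itself, xargs trims the whitespace.
    grep -A1 -- "$1" | grep -v -- "$1" | xargs
}

# Sample data from the question, with real newlines and tabs.
sample_text() {
    printf 'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n'
}

sample_text | next_title 'Star Trek'   # prints: Renegades
sample_text | next_title 'Star Wars'   # prints: The Empire Strikes Back
```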

1
Sarah Weinberger On

The following is not a literal answer to my question, since I asked about csh; however, I wrote a solution using bash.

References: Match Regex Capture Groups and, for matching whitespace, How can I match spaces with a regexp in Bash?

I used Tutorial Point to debug.

mystring1='  asdf1@wxyz2  @@a!s#d@f@@  asdf2@wxyz2 b!t#e@g '

tofind='asdf1@wxyz2'
regex="${tofind}[[:space:]]*([.!@\#a-zA-Z0-9]+)"

[[ $mystring1 =~ $regex ]]

echo $'\n'
echo $'\n'
echo '***********************'
echo ${BASH_REMATCH[1]}
echo '***********************'
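
The same bash technique can be sketched against the multi-line data from the original question. This assumes bash (not csh) and that the text holds real newlines and tabs; the function name capture_after is hypothetical. The capture group is forced to start at a non-whitespace character so the leading tab is not included.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first non-blank text after the key, up to end of line.
# $1 = key (used as a regex fragment), $2 = text to search.
capture_after() {
    local key=$1 text=$2 nl=$'\n'
    # key, then whitespace (including newline and tab), then the rest of that line
    local regex="${key}[[:space:]]+([^[:space:]][^${nl}]*)"
    if [[ $text =~ $regex ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1   # key not found
    fi
}

text=$'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n'
capture_after 'Star Trek' "$text"   # prints: Renegades
capture_after 'Star Wars' "$text"   # prints: The Empire Strikes Back
```

Because the key is spliced into the regex, any regex metacharacters in it would be interpreted rather than matched literally.
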
0
Ed Morton On

Since you said your real input is in a file, here's the file your printf outputs:

$ cat file
Smurfs
        Papa
Star Trek
        Renegades
        Star Wars
        The Empire Strikes Back

and here's how to match and print the strings you want from it:

$ awk -v tgt='Star Trek' '{gsub(/^[[:space:]]+|[[:space:]]+$/,"")} $0==tgt{n=NR+1} NR==n' file
Renegades

$ awk -v tgt='Star Wars' '{gsub(/^[[:space:]]+|[[:space:]]+$/,"")} $0==tgt{n=NR+1} NR==n' file
The Empire Strikes Back

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
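
For readers new to awk, here is a commented sketch of the same approach wrapped in a shell function. The name next_line_awk is hypothetical, and [ \t] is swapped in for [[:space:]] on the assumption that some older awk builds lack POSIX character classes. Keep in mind that awk processes backslash escapes in values passed with -v.

```shell
# Hypothetical helper: print the trimmed line that follows the line equal to the key.
# $1 = key, $2 = file to read.
next_line_awk() {
    awk -v tgt="$1" '
        # Strip leading and trailing blanks and tabs from every line.
        { gsub(/^[ \t]+|[ \t]+$/, "") }
        # When the trimmed line equals the key, remember the next line number.
        $0 == tgt { n = NR + 1 }
        # Print the line whose number was remembered.
        NR == n
    ' "$2"
}

tmpfile=$(mktemp)
printf 'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n' > "$tmpfile"
next_line_awk 'Star Trek' "$tmpfile"   # prints: Renegades
next_line_awk 'Star Wars' "$tmpfile"   # prints: The Empire Strikes Back
rm -f "$tmpfile"
```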

0
Paul Hodges On

A pipeline can do it, though it isn't as good as Ed's single-process awk.

$: toFind="Star Wars"; echo "$source" |  grep -EA1 "$toFind" | tail -1
        The Empire Strikes Back

$: toFind="Star Trek"; echo "$source" |  grep -EA1 "$toFind" | tail -1
        Renegades

$: echo "$source">file; toFind="Star Trek"; grep -EA1 "$toFind" file | tail -1
        Renegades

sed would work too.

$: toFind="Star Trek"; sed -n "/$toFind/{n
                                         p}" file # should work with any version
        Renegades

$: toFind="Star Wars"; sed -n "/$toFind/{n;p}" file # semicolon is GNU
        The Empire Strikes Back

With any of these, it is probably worth refining your regex.

$: toFind="Star"; sed -n "/$toFind/{n;p}" file
        Renegades
        The Empire Strikes Back

$: toFind="Star"; sed -n "/^$toFind$/{n;p}" file

$: toFind="Star Trek"; sed -n "/^$toFind$/{n;p}" file
        Renegades

$: toFind="Star Wars"; sed -n "/^$toFind$/{n;p}" file # fails because of the leading tab

That last failure might mean you have to accept the looser, unanchored pattern from the first example.
Test your logic.
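
One possible middle ground, sketched under the assumption that keys may be indented, is to keep the anchors but allow optional whitespace around the key. The function name next_line_sed is hypothetical; the script uses newlines instead of semicolons so it should not depend on GNU sed, and the key is still interpreted as a regex.

```shell
# Hypothetical helper: print the line after a line consisting only of the key,
# ignoring whitespace around the key and stripping the result's indentation.
# $1 = key (interpreted as a basic regex), $2 = file to read.
next_line_sed() {
    key=$1
    sed -n "/^[[:space:]]*${key}[[:space:]]*\$/{
n
s/^[[:space:]]*//
p
}" "$2"
}

tmpfile=$(mktemp)
printf 'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n' > "$tmpfile"
next_line_sed 'Star Trek' "$tmpfile"   # prints: Renegades
next_line_sed 'Star Wars' "$tmpfile"   # prints: The Empire Strikes Back
next_line_sed 'Star' "$tmpfile"        # prints nothing: no line is exactly "Star"
rm -f "$tmpfile"
```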