I have a file with the following input lines.

John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115

I need to extract only the code that follows configuration/code/ on each line (for example 123.NNN), using the grep command.

I tried the command below but did not get the expected result: the output contains two unwanted extra characters. Is there another way to achieve this with grep?

find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\....' |
grep -o '...\....' | sort | uniq

Current Output:

123.NNN
456.1 a

Expected output:

123.NNN
456.1

8 Answers

2
Iren On Best Solutions

You can use another grep regular expression.

find ./ -type f -name f -exec cut -f 4 -d'|' {} +  |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq

. matches any single character, \. matches a literal dot, and [^ ]* matches any run of non-space characters, so the match stops at the first space.

Output:

123.NNN
456.1
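
The difference between the fixed-width pattern from the question and the [^ ]* variant can be sketched on two of the sample messages. This is only an illustration: the variable names sample, fixed and variable are made up here, and the pattern is shortened to start at code/ for brevity.

```shell
# Two sample messages from the question (field 4 only).
sample='Category is not found for local configuration/code/123.NNN and customer 113
Category is not found for local configuration/code/456.1 and customer 115'

# Fixed width: three characters, a dot, then exactly three more characters.
fixed=$(printf '%s\n' "$sample" | grep -o 'code/...\....' | cut -d/ -f2)

# Variable width: three characters, a dot, then everything up to the next space.
variable=$(printf '%s\n' "$sample" | grep -o 'code/...\.[^ ]*' | cut -d/ -f2)

printf 'fixed:\n%s\n\nvariable:\n%s\n' "$fixed" "$variable"
```

The fixed-width pattern swallows the space and the "a" of "and" after 456.1, which is where the two unwanted characters come from.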
0
Alex Harvey On

It's not possible just using grep. You should use AWK instead:

awk '{split($7, ar, "/"); print ar[3]}' FILE

Explanation:

  • The split function splits a string (here $7, the 7th whitespace-separated field) on the delimiter /, storing the pieces in the array ar.
  • The script then prints the 3rd element of that array.

Note:

  • I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn

Where aaa and ddd will not contain whitespace.

  • I also assume you really do have a file FILE containing those lines. It's a bit unclear.

Input:

▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115

Output:

▶ awk '{split($7, ar, "/"); print ar[3]}' FILE 
123.NNN
123.NNN
456.1
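
The output above still contains 123.NNN twice, whereas the question asks for each value once. A minimal sketch of one way to deduplicate it, assuming the sample lines are stored in a temporary file (the variable names file and codes are illustrative):

```shell
# Write the sample input from the question to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
EOF

# awk splits on whitespace by default, so field 7 is
# "configuration/code/<value>"; its 3rd "/"-separated piece is the value.
# sort -u then keeps one copy of each value.
codes=$(awk '{split($7, ar, "/"); print ar[3]}' "$file" | sort -u)
printf '%s\n' "$codes"

rm -f "$file"
```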
1
tripleee On

Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like

[0-9]\+\.[A-Z0-9]\+

would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.

find etc etc -exec awk -F '|' '
    $4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
        split($4, a, /\/code\//);
        # " " splits on whitespace; without it, split() falls back to FS, which is "|" here
        split(a[2], b, " "); print b[1] }' {} + |
sort -u

The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
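
As a rough illustration of the two-step split on a single sample line, the sketch below passes the separator for the second split explicitly, since split() would otherwise reuse the field separator set with -F. The variable names line and picked are made up for this example.

```shell
line='PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115'

picked=$(printf '%s\n' "$line" | awk -F '|' '{
    # First split: cut field 4 at "/code/"; a[2] is everything after it.
    split($4, a, /\/code\//)
    # Second split: " " means "split on runs of whitespace"; b[1] is the first word.
    split(a[2], b, " ")
    print b[1]
}')

printf '%s\n' "$picked"
```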

Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.

The regex variant supported by Awk differs slightly from the one supported by POSIX grep: the backslashed \+ in grep's BRE dialect is a plain + in the dialect called ERE, which is (more or less) supported by Awk and by grep -E. If you have grep -P, you can use a third variant which has a convenient feature:

find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u

The \K says "match up through here, but forget everything before this" and so only prints the part after this token.
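
To make the dialect differences concrete, here is a small sketch that runs the same idea through all three variants on one sample line. The variable names are illustrative, and the -P line assumes a GNU grep built with PCRE support.

```shell
line='TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114'

# BRE (plain grep): "one or more" is written \+ (a GNU extension to BRE).
bre=$(printf '%s\n' "$line" | grep -o '[0-9]\+\.[A-Z0-9]\+')

# ERE (grep -E): the same operator is a bare +.
ere=$(printf '%s\n' "$line" | grep -oE '[0-9]+\.[A-Z0-9]+')

# PCRE (grep -P): \K drops the "/code/" prefix from the reported match.
pcre=$(printf '%s\n' "$line" | grep -oP '/code/\K[0-9]+\.[A-Z0-9]+')

printf 'BRE:  %s\nERE:  %s\nPCRE: %s\n' "$bre" "$ere" "$pcre"
```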

0
James Brown On

An awk using match():

$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file

Output:

123.NNN
456.1

Pretty printed for slightly better readability:

$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
    print b
}' file
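
For readers unfamiliar with the idiom, the same logic can be spelled out step by step. This is a sketch of an equivalent script, not the answer's own code: the array is renamed seen, the extracted value is held in code, and the input is fed from a shell variable rather than a file.

```shell
sample='John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115'

unique=$(printf '%s\n' "$sample" | awk '
# match() returns the position of the first match (0 if there is none) and
# sets RSTART and RLENGTH to where the match starts and how long it is.
match($0, /[0-9]+\.[A-Z0-9]+/) {
    code = substr($0, RSTART, RLENGTH)
    # seen[code]++ evaluates to 0 only the first time a value appears,
    # so each value is printed once, in order of first appearance.
    if (!seen[code]++) print code
}')

printf '%s\n' "$unique"
```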
1
User123 On

With sed:

sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq

Output:

123.NNN
456.1
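
One caveat worth sketching: uniq only collapses duplicates that sit on adjacent lines, so it gives the expected result on the sample file because the two 123.NNN lines are neighbours. With a hypothetical ordering where they are not (the list in codes below is made up for illustration), sort -u behaves differently:

```shell
# Hypothetical extracted values with a non-adjacent duplicate.
codes='123.NNN
456.1
123.NNN'

# uniq compares each line only with the previous one, so nothing is removed.
with_uniq=$(printf '%s\n' "$codes" | uniq)

# sort -u sorts first, so the duplicates become adjacent and are merged.
with_sort_u=$(printf '%s\n' "$codes" | sort -u)

printf 'uniq:\n%s\n\nsort -u:\n%s\n' "$with_uniq" "$with_sort_u"
```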
0
Anubis On

A single sed command can do the filtering. (The pattern can be generalized further, as others have suggested, if that is an option. But be careful not to oversimplify it, or it may match unexpected input.)

sed -nE 's@^(\S+\s+){6}configuration/code/(\S+)\s.*@\2@p' input.txt

To replace your exact command,

find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's@^(\S+\s+){6}configuration/code/(\S+)\s.*@\2@p' | sort | uniq
1
glenn jackman On

I'd use the -P option:

grep -oP '/code/\K\S+' file | sort -u

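You want to extract the non-whitespace characters following /code/. Here \K drops the /code/ prefix from the reported match and \S+ matches the non-whitespace characters that follow.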

0
Ed Morton On

Simple substitutions on individual lines are the job sed is best suited for. This will work using any sed in any shell on any UNIX box:

$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115

$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1