AWK separates lines with capitals and non-capital letters with a semicolon, if there is no semicolon

I have a converted dictionary that I use in Pure Data. Each entry consists of three things: the word, how to pronounce it, and a semicolon to finish. In the converted dictionary some semicolons are missing, so I want AWK to find the missing ones and add the semicolons for me. I have used delimiters before, but this one is difficult for me, so any help is appreciated. See the text file below: the first three entries are good; the last three are wrong because the semicolon at the end is missing. I think the AWK delimiter will be between non-capital letters and capital letters, and the action is to add a semicolon if there is not one already. How can I put this in AWK code?

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
ELICIT
E
lic
it
ELICIT
E
lic
it

I have used some delimiters before, but I do not know how to specify "between" in AWK. The delimiter is the boundary between non-capital letters and capital letters, and a semicolon should be put there. So some code would look like this:

awk 'length($0)>1 && line with all capitals: put semicolon before this line'

or

awk 'line with non-capitals, if next line is capitals: put semicolon after line'

I have tried this:

awk 'length($0>1) && /[:^, upper:]/{l=l";"}NR>1{print l}{l=$0}END{print l}' file2

This does not work well.

Or am I heading in the wrong direction?

There are 6 answers

1
Daweo On BEST ANSWER

I would use GNU AWK for this task in the following way. Let the content of file.txt be

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
ELICIT
E
lic
it
ELICIT
E
lic
it

then

awk 'BEGIN{RS=""}{print gensub(/([[:lower:]])\n([[:upper:]])/,"\\1;\n\\2","g")}' file.txt

gives output

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin;
ELICIT
E
lic
it;
ELICIT
E
lic
it

Explanation: setting RS to the empty string engages paragraph mode; as file.txt has no blank lines, the whole file is treated as one record. Then I use the gensub string function to replace all ("g" for globally) occurrences of a lowercase letter followed by a newline followed by an uppercase letter with the first of those letters followed by a semicolon, a newline, and the second letter.

(tested in GNU Awk 5.1.0)
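
gensub is a GNU extension, so the command above will not run in other awks. As a rough sketch of the same idea for any POSIX awk, assuming (as in the sample) that an entry boundary is always a line ending in a lowercase letter followed by a line starting with an uppercase letter, one could hold back each line until the next one has been seen. The function name append_semicolons is made up for illustration.

```shell
# Hypothetical helper: append ";" to a line that ends in a lowercase letter
# when the next line starts with an uppercase letter.
append_semicolons() {
    awk '
        NR > 1 {
            # prev is the held-back previous line; $0 is the line after it
            if (prev ~ /[[:lower:]]$/ && $0 ~ /^[[:upper:]]/)
                prev = prev ";"
            print prev
        }
        { prev = $0 }
        END { if (NR) print prev }   # flush the last line unchanged
    ' "$@"
}

# Usage with a small made-up sample on standard input
printf '%s\n' ELGIN El gin ELICIT E lic it | append_semicolons
```

As with the gensub version, the very last entry stays unterminated because no uppercase line follows it. Only one line is held in memory at a time.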

5
Gilles Quénot On

Using shell and sed; the regexes are basic and easy to understand:

echo $(< file) |
     sed -E 's/ *;? *\b([A-Z]{2,})\b/;\1/g; s/;//; s/ +/\n/g; s/;/\n;\n/g'

echo $(< file) is a little hack that takes advantage of bash's word splitting to put the content of file on one line, so that it can be easily processed by sed.

Yields:

ELFKIN
Elf
kin
;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
;
ELICIT
E
lic
it
;
ELICIT
E
lic
it

3
jhnc On

awk '
    {
        if ( /^[[:upper:]]{2,}$/ && needs_terminator )
            print ";"
        print
        needs_terminator = !/;/
    }
    END {
        if (needs_terminator)
            print ";"
    }
' file

With your data, gives:

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
;
ELICIT
E
lic
it
;
ELICIT
E
lic
it
;

0
potong On

This might work for you (GNU sed):

sed -En '/^;$/d;h
         :a;p;n;/.+;$/{p;b};x;G;s/^(.+)(.*)\n\1$/\2/i;/^$/{x;s/$/;/p;b};x;ba' file

This solution compares the first word of a set with the subsequent words, consuming it piece by piece until the subsequent words have matched the first word in full, and if need be appends a ;.

N.B. This alters the original layout by deleting any lines that contain only a ;. Also, it does not rely solely on uppercase words to signify the beginning of a word set.

A couple of alternatives:

sed -E '/^;$/d;h
        :a;n;x;G;s/^(.+)(.*)\n\1;?$/\2/i;/^;?$/{x;s/;?$/\n;/;b};x;ba' file

Or just using the uppercaseness of the first word of a set:

sed -E '$!N;${s/;?$/;/;b};/\n[[:upper:]]{2,}/!{P;D};s/;?\n/;\n/' file
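
For readers more comfortable with awk, the first idea above (matching the syllable lines against the headword) might be sketched as follows. This is only an illustration under the assumption that the syllables, joined together and compared case-insensitively, spell the headword exactly; if they never do, the rest of the input is treated as one entry. The function name terminate_by_headword is hypothetical.

```shell
# Hypothetical helper: end an entry once its syllables spell the headword.
terminate_by_headword() {
    awk '
        /^;$/ { next }                      # drop bare terminator lines
        head == "" {                        # first line of an entry: the headword
            head = toupper($0); acc = ""
            print; next
        }
        {
            line = $0
            sub(/;$/, "", line)             # ignore an existing trailing semicolon
            acc = acc toupper(line)         # syllables collected so far
            if (acc == head) { print line ";"; head = "" }
            else print line
        }
    ' "$@"
}

# Usage with a small made-up sample on standard input
printf '%s\n' ELFLAND Elf land ';' ELGIN El gin | terminate_by_headword
```

Like the first sed command, this puts the semicolon on the last syllable line rather than on a line of its own.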

1
dawg On

You state: "I think the AWK delimiter will be between non-capital letters and capital letters, and the action is to put a semicolon if there is no semicolon already."

You can express that in a PCRE regex using a lookbehind for a lowercase letter together with a lookahead for a line ending followed by an uppercase letter.

To do that in the shell, you can use this Perl:

perl -0777 -pe 's/(?<=[a-z])(?=\R[A-Z])/;/g' file 

Prints:

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin;
ELICIT
E
lic
it;
ELICIT
E
lic
it

How does this work?

perl -0777 -pe 's/(?<=[a-z])(?=\R[A-Z])/;/g' file 
       ^                                          'gulp' mode - read as one string
            ^                                     autoprint mode
                ^                                 make a substitution

If you want a line separator between the blocks (you show both \n; and ; as delimiters...) just add \n to the substitution.

You can also use Ruby:

ruby -e 'puts $<.read.gsub(/(?<=[a-z])(?=\R[A-Z])/,";")' file
# same output    

Note: The accepted answer has potentially bad side effects. Suppose you have:

$  ls -1
File1
File2
File3
file

And suppose that file has the contents:

$ cat file
Line 1
Line 2
Line 3 *
Line 4

Now try the accepted answer's 'trick' for word splitting:

$ echo $(< file)
Line 1 Line 2 Line 3 File1 File2 File3 file Line 4

You can see that the contents of the file are subject not only to word splitting (which was the intent) but also to pathname expansion (globbing); in this case the * expanded to the names in the current working directory and corrupted the file input.
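
One way to avoid this hazard, sketched here under the assumption that the goal is only to join the lines with single spaces, is to let awk do the joining so that the shell never re-parses the file contents. The function name join_lines is hypothetical.

```shell
# Hypothetical helper: join all input lines into one line, separated by spaces.
join_lines() {
    awk '
        { printf "%s%s", (NR > 1 ? " " : ""), $0 }
        END { if (NR) print "" }     # finish with a single newline
    ' "$@"
}

# Usage: a "*" in the data stays a literal "*"
printf '%s\n' 'Line 1' 'Line 2' 'Line 3 *' 'Line 4' | join_lines
```

Unlike echo $(< file), this keeps runs of spaces inside a line as they are, so it is not a drop-in replacement if that squeezing was wanted.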

0
Ed Morton On

This, using any POSIX awk, may be what you're trying to do but without expected output in the question, it's a guess:

$ awk '/^[[:upper:]]/ && (prev ~ /[[:lower:]]$/){print ";"} {print; prev=$0}' file
ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
;
ELICIT
E
lic
it
;
ELICIT
E
lic
it

Note that, unlike some other solutions, the above does not require the whole input file to be read into memory and so will work no matter how large your input is.
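
If the last entry should be terminated as well (an assumption, since the question shows no expected output), a possible extension is to add an END rule that applies the same test to the final line. The function name add_terminators is made up for illustration.

```shell
# Hypothetical helper: print ";" on its own line between entries and,
# if the input ends in a lowercase letter, after the final entry too.
add_terminators() {
    awk '
        /^[[:upper:]]/ && (prev ~ /[[:lower:]]$/) { print ";" }
        { print; prev = $0 }
        END { if (prev ~ /[[:lower:]]$/) print ";" }
    ' "$@"
}

# Usage with a small made-up sample on standard input
printf '%s\n' ELGIN El gin ELICIT E lic it | add_terminators
```

This still processes the input one line at a time, so the memory use does not grow with the size of the file.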