Extract multiple strings from a multiple lines text file - larger test file


How can I extract multiple strings from a multi-line text file using these rules? The search strings are "String server", " pac " and "String method". Each appears at most once within an enclosing "{}" and may be absent. Once the search strings are matched, extract their values enclosed within "", without the "()". The value of "String server" or " pac " is output only once (no duplication), and it comes before the value of "String method". For example, the sample text file in:

{{{{{
public AResponse retrieveA(ARequest req){
    String server = "AAA";
    String method =  "retrieveA()";
    log.info(method,
            server,
            req);
    return res;
}

public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
    log.info(method,
            pac,
            req);
    return res;
}

public CResponse retrieveC(CRequest req) {
    String server = "CCC";
    log.info(server,
            req);
    return res;
}

public DResponse retrieveD(DRequest req) {
    String method = "retrieveD()";
    log.info(method,req);
    return res;
}

public EResponse retrieveE(ERequest req){
    EEE pac = new EEE();
    String method =  "retrieveE()";
    String server = "EEE";
    log.info(method,
            server,
            pac,
            req);
    return res;
}

public FResponse callretrieveF(FRequest req) throws InvalidDataException {
        String server = "FFFFF";
        //retrieveF
        String method =  "retrieveF()";
        try {
            log.info(method,
                     server,
                     req);

            FFFFF pac = new FFFFF();
        }
}

/**
 * callgetG
* getG
*/
public GResponse callgetG(GRequest req) throws InvalidDataException {
        //getG
        String method =  "getG()";
        String server = "GGGGGG";
        try {
            try {
                GGGGGG pac = new GGGGGG();
                log.info(method,
                     server,
                     req);
            }
        }
}

/**
 * getH
*/
    public HResponse getH(HRequest req) 
                                throws InvalidDataException {

        //getH
        String method =  "getH()";
        String server = "HHHHHHH";
        String calledMethod =  "getH2()";

        ARequest aReq = new ARequest(req.getH(),
                                     req.getR());
        ProgramAccountInformationResponse resp = null;
        try {
            log.info(LogMessages.msgInfoMethodStartPrivate(method,
                                                           server,
                                                           calledMethod,
                                                           req));
            return resp;
        }catch(InvalidDataException ide){
            log.error(method);
            throw ide;
        }
    }

}}}}}

Expected output:

AAA retrieveA
BBB retrieveB
CCC 
retrieveD
EEE retrieveE
FFFFF retrieveF
GGGGGG getG
HHHHHHH getH

I tried the solution from: Extract multiple strings from a multiple lines text file

awk -v OFS='\t' -F= '
/\{[[:blank:]]*$/ {++n}
NF==2 && /String | pac/ {
   gsub(/^[[:blank:]]*("|new +)|[()";]+$/, "", $2)
   if ($1 ~ / (server|pac)/)
      col1[n] = $2
   else if ($1 ~ / method/)
      col2[n] = $2
}
END {
   for (i=1; i<=n; ++i)
      print col1[i], col2[i]
}' in

There are 3 answers

Answer by anubhava (best answer)

You may try a modified version of the previous awk script here.

cat parse.awk

BEGIN { OFS="\t"; FS="=" }
/^[[:blank:]]*public / {++n}
NF==2 && /^[[:blank:]]*String | pac *=/ {
   gsub(/^[[:blank:]]*("|new +)|[()";]+$/, "", $2)
   if ($1 ~ / (server|pac)/)
      col1[n] = $2
   else if ($1 ~ / method/)
      col2[n] = $2
}
END {
   for (i=1; i<=n; ++i)
      print col1[i], col2[i]
}

Then use it as:

awk -f parse.awk file

AAA     retrieveA
BBB     retrieveB
CCC
        retrieveD
EEE     retrieveE
FFFFF   retrieveF
GGGGGG  getG
HHHHHHH getH
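
For readers less familiar with awk, the value cleanup done by the gsub() call can be reproduced in Python. This is only a rough sketch: the pattern is copied from the script above, while the names CLEANUP and clean_value are made up for illustration.

```python
import re

# Same pattern as the gsub() call: strip a leading quote or "new ",
# and a trailing run of ( ) " ; characters.
CLEANUP = re.compile(r'^[ \t]*("|new +)|[()";]+$')

def clean_value(rhs):
    """Return the bare value from the right-hand side of an assignment."""
    return CLEANUP.sub("", rhs)

for line in ['    String server = "AAA";',
             '    String method =  "retrieveA()";',
             '    BBB pac = new BBB();']:
    # Split on the first "=", as awk does with FS="="
    lhs, rhs = line.split("=", 1)
    print(clean_value(rhs))
```

This should print AAA, retrieveA and BBB, one per line.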
Answer by Kaz

A minor change to the TXR solutions handles it:

$ txr extract2.txr longer-data
AAA retrieveA
BBB retrieveB
CCC
retrieveD
EEE retrieveE
FFFFF retrieveF
GGGGGG getG
HHHHHHH getH

Code:

@(repeat)
@(freeform 2)
@/ */public@nil{
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@/ */}
@  (end)
@  (do
     (put-line
       (cond
         ((and server meth) `@server @meth`)
         ((and meth pac) `@pac @meth`)
         (server)
         (meth))))
@(end)

A key detail is the @(freeform 2) which instructs TXR to treat the next two lines as if they were one line (with embedded \n characters), then match into them. Any unmatched material is divided back into multiple lines and pushed back into the input. This easily handles the public ... function header that is split across two lines. We recognize some optional spaces since there is an occurrence of public not in the first column, and also a closing brace not in the first column.
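
For comparison, a loose Python analogue of the two-line matching might look like the following. It is a sketch under assumptions, not a description of how TXR works internally: it only joins a window of lines and matches against the joined string, and the names HEADER and header_starts_at are hypothetical.

```python
import re

# A header is "public", then anything that is not "{", then "{".
# The negated class [^{] also matches a newline, so the header may wrap.
HEADER = re.compile(r"[ \t]*public\b[^{]*\{")

def header_starts_at(lines, i, span=2):
    """Return True if a method header starts at lines[i].

    Up to `span` lines are joined with a newline and matched as one
    string, loosely imitating the effect of @(freeform 2).
    """
    window = "\n".join(lines[i:i + span])
    return HEADER.match(window) is not None

lines = [
    "    public HResponse getH(HRequest req) ",
    "                                throws InvalidDataException {",
    "",
    '        String method =  "getH()";',
]
print([header_starts_at(lines, i) for i in range(len(lines))])
```

Only the first position is reported as a header start, even though the opening brace is on the second line.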

I see that the indentation matches: the curly brace which closes each function is aligned with the public. We could make the matching stricter so that we match a closing brace which is at the same indentation as public, and not just any curly brace.

This doesn't change the output, but is good to know about:

@(repeat)
@(freeform 2)
@{indent / */}public@nil{
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@indent}
@  (end)
@  (do
     (put-line
       (cond
         ((and server meth) `@server @meth`)
         ((and meth pac) `@pac @meth`)
         (server)
         (meth))))
@(end)

By changing the regex match @/ */ (zero or more spaces) to @{indent / */}, we capture that indentation into the indent variable. Later, when we mention the @indent variable, it has to match exactly the same text: the same number of spaces.

Thus if we had a situation like this, it would still work:

        String method =  "getH()";
        if (whatever) {
          // ...
        }
        String server = "HHHHHHH";
        String calledMethod =  "getH2()";

It wouldn't mistake the closing brace of the if for the end of the function, so it would still pick up the server variable.
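
The same-indentation idea can be imitated in Python with a named group and a backreference. This is a sketch rather than a translation of the TXR code; METHOD, SAMPLE and the single-regex approach are assumptions made for illustration.

```python
import re

# (?P<indent>...) captures the whitespace before "public"; (?P=indent) then
# requires the closing brace to sit at exactly the same indentation.
METHOD = re.compile(
    r"^(?P<indent>[ \t]*)public\b[^{]*\{[ \t]*\n"  # header, may span lines
    r"(?P<body>.*?)"                               # body, as short as possible
    r"^(?P=indent)\}[ \t]*$",                      # brace aligned with "public"
    re.MULTILINE | re.DOTALL,
)

SAMPLE = '''\
    public HResponse getH(HRequest req)
                                throws InvalidDataException {
        String method =  "getH()";
        if (whatever) {
          // ...
        }
        String server = "HHHHHHH";
    }
'''

match = METHOD.search(SAMPLE)
body = match.group("body")
# The brace that closes the "if" is indented deeper, so it does not end the body.
server = re.search(r'String server = "([^"]*)"', body)
print(server.group(1))
```

This prints HHHHHHH: the brace closing the if sits at a deeper indentation than public, so the body extends to the aligned brace and still contains the server line.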

In the first place, we don't have to look for closing braces; that's just a decision I made. Since the functions all start with public, we could use the presence of public as a section delimiter:

@(repeat)
@/.*public.*/
@  (gather :vars ((server nil) (meth nil) (pac nil)))
 String server = "@server";
 String method = "@meth()";
 @pac pac = new @pac();
@  (until)
@/.*public.*/
@  (end)
@  (do
     (put-line
       (cond
         ((and server meth) `@server @meth`)
         ((and meth pac) `@pac @meth`)
         (server)
         (meth))))
@(end)
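
The public-as-delimiter idea carries over to other languages too. Below is a minimal Python sketch of it, assuming that the word public only occurs on method header lines; the function name extract and the three regexes are illustrative choices, not part of the TXR solution.

```python
import re

def extract(text):
    """Return one output row per section that starts at a 'public' line."""
    rows = []
    # Split at every line containing "public"; drop whatever precedes the first one.
    sections = re.split(r"(?m)^.*\bpublic\b.*$", text)[1:]
    for section in sections:
        server = re.search(r'String server\s*=\s*"([^"]*)"', section)
        pac = re.search(r"(\w+) pac\s*=\s*new ", section)
        method = re.search(r'String method\s*=\s*"([^"(]*)', section)
        first = server or pac  # prefer server, fall back to pac
        rows.append(" ".join(m.group(1) for m in (first, method) if m))
    return rows

SAMPLE = '''\
public AResponse retrieveA(ARequest req){
    String server = "AAA";
    String method =  "retrieveA()";
}

public BResponse retrieveB(BRequest req){
    String method =  "retrieveB()";
    BBB pac = new BBB();
}

public CResponse retrieveC(CRequest req) {
    String server = "CCC";
}

public DResponse retrieveD(DRequest req) {
    String method = "retrieveD()";
}
'''

print("\n".join(extract(SAMPLE)))
```

On this shortened sample it prints AAA retrieveA, BBB retrieveB, CCC and retrieveD, one per line. Because the sections are searched as whole strings, the order of the server, pac and method lines inside a function does not matter.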
Answer by ArtyLee

Since I am not well versed in awk, I tried Python for the job. Interestingly, it needs much more code than the awk version, but it delivers the requested output and is readable. It can still be improved in style and compactness. By the way, it seems you don't need the keyword "pac".

'''
parses the file input.txt in the same path
as this python file parser.py
and extracts keywords and values
'''
import pathlib
import re
kw = ["String server", " pac ", "String method"] #list of keywords

def extract_AAA(input_str):
    match_res = None
    keyword = "[A-Z]{3,}"  # three or more capitals, so FFFFF is not cut to FFF
    match_res = re.search(keyword, input_str)
    if match_res:
        start = match_res.span()[0]
        end = match_res.span()[1]
        #extract retrieveA...D, AAA...DDD
        return match_res.string[start:end]
    return match_res

def extract_retrieve(input_str):
    match_res = None
    keyword = "retrieve[A-Z]{1}"
    match_res = re.search(keyword, input_str)
    if match_res:
        start = match_res.span()[0]
        end = match_res.span()[1]
        #extract retrieveA...D, AAA...DDD
        return match_res.string[start:end]
    return match_res

def write_to_output_AAA(res_AAA):
    fout = open("output.txt", 'a')
    fout.write(res_AAA + " ")
    fout.close()

def write_to_output_retr(res_retr):
    fout = open("output.txt", 'a')
    fout.write(res_retr + "\n")
    fout.close()

def write_newline():
    fout = open("output.txt", 'a')
    fout.write("\n")
    fout.close()

path = pathlib.Path(__file__).parent.resolve()
print(path)
# read input.txt from the directory that contains this script
with open(path / "input.txt") as finput:
    lines = finput.readlines()

def find_a_keyword(n):
    j = 0
    for keyword in kw:
        j = lines[n].find(keyword)
        if j >= 0:
            return j
    return j 

i = 0
#for i in range(len(lines) - 1):
while(i < len(lines) - 2):
    res_AAA = None
    res_retr = None
    if find_a_keyword(i) >= 0:
        res_AAA = extract_AAA(lines[i])
        res_retr = extract_retrieve(lines[i])
        if res_AAA:
            write_to_output_AAA(res_AAA)
            if find_a_keyword(i+1) >= 0:
                res_retr = extract_retrieve(lines[i+1])
                if res_retr:
                    write_to_output_retr(res_retr)
                    #special case: a third line with AAA - ZZZ
                    if  ( (find_a_keyword(i+1) >= 0 ) 
                            and
                            (
                                (extract_retrieve(lines[i+2]) != None) 
                                or 
                                (extract_AAA(lines[i+2]) != None)
                            )
                        ):
                        i += 3
                    else:
                        i += 2
                    continue
                else:
                    write_newline()
                    i += 1
                    continue
                continue
            else:
                write_newline()
                i += 1
                continue
            continue
        elif res_retr: #find_a_keyword(i + 1) >= 0:
            res_AAA = extract_AAA(lines[i+1])
            if res_AAA:
                write_to_output_AAA(res_AAA)
                write_to_output_retr(res_retr)
                i += 2
                continue
            else:
                write_to_output_retr(res_retr)
                i += 1
                continue
        else:
            # keyword found but nothing to extract: advance, or the loop never ends
            i += 1
            continue
    else:
        i += 1