how to parse a file in bash, awk... for linking program and subprogram in a more efficient way

134 views Asked by At

I have the need to generate a tree view for a list of program and subprogram we actually use.

we have a working script right now (in ksh), but it's take a very long time to achieve (between 3 and 4 hours), and i'm pretty sure it can be done more efficiently/quickly

the actual working script is using two files : 1 : the list of program we actually using (2926 lines) 2 : the list of all programs and subprograms associated to this program (23922 lines)

for better comprehension, this is a representation of the file 2 :

A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X

the program A call the subprograms B, C, D and E The program B (called by A) call the subprogram F and G The program C (called by A) call the subprogram J K and L etc...

from this two files, we need to write a csv file who represent the tree view of all the programs we use in the first file.

This is the expected output from the previous representation:

A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;E

To achieve that, this is what we actually do :

part of the main script who call the csv_tree.sh :

#!/bin/ksh
cat program_list | awk -F "|" '{print($2)}' | sort | uniq >program_list.tmp
while read -r line; do
    begin=$(echo "$line" | awk -F "|" '{print $1}')
    sh csv_tree.sh "$begin" 2>/dev/null
done <$HOME/FISCALITE/OSCAR/FICHIER_IMPORT/FIC_PRG_TREE.tmp
#delete temporary file
rm program_list.tmp

csv_tree.sh :

#!/bin/ksh
#set -x
entry_file="T_SS_PRG_N1"
output_csv_file="${1}_tree.csv"
: >"$output_csv_file"
echo $1 >>"$output_csv_file"
count=0
rank=0
chaine=$(grep -E "^$1" $entry_file | awk '{$1=""}1')
#function loop
loop() {
    rank=$((rank + 1))
    for prg in $2; do
        count=$((count + 1))
        if [[ -n $(echo ${prg} | grep ${chaine}) ]]; then
            rank=1
        fi
        #   calculate the number of ";" to print in the csv file
        separator=$(for ((i = 0; i < rank; ++i)); do echo -n ";"; done)
        echo "${separator}${prg}" >>"$output_csv_file"
        if [ $(echo ${prg} | egrep "^(X|Y|Z)") != "" ] && [ $(grep "^${prg}" ${entry_file} | awk -F\| '{print $2}' | tr -d ' ') != "" ]; then
            echo "${separator};$(grep "^${prg}" ${entry_file} | awk -F\| '{print $2}')" >>"$output_csv_file"
        else
            loop "${prg}_$count" "$(grep -E "^$prg" $entry_file | awk '{$1=""}1')"
            rank=$((rank - 1))
        fi
    done
}

#begin
loop "$1" "$chaine"

I'm pretty sure we can achieve this with pure awk, but i'm not really into it :/

If someone have some tips or other way to do it better, i'm wide open ^^

3

There are 3 answers

2
Kaz On

In TXR Lisp:

$ cat file1
A
B
C
E
L
$ cat file2
A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X
$ txr deps.tl file1 file2
A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;D
;E
B
;F
;G
C
;J
;K
;;Z
;;;Y
;;;X
;L
;;M
;;N
E
L
;M
;N

Code looks like:

(defstruct mod ()
  name
  children)

(defvarl mh (hash))

(defun resolve-names (m)
  (when (and m.children
             (stringp (first m.children)))
    (set m.children
         (append-each ((c m.children))
           (iflet ((cm [mh c]))
             (list cm))))))

(defun print-tree (m : path (prefix ""))
  (put-line `@prefix@{m.name}`)
  (unless (member m path)
    (let ((nprefix `;@prefix`)
          (npath (cons m path)))
      (each ((c m.children))
        (print-tree c npath nprefix)))))

(tree-bind (file1 file2) *args*
  (let ((used (file-get-lines "file1"))
        (deps (file-get-lines "file2")))
    ;; populate object hash with each definition
    (each ((d deps))
      (match `@left|@right` d
        ;; look up or create new node for left side
        (let ((m (or [mh left]
                     (set [mh left] (new mod name left))))
              (ch (tok #/\S+/ right)))
          ;; ensure node exists for each child
          (each ((c ch))
            (or [mh c] (set [mh c] (new mod name c))))
          ;; append children to left hand node
          (set m.children (append m.children ch)))))
    ;; convert child names to objects
    (dohash (n m mh)
      (resolve-names m))
    ;; print tree view for each used object
    (each ((u used))
      (iflet ((m [mh u]))
        (print-tree m)))))

There is a check in print-tree against circular references, but not against duplicates (diamond references).

3
markp-fuso On

Assumptions:

  • we are to generate trees only for root programs (root program == a program that is not called by other programs)
  • OP has multiple root programs
  • OP mentions two files - File1 and File2 - but uses other names in the actual code
  • I'm guessing File2 == program_list
  • we're not shown the contents of File1 so I'm proceeding as if File1 does not exist; OP will likely need to modify the proposed code (below) based on the purpose of File1

We'll duplicate OP's sample data to include 2 root programs (A and A1):

$ cat program_list
A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X
A1|B1 C1 D1 E1
B1|F1 G1
C1|J1 K1 L1
E1|
K1|Z1
L1|M1 N1
Z1|Y1 X1

While other languages are going to have better performance, here's one approach using bash that should have better performance than OP's current code (if simply because it does not invoke any subshells):

$ cat build_tree.bash
#!/bin/bash

unset   pc c p r                                 # just in case script is sourced
declare -A c p r                                 # associative arrays for p(arent), c(hild) and r(oot) nodes

#######
# parse data into p[], c[] and parent_child ( pc_<parent>[] ) arrays

while read -r line
do
    IFS='[ |]' read -ra a <<< "${line}"          # a[0]==parent, a[1-n]==children

    parent="${a[0]}"

    p[$parent]=1
    declare -n pc=pc_"${parent}"                 # nameref to create pc_<parent>=( children ) arrays, eg: pc_A=( B C D E )
    pc=()

    for ((i=1 ; i<${#a[@]} ; i++))               # loop through child nodes
    do
        child="${a[$i]}"
        c[$child]=1                              # add as index in c[] array; c[] used in next loop to determine which nodes are roots
        pc+=( "$child" )                         # add child to pc == pc_<parent>[] array
    done
done < program_list                              # process File2

#######
# determine root nodes

for node in "${!p[@]}"                           # loop through p[] array indices
do
    [[ -z "${c[$node]}" ]] && r[$node]=1         # if not an index in the c[] array then add as index in the r[] array
done

unset child                                      # no longer needed so go ahead and release memory

#######
# recursive function
#
# for a given parent:
# - loop through array of children
# - print a child and if the child has children then make recursive call

recurse() {
    local parent pfx pc child

    parent="$1"                                  # eg: A, B, D1
    pfx="$2"                                     # eg: ';', ';;', ';;;'

    declare -n pc=pc_"${parent}"                 # nameref for pc_<parent>[] array

    for child in "${pc[@]}"                      # loop through child nodes
    do
        echo "${pfx}${child}"                    # print child node

        # if child is also a parent then make recursive call

        [[ -n "${p[$child]}" ]] && recurse "${child}" "${pfx}${sep}"
    done
}

#######
# main step: loop through root nodes and make top-level recurse() calls

sep=';'

for root in "${!r[@]}"
do
    echo "########### new tree:"
    echo "${root}"
    recurse "${root}" "${sep}"
done

#######
# release memory ... just in case script is sourced

for parent in "${!p[@]}"
do
    declare -n pc=pc_"${parent}"                 # nameref for pc_<parent>[] array
    unset pc
done

unset c p r

NOTES:

  • requires bash 4.2+ for nameref (declare -n) support

Taking for a test drive:

$ ./build_tree.bash
########### new tree:
A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;D
;E
########### new tree:
A1
;B1
;;F1
;;G1
;C1
;;J1
;;K1
;;;Z1
;;;;Y1
;;;;X1
;;L1
;;;M1
;;;N1
;D1
;E1
2
Daweo On

I would harness GNU AWK for this task following way, let file.txt content be

A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X

then

awk 'function getpath(name){return (name in arr)?sprintf("%-10s%-10s",getpath(arr[name]),name):sprintf("%-10s",name)}function printasrecord(path){patsplit(path,parts,/.{10}/);depth=length(parts);$0="";$depth=parts[depth];print}BEGIN{FS="[| ]";PROCINFO["sorted_in"]="@ind_str_asc";OFS=";"}/[|]./{for(i=2;i<=NF;i+=1){arr[$i]=$1}}END{for(i in arr){paths[getpath(i)]};for(j in paths){printasrecord(j)}}' file.txt

gives output

;B         
;;F         
;;G         
;C         
;;J         
;;K         
;;;Z         
;;;;X         
;;;;Y         
;;L         
;;;M         
;;;N         
;D         
;E

Explanation: I process file.txt and create array arr where key is child and value is its' parent. Then I use recent recursive function getpath to get all paths as keys of other array, which are presented in fixed-width format, I selected value of 10, however if you have longer than that adjust it accordingly. This allow to use alphabetical order sorting to get desired output. I inform GNU AWK that such ordering is needed by setting PROCINFO["sorted_in"] to @ind_str_asc (see Controlling Scanning for possible values). I traverse paths array in that manner, for each key (path) I parse it using patsplit into parts, then I set current line ($0) to empty string and set depthth field to value of depthth part of path, which is last part. Observe that output is slightly different from desired output presented, as there is not A (which has not parent) and order at same depth is dictated by alphabetical sorting, so X comes before Y.

Disclaimer: my solution assumes you are working with directed, acyclic, connected graph, where each element has at most 1 direct ancestor

(tested in GNU Awk 5.1.0)