Data Type Validation in UNIX


I need to validate a file against the expected data types. I have a file with the following data:

data.csv

Col1 | Col2 | Col3 | Col4
100  | XYZ  | 200  | 2020-07-11
200  | XYZ  | 500  | 2020-07-10
300  | XYZ  | 700  | 2020-07-09

I have another file containing the configuration:

Config_file.txt

Columns   = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date
Delimiter = |

I have to compare the configuration file with the data file and return a result.
For example, the configuration file declares Col1 as numeric. If Col1 contains any string value in the data file, the script should return Datatype Mismatch Found in Col1. I have tried with awk; for a single line it is easy to do by specifying the column positions, but I am not sure how to loop over the entire file column by column and check the data.

I have also tried supplying patterns to achieve this, but I am unable to validate the complete file. Any suggestions would be helpful.

awk -F "|" '$1 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && $4 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && length($5) == 10 {print}' data.csv

The goal is to compare each column of the data file (data.csv) against the Data_type in the config file (Config_file.txt) and check whether any column has a data type mismatch.

For example, consider the data below:

Col1 | Col2 | Col3 | Col4
100  | XYZ  | 200  | 2020-07-11
ABC  | XYZ  | 500  | 2020-07-10  -- Incorrect data: Col1 holds the string `ABC`, but the config file declares it numeric
300  | XYZ  | 700  | 2020-07-09
300  | XYZ  | 700  | 2020-07-09
300  | XYZ  | XYZ  | 2020-07-09 -- Incorrect Data
300  | 300  | 700  | 2020-07-09
300  | XYZ  | 700  | XYX        -- Incorrect Data  

The data types given in the config file are as follows:

Columns   = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date

The script should echo a result such as: Data Type Mismatch Found in Col1

There are 2 answers

James Brown answered:

Here is a skeleton solution in GNU awk. Lacking sample output, I improvised:

awk '
BEGIN {
    FS=" *= *"
}
function numeric(p) {                # testing for numeric
    if(p==(p+0))
        return 1
    else return 0
}
function string(p) {                 # cant really fail string test, right
    return 1
}
function date(p) {
    gsub(/-/," ",p)
    if(mktime(p " 0 0 0")>=0)
        return 1
    else return 0
}
NR==FNR{                             # process config file
    switch($1) {
        case "Columns":
            a["Columns"]=$NF;
            break
        case "Data_type":
            a["Data_type"]=$NF;
            break
        case "Delimiter":
            a["Delimiter"]=$NF;
    }
    if(a["Columns"] && a["Data_type"] &&  a["Delimiter"]) {
        split(a["Columns"],c,a["Delimiter"])
        split(a["Data_type"],d,a["Delimiter"])
        for(i in c) {                # b["Col1"]="numeric" etc.
            b[c[i]]=d[i]
        }
        FS=a["Delimiter"]            # switch to the data delimiter
    }
    next
}
FNR==1{                              # processing headers of data file
    for(i=1;i<=NF;i++) {
        h[i]=$i                      # h[1]="Col1" etc.
    }
}
{
    for(i=1;i<=NF;i++) {              # process all fields
        f=b[h[i]]                     # using indirect function calls check
        printf "%s%s",(@f($i)?$i:"FAIL"),(i==NF?ORS:FS)  # the data
    }
}' config <(tr -d \  <data)  # deleting space from your data as "|"!="  |  "

Sample output:

FAIL|Col2|FAIL|FAIL
100|XYZ|200|2020-07-11
200|XYZ|500|2020-07-10
300|XYZ|700|2020-07-09
FAIL|XYZ|FAIL|FAIL           # duplicated previous record and malformed it
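
A note on the date() test above: as far as I know, gawk's mktime() normalizes out-of-range values instead of rejecting them, so a value such as 2020-13-45 would probably pass. If stricter calendar checking is wanted, one option is to test the parts directly. The following is only a sketch in portable awk wrapped in a shell function; the names valid_date and is_date are made up for illustration and are not part of the answer above.

```shell
# Hypothetical helper (not from the answers): succeeds only for a real
# calendar date written as YYYY-MM-DD.
valid_date() {
    awk -v d="$1" '
    function is_date(s,    p, y, m, dmax) {
        # shape check first: YYYY-MM-DD, digits only
        if (s !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) return 0
        split(s, p, "-")
        y = p[1] + 0; m = p[2] + 0
        if (m < 1 || m > 12) return 0
        # days per month packed two digits each, January first
        dmax = substr("312831303130313130313031", 2 * m - 1, 2) + 0
        # February has 29 days in leap years
        if (m == 2 && ((y % 4 == 0 && y % 100 != 0) || y % 400 == 0)) dmax = 29
        return (p[3] + 0 >= 1 && p[3] + 0 <= dmax)
    }
    BEGIN { exit !is_date(d) }'
}
```

For example, `valid_date 2020-02-29 && echo ok` should print ok, while `valid_date 2020-13-45` should return a non-zero status. The is_date function could be dropped into the script above in place of date(), since it takes the same single argument.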
Ed Morton answered:

$ cat tst.awk
NR == FNR {
    gsub(/^[[:space:]]+|[[:space:]]+$/,"")
    tag = val = $0
    sub(/[[:space:]]*=.*/,"",tag)
    sub(/[^=]+=[[:space:]]*/,"",val)
    cfg_tag2val[tag] = val
    next
}

FNR == 1 {
    FS = cfg_tag2val["Delimiter"]
    $0 = $0
    reqd_NF = split(cfg_tag2val["Columns"],reqd_names)
    split(cfg_tag2val["Data_type"],reqd_types)
}

NF != reqd_NF {
    printf "%s: Error: line %d NF (%d) != required NF (%d)\n", FILENAME, FNR, NF, reqd_NF | "cat>&2"
    got_errors = 1
}

FNR == 1 {
    for ( i=1; i<=NF; i++ ) {
        reqd_name = reqd_names[i]
        name = $i
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",name)
        if ( name != reqd_name ) {
            printf "%s: Error: line %d col %d name (%s) != required col name (%s)\n", FILENAME, FNR, i, name, reqd_name | "cat>&2"
            got_errors = 1
        }
    }
}

FNR > 1 {
    for ( i=1; i<=NF; i++ ) {
        reqd_type = reqd_types[i]
        if ( reqd_type != "string" ) {
            value = $i
            gsub(/^[[:space:]]+|[[:space:]]+$/,"",value)
            type = val2type(value)
            if ( type != reqd_type ) {
                printf "%s: Error: line %d field %d (%s) type (%s) != required field type (%s)\n", FILENAME, FNR, i, value, type, reqd_type | "cat>&2"
                got_errors = 1
            }
        }
    }
}

END { exit got_errors }

function val2type(val,  type) {
    if      ( val == val+0 )                     { type = "numeric" }
    else if ( val ~ /^[0-9]{4}(-[0-9]{2}){2}$/ ) { type = "date" }
    else                                         { type = "string" }
    return type
}


$ awk -f tst.awk config.txt data.csv
data.csv: Error: line 3 field 1 (ABC) type (string) != required field type (numeric)
data.csv: Error: line 6 field 3 (XYZ) type (string) != required field type (numeric)
data.csv: Error: line 8 field 4 (XYX) type (string) != required field type (date)
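
The question asked for one message per offending column (Data Type Mismatch Found in Col1) rather than one per line. A possible way to get that, sketched here under some assumptions, is to remember which columns failed and report them once in the END block. The function name report_mismatches and the helpers trim and type_ok are invented for this illustration; the numeric pattern is the one from the question, and the date check only tests the YYYY-MM-DD shape.

```shell
# Hypothetical wrapper (not from the answers).
# Usage: report_mismatches CONFIG_FILE DATA_FILE
report_mismatches() {
    awk '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    function type_ok(v, t) {
        if (t == "numeric") return (v ~ /^[+-]?[0-9]+([.][0-9]+)?$/)
        if (t == "date")    return (v ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/)
        return 1                       # "string" accepts anything
    }
    NR == FNR {                        # first file: "tag = value" config lines
        eq = index($0, "=")
        if (eq) cfg[trim(substr($0, 1, eq - 1))] = trim(substr($0, eq + 1))
        next
    }
    FNR == 1 {                         # data header: load names and types
        FS = cfg["Delimiter"]          # takes effect from the next record on
        n = split(cfg["Columns"], name, FS)
        split(cfg["Data_type"], type, FS)
        next
    }
    {
        for (i = 1; i <= n; i++)       # remember every column that fails
            if (!type_ok(trim($i), type[i])) bad[i] = 1
    }
    END {
        for (i = 1; i <= n; i++)
            if (bad[i]) { print "Data Type Mismatch Found in " name[i]; rc = 1 }
        exit rc + 0
    }' "$1" "$2"
}
```

It would be called as `report_mismatches Config_file.txt data.csv`. Like tst.awk, it exits non-zero when anything fails, so it can be used in a shell test such as `if report_mismatches Config_file.txt data.csv; then echo "all columns OK"; fi`. Note that the trailing `--` comments in the sample data would themselves count as part of the last field, so they need to be removed from a real input file.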