How to get selective data from a file in TCL?


I am trying to extract selected data from a file based on certain keywords using Tcl. For example, I have a file like this:

...
...
..
... 
data_start
30 abc1 xyz 
90 abc2 xyz 
214 abc3 xyz
data_end
...
...
...

How do I capture only the 30, 90 and 214 between "data_start" and "data_end"? Here is what I have so far (I am a Tcl newbie):

proc get_data_value{ data_file } {

set lindex 0
set fp [open $data_file r]
set filecontent [read $fp]



while {[gets $filecontent line] >= 0} {

if { [string match "data_start" ]} {

    #Capture only the first number? 
    #Use regex? or something else? 

        if { [string match "data_end" ] } {

            break
        } else {

            ##Do Nothing?
        }
    }
 }
close $fp
}

There are 4 answers

Dinesh (best answer)

If your file is small, you can use the read command to slurp the whole file into a variable and then apply regexp to extract the required information.

input.txt

data_start
30 abc1 xyz 
90 abc2 xyz 
214 abc3 xyz
data_end
data_start
130 abc1 xyz 
190 abc2 xyz 
1214 abc3 xyz
data_end

extractNumbers.tcl

set fp [open input.txt r]
set data [read $fp]
close $fp
set result [regexp -inline -all {data_start.*?\n(\d+).*?\n(\d+).*?\n(\d+).*?data_end} $data]
foreach {whole_match number1 number2 number3} $result {
    puts "$number1, $number2, $number3"
}

Output:

30, 90, 214
130, 190, 1214
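
The pattern above assumes exactly three data lines in every block. If the number of lines can vary, one possible variation is to capture each block first and then pull the leading number from every line inside it. This is only a sketch: the sample data is inlined so the snippet runs on its own, and the names results, block and numbers are illustrative, not part of the original script.

```tcl
# Sample data inlined for illustration; in practice it would come from the read command.
set data "header\ndata_start\n30 abc1 xyz \n90 abc2 xyz \n214 abc3 xyz\ndata_end\nnoise\ndata_start\n130 abc1 xyz \n190 abc2 xyz \ndata_end\n"

set results {}
# Step 1: capture the text between each data_start/data_end pair (non-greedy).
foreach {whole block} [regexp -inline -all {data_start\n(.*?)data_end} $data] {
    # Step 2: -line lets ^ match at the start of every line in the block.
    set numbers [regexp -inline -all -line {^\d+} $block]
    lappend results $numbers
    puts [join $numbers ", "]
}
```

Each block is handled independently, so blocks of different lengths do not need different patterns.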

Update:

Reading a large file into a single variable can crash the program, depending on how much memory your PC has. When I tried to read an 890 MB file with the read command on a Windows 7 laptop with 8 GB of RAM, I got the error message "unable to realloc 531631112 bytes" and tclsh crashed. After some benchmarking, I found that it was able to read a file of 500,015,901 bytes. But the program will then consume 500 MB of memory, since it has to hold the data.

Also, holding this much data in one variable is inefficient when extracting information with regexp. In such cases, it is better to read the content line by line.
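
A minimal sketch of the line-by-line approach, assuming the markers always start their own line and each data line begins with the number of interest. The proc name extract_numbers and its variables are hypothetical, chosen only for this example.

```tcl
# Hypothetical helper: returns the leading number of every line found
# between data_start and data_end, reading one line at a time so the
# whole file never has to fit in memory.
proc extract_numbers {path} {
    set numbers {}
    set inside 0
    set fp [open $path r]
    while {[gets $fp line] >= 0} {
        if {[string match "data_start*" $line]} {
            set inside 1
        } elseif {[string match "data_end*" $line]} {
            set inside 0
        } elseif {$inside && [regexp {^\s*(\d+)} $line -> num]} {
            # Only lines inside a block contribute a number.
            lappend numbers $num
        }
    }
    close $fp
    return $numbers
}
```

Calling it as puts [extract_numbers input.txt] on the two-block input.txt shown earlier would be expected to list all six numbers in file order.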

Read more about this here.

wolfhammer

Load all the data from the file into a variable. Define the start and end tokens and search for their positions. Process each item between them line by line. Tcl treats a whitespace-separated string as a list, so we can process the fields of a line with foreach {a b c} $line {...}.

tcl:

set data {...
...
..
... 
data_start
30 abc1 xyz 
90 abc2 xyz 
214 abc3 xyz
data_end
...
...
...}


set i 0
set start_str "data_start"
set start_len [string length $start_str]
set end_str "data_end"
set end_len [string length $end_str]

while {[set start [string first $start_str $data $i]] != -1} {
    # Skip past the start token
    set start [expr {$start + $start_len}]
    set end [string first $end_str $data $start]
    if {$end == -1} {
        # No closing token: stop instead of looping forever
        break
    }
    set item [string range $data $start [expr {$end - 1}]]
    set lines [split $item "\n"]

    foreach line $lines {
        foreach {a b c} $line {
            puts "a=$a, b=$b, c=$c"
        }
    }

    # Resume searching after the end token
    set i [expr {$end + $end_len}]
}

output:

a=30, b=abc1, c=xyz
a=90, b=abc2, c=xyz
a=214, b=abc3, c=xyz
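
One caveat: foreach treats $line as a Tcl list, which raises an error if a line is not a well-formed list (for example, a word that starts with an unmatched brace). A hedged alternative, assuming the fields are always separated by plain whitespace, is to pull out the non-whitespace runs explicitly. The sample line below is made up to trigger the problem.

```tcl
# A made-up line that is not a well-formed Tcl list (unmatched open brace).
set line "30 \{abc1 xyz "

# Extract runs of non-whitespace instead of relying on list parsing.
set fields [regexp -all -inline {\S+} $line]
lassign $fields a b c
puts "a=$a, b=$b, c=$c"
```

lassign requires Tcl 8.5 or later; on older versions foreach {a b c} $fields break does the same job.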

glenn jackman

I'd write that as

set fid [open $data_file]
set p 0
set nums {}
while {[gets $fid line] != -1} {
    switch -regexp -- $line { 
        {^data_end}   {set p 0} 
        {^data_start} {set p 1} 
        default {
            if {$p && [regexp {^(\d+)\M} $line -> num]} {
                lappend nums $num
            }
        }
    }
}
close $fid
puts $nums

or, even

set nums [exec sed -rn {/data_start/,/data_end/ {/^([[:digit:]]+).*/ s//\1/p}} $data_file]
puts $nums

Mikhail T.

My favorite method would be to declare procs for each of the acceptable tokens and utilize the unknown mechanism to quietly ignore the unacceptable ones.

proc 30 args {
    ... handle 30 $args
}

proc 90 args {
    ... process 90 $args
}

rename unknown original_unknown
proc unknown args {
    # This space was deliberately left blank
}

source datafile.txt
rename original_unknown unknown

You'll be using Tcl's built-in parsing, which should be considerably faster. It also looks better in my opinion.

You can also put the line-handling logic into your unknown-procedure entirely:

rename unknown original_unknown
proc unknown {first args} {
    process $first $args
}
source input.txt
rename original_unknown unknown

Either way, the trick is that Tcl's own parser (implemented in C) will be breaking up the input lines into tokens for you -- so you don't have to implement the parsing in Tcl yourself.

This does not always work -- if, for example, the input is using multi-line syntax (without { and }) or if the tokens are separated with something other than white space. But in your case it should do nicely.
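
Another thing to keep in mind is that source runs the file as Tcl code, so a line whose first word happens to be a real command would actually be executed. One way to limit that, sketched here under the assumption that the data is plain whitespace-separated lines, is to evaluate the data in a safe interpreter whose unknown is aliased to a handler in the main interpreter. The names handle_line, nums and inside are illustrative only, and the sample data is inlined so the snippet runs on its own.

```tcl
# Sample data inlined for illustration.
set data "...\n...\ndata_start\n30 abc1 xyz \n90 abc2 xyz \n214 abc3 xyz\ndata_end\n...\n"

set nums {}
set inside 0

# Hypothetical handler: receives the first word of each unrecognized line
# plus the remaining words, exactly as unknown would.
proc handle_line {first args} {
    global nums inside
    switch -- $first {
        data_start { set inside 1 }
        data_end   { set inside 0 }
        default {
            if {$inside && [string is digit -strict $first]} {
                lappend nums $first
            }
        }
    }
    return
}

# A safe interpreter hides file, exec and similar commands from the data.
set child [interp create -safe]
$child alias unknown handle_line
$child eval $data
interp delete $child

puts $nums
```

Lines that begin with a command the safe interpreter still provides (set, if and so on) would be run as commands rather than passed to the handler, so this remains suitable only for trusted or well-understood input.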