Sometimes long running ssh commands stop printing to stdout

1.9k views Asked by At

I have been using Perl::Net::SSH to automate running some scripts on my remote boxes. However, some of these scripts take a really long time to complete (hour or two) and sometimes, I stop getting data from them, without actually losing the connection.

Here's the code I'm using:

sub run_regression_tests {
    for(my $i = 0; $i < @servers; $i++){
        my $inner = $users[$i];
        foreach(@$inner){
            my $user = $_;
            my $server = $servers[$i];
            
            my $outFile;
            open($outFile, ">" . $outputDir . $user . '@' . $server . ".log.txt");
            print $outFile "Opening connection to $user at $server on " . localtime() . "\n\n";
            close($outFile);
            
            my $pid = $pm->start and next;
            
                print "Connecting to $user\@$server...\n";
                
                my $hasWentToDownloadYet = 0;
                my $ssh = Net::SSH::Perl->new($server, %sshParams);
                $ssh->login($user, $password);              
                
                $ssh->register_handler("stdout", sub {
                    my($channel, $buffer) = @_;             
                    my $outFile;
                    open($outFile, ">>", $outputDir . $user . '@' . $server . ".log.txt");                  
                    print $outFile $buffer->bytes;              
                    close($outFile);                
                    
                    my @lines = split("\n", $buffer->bytes);
                    foreach(@lines){
                        if($_ =~ m/REGRESSION TEST IS COMPLETE/){
                            $ssh->_disconnect();
                            
                            if(!$hasWentToDownloadYet){
                                $hasWentToDownloadYet = 1;
                                print "Caught exit signal.\n";
                                print("Regression tests for ${user}\@${server} finised.\n");
                                download_regression_results($user, $server);
                                $pm->finish;
                            }
                        }
                    }
                    
                });
                $ssh->register_handler("stderr", sub {
                    my($channel, $buffer) = @_;             
                    my $outFile;
                    open($outFile, ">>", $outputDir . $user . '@' . $server . ".log.txt");
                    
                    print $outFile $buffer->bytes;              
                    
                    close($outFile);                
                });
                if($debug){
                    $ssh->cmd('tail -fn 40 /GDS/gds/gdstest/t-gds-master/bin/comp.reg');
                }else{
                    my ($stdout, $stderr, $exit) = $ssh->cmd('. ./.profile && cleanall && my.comp.reg');
                    if(!$exit){
                        print "SSH connection failed for ${user}\@${server} finised.\n";
                    }
                }
                #$ssh->cmd('. ./.profile');
                
                if(!$hasWentToDownloadYet){
                    $hasWentToDownloadYet = 1;
                    print("Regression tests for ${user}\@${server} finised.\n");
                    download_regression_results($user, $server);
                }
                
            $pm->finish;        
        }
    }
    sleep(1);
    print "\n\n\nAll tests started. Tests typically take 1 hour to complete.\n";
    print "If they take significantly less time, there could be an error.\n";
    print "\n\nNo output will be printed until all commands have executed and finished.\n";
    print "If you wish to watch the progress tail -f one of the logs this script produces.\n Example:\n\t" . 'tail -f ./[email protected]' . "\n";
    $pm->wait_all_children;
    print "\n\nAll Tests are Finished. \n";
}

And here is my %sshParams:

my %sshParams = (
    protocol => '2',
    port => '22',
    options => [
        "TCPKeepAlive yes",
        "ConenctTimeout 10",
        "BatchMode yes"
    ]
);

Sometimes randomly one of the long running commands just halts printing/firing the stdout or stderr events and never exits. The ssh connection doesn't die (as far as I'm aware) because the $ssh->cmd is still blocking.

Any idea how to correct this behaviour?

2

There are 2 answers

3
salva On

It fails probably due to the way you look into the output for the REGRESSION TEST IS COMPLETE mark. It may be split over two different SSH packets and so your callback will never found it.

Better, use a remote command that ends when it is done as this one-liner:

perl -pe 'BEGIN {$p = open STDIN, "my.comp.reg |" or die $!}; kill TERM => -$p if /REGRESSION TEST IS COMPLETE/}'

Otherwise, you are closing the remote connection but not stopping the remote process that will stay alive.

Besides that, you should try using Net::OpenSSH or Net::OpenSSH::Parallel instead of Net::SSH::Perl:

use Net::OpenSSH::Parallel;

my $pssh = Net::OpenSSH::Parallel->new;

for my $i (0..$#server) {
    my $server = $server[$i];
    for my $user (@{$users[$ix]}) {
        $pssh->add_host("$user\@$server", password => $password);
    }
}

if ($debug) {
    $pssh->all(cmd => { stdout_file => "$outputDir%USER%\@%HOST%.log.txt",
                        stderr_to_stdout => 1 },
               'fail -fn 40 /GDS/gds/gdstest/t-gds-master/bin/comp.reg');
}
else {
    $pssh->all(cmd => { stdout_file => "$outputDir%USER%\@%HOST%.log.txt",
                        stderr_to_stdout => 1 },
               '. ./.profile && cleanall && my.comp.reg');
}

$pssh->all(scp_get => $remote_regression_results_path, "regression_results/%USER%\@%HOST%/");

$pssh->run;
2
Neil Neely On

In your %sshParams hash, you may need to add "TCPKeepAlive yes" to your options:

$sshParams{'options'} = ["BatchMode yes", "TCPKeepAlive yes"];

Those options might or might not be right for you, but the TCPKeepAlive is something I would recommend setting for any long running SSH connection. If you have any kind of stateful firewall in your path it could drop the state if it hasn't passed traffic over the connection for a long period of time.