Perl Inline::C: Are Inline_Stack_Vars etc. needed to avoid memory leaks (biosequence character matching)

My question relates to working Inline::C code: is it necessary to use the Inline stack macros (such as Inline_Stack_Vars) to pass variables in and out, or is it appropriate in this context to just modify a variable in place?

For display of biosequence data, I need to show just differences between two aligned strings; e.g. given these two strings:

    ATCAGAAA--GACATGGGCCAAAGATTAA-CAGTGGCCATTGACAGGA--
    --CCCCAACTGACAGGGGGCAAAGATTAA-CAGTGGCCATTG---GGA--

I want to get this (the matching characters in the second string replaced with '.'):

    --.CCC..CT....G...G..........-............---...--

I have a lot of sequences (millions of Illumina reads), so I have turned to Inline::C for the character matching. The following inlined code seems to work just fine (it changes the second argument to the add_matchchars function in place):

#!/usr/bin/perl
use Inline C;

my($seq1,$seq2) = qw/ ATCAGAAA--GACATGGGCCAAAGATTAA-CAGTGGCCATTGACAGGA--
                      --CCCCAACTGACAGGGGGCAAAGATTAA-CAGTGGCCATTG---GGA-- /;

print $seq1,"\n";
print $seq2,"\n";
add_matchchars($seq1,$seq2);
print $seq2,"\n";

__END__

__C__

void add_matchchars(char *seq1, char *seq2) {
    int seq1char;
    int seq2char;
    while(seq1char = *seq1++ , seq2char = *seq2++) {
        if (seq1char == seq2char) {
            *seq2--;
            if (seq1char != '-') {
                *seq2 = '.';
            }
            *seq2++;
        }
        //printf("%c-%c\n",seq1char,seq2char);
    } 
 // printf("%s\n%s\n",seq1,seq2);
}

But 1) is it reasonably efficient (is there a cleverer/better way)? and 2) will it leak memory?

Answer by amon (best answer)

You should not rely on the char * of a scalar being modifiable, or even being the original buffer of the scalar. Instead, return a new string.
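
As a minimal sketch of the "return a new string" idea in plain C, outside of Perl's API: the helper `mask_matches` below is hypothetical (its name and signature are my own, not part of Inline::C). It copies the second sequence into a fresh buffer and masks the matches there, so neither input is touched. It assumes the caller passes explicit lengths and that gap characters ('-') shared by both sequences should be kept, as in the question's expected output.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, NUL-terminated copy of
 * the first min(ref_len, seq_len) bytes of seq, in which every byte that
 * equals the corresponding byte of ref (and is not a gap '-') is replaced
 * by '.'. Neither input is modified. The caller must free() the result.
 * Returns NULL if allocation fails. */
char *mask_matches(const char *ref, size_t ref_len,
                   const char *seq, size_t seq_len)
{
    size_t n = ref_len < seq_len ? ref_len : seq_len;
    char *out;
    size_t i;

    assert(ref != NULL && seq != NULL);

    out = malloc(n + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, seq, n);   /* start from an unmodified copy of seq */
    out[n] = '\0';

    for (i = 0; i < n; ++i) {
        if (ref[i] == seq[i] && ref[i] != '-')
            out[i] = '.';
    }
    return out;
}
```

Inside an Inline::C function the fresh buffer would normally be a new SV rather than a malloc'd block, so that Perl manages its lifetime.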

The Inline_Stack_Vars macro is only useful when dealing with a variable number of arguments or multiple return values. Neither is the case here.

Your code does not currently leak memory (you don't allocate any memory inside your C function), but there are some issues, including style, possible segfaults, and the fact that Perl strings may contain NULs inside the string. To avoid the segfaults, correct while(seq1char = *seq1++ , seq2char = *seq2++) to while((seq1char = *seq1++) && (seq2char = *seq2++)): with the comma operator only the second assignment is tested, so the loop can read past the end of seq1 when seq2 is longer.
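
To illustrate the loop-condition point: C's comma operator evaluates both operands but yields only the right-hand one. The two functions below (hypothetical names, for illustration only) evaluate each form of the condition for a single pair of characters and report whether the loop would keep going.

```c
/* Original form: the comma operator discards the left result, so only c2
 * decides whether the loop would continue. */
int continues_with_comma(int c1, int c2)
{
    int seq1char, seq2char;
    int result = (seq1char = c1, seq2char = c2);
    (void)seq1char;   /* assigned only to mirror the original expression */
    (void)seq2char;
    return result != 0;
}

/* Corrected form: the loop continues only while both characters are non-NUL. */
int continues_with_and(int c1, int c2)
{
    int seq1char, seq2char = 0;   /* seq2char stays 0 if && short-circuits */
    int result = (seq1char = c1) && (seq2char = c2);
    (void)seq1char;
    (void)seq2char;
    return result;
}
```

With a NUL from the first string and 'A' from the second, the comma form still reports "continue", which is the case where the original loop would walk off the end of `seq1`.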

I think it is generally a better idea to have your C function take scalars directly. Roughly:

SV *add_matchchars(SV *seq1_sv, SV *seq2_sv) {
    STRLEN len1, len2;
    char *seq1 = SvPVbyte(seq1_sv, len1);
    char *seq2 = SvPVbyte(seq2_sv, len2);
    STRLEN min_len = len1 < len2 ? len1 : len2;
    SV *seq3_sv = newSVpvn(seq2, min_len);
    char *seq3;
    STRLEN i;

    seq3 = SvPVX(seq3_sv);
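    for (i = 0; i < min_len; ++i) {
        /* Mask matches, but keep shared gap characters ('-') as the question's output does. */
        if (seq1[i] == seq2[i] && seq1[i] != '-')
            seq3[i] = '.';
    }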

    return seq3_sv;
}
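
The explicit len1/len2 in the function above matter because a Perl string's length is stored separately from its bytes, so a NUL can appear in the middle of the data. A small plain-C illustration (not using the Perl API; the buffer contents and names are made up) of how a terminator-based scan undercounts in that case:

```c
#include <stddef.h>
#include <string.h>

/* Made-up 5-byte payload with a NUL in the middle, as a Perl scalar may hold. */
static const char payload[] = { 'A', 'C', '\0', 'G', 'T' };
static const size_t payload_len = sizeof payload;   /* the true length: 5 */

/* Length as a terminator-based loop (or strlen) would see it. */
size_t terminator_based_len(void)
{
    return strlen(payload);   /* stops at the embedded NUL */
}
```

Here the terminator-based length is 2 while the real length is 5, so a `while (*seq++)` style loop would silently skip the last three bytes.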