Valid characters in a String

2.1k views Asked by At

I am given a string and have to return False if there is one or more invalid characters, otherwise True. The caveat is that I can only built-in functions and str operations (for example: in, +, indexing, len) and recursion. What I have so far is not working:

def is_valid_sequence(dna):
""" (str) -> bool

Return True if and only if the DNA sequence is valid
(A, T, C, and G)
:param dna: string sequence
:return: true or false
>>> is_valid_sequence('ATTAC')
True
>>> is_valid_sequence('FBSSS')
False
"""
valid_seq = 0

if len(dna) == 1 and dna in ['A', 'T', 'C', 'G']:
        valid_seq += 1
else:
    return is_valid_sequence(dna[1:])

Obviously, this code doesn't work because of recursion and adding 1 to the valid_seq variable just gets wiped after the next recursive iteration.

3

There are 3 answers

1
Mulan On BEST ANSWER

For DNA sequences of a small size (around a thousand characters), this is a practical implementation

def is_valid_sequence (dna):
  # base case
  if len (dna) == 0:
    return True
  # check if first character is valid
  elif dna [0] not in ['A', 'T', 'C', 'G']:
    return False
  # otherwise, recurse on the rest of the characters
  else:
    return is_valid_sequence (dna [1:])

print (is_valid_sequence ('AATGCGA'))  # True
print (is_valid_sequence ('AATXGCGA')) # False

A word of caution

Be careful with recursion in python tho – a long dna string will cause a stack overflow. Attempting to validate even a sequence this "large" would fail

GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA

You can easily avoid this by implementing is_valid_sequence using a Clojure-style loop/recur mechanism for constant-space recursion

def recur (*values):
  return (recur, values)

def loop (f):
  acc = f ()
  while type (acc) is tuple and acc [0] is recur:
    acc = f (*acc [1])
  return acc

def is_valid_sequence (dna):
  # stack-safe loop/recur implementation
  # initialize loop state variable `s` to value of `dna`
  return loop (lambda s = dna:
    # base case
    True if len (s) == 0 else
    # check if first character valid
    False if s [0] not in ['A', 'T', 'C', 'G'] else
    # else recurse on the rest of the characters
    recur (s [1:]))

# does not overflow stack
print (is_valid_sequence ('GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA'))
# True

Continued improvements

The loop/recur implementation exposes an additional opportunity to tune the performance of our function. Slicing the string as we have done with dna[0] and dna[1:] creates new strings in memory; this was only necessary because of the recursion API used in the first function we wrote

The loop/recur interface allows us to use whatever data type is suitable for computing our output – in this case, a simple integer index will do. Lexical scoping takes care of the rest – dna is accessible within our lambda and does not need to do dna[1:] slicing which will save a lot of time/space for large inputs

def is_valid_sequence (dna):
  # only create this array once
  valid_chars = ['A', 'T', 'C', 'G']
  # initialize loop state variable `i` with 0
  return loop (lambda i = 0:
    True if len (dna) == i else
    False if dna [i] not in valid_chars else
    recur (i + 1))

Python and the Unruly Lambda

Notice how we were forced to use a pure expression inside the lambda instead of the traditional if/elif/else statement syntax – this is not a problem for simple programs, but more complex programs might be difficult to express this way in python.

This works identically to the program above but uses a plain old function instead of a lambda

def is_valid_sequence (dna):
  valid_chars = ['A', 'T', 'C', 'G']
  # plain old function; replaces lambda
  def myloop (i = 0):
    if len (dna) == 0:
      return True
    elif dna [i] not in valid_chars:
      return False
    else:
      return recur (i + 1)
  # use plain old function instead of lambda
  return loop (myloop)
0
Miket25 On

As others have said, your recursive function does not return when it hits the end of the recursion. You can say something like:

if len(dna) == 1 and dna in ['A','T','G','C']:
    return True

That is, if you know your dna string is always greater than or equal to one in length. All together you might end up with something like:

def is_vaild_sequence(dna):
    if dna[0] not in ['A','T','G','C']:
        return False
    if len(dna) == 1:
        return True
    return is_vaild_sequence(dna[1:])

Here, the first check determines if the dna's first character is valid, followed by the recursive structure to check the whole sequence.

Ideally, if this can be solved without the constraint of recursion, take it. An iterative approach is far better i.e.

for i in range(0,len(dna)):
    if dna[i] not in ['A','T','G','C']:
        return False
return True
0
Gwang-Jin Kim On

The most pythonic solution is:

def is_valid_dna(seq):        
    for x in seq:            # don't use index here
        if x not in "ACGT":  # simplified check
            return False
    return True

Interesting would be, whether to use regex would be performance-wise faster:

import re

def is_valid_dna(seq):
    pattern = re.compile("^[ACGT]*$")
    return pattern.match(seq) is not None

I guess the negative match search is faster:

import re

def is_valid_dna(seq): 
     pattern = re.compile("[^ACGT]") 
     return pattern.search(seq) is None