How to extract start and end sites based on capital letter in a sequence?

98 views Asked by At

I would like to extract the start and end site information that is in capital letter. By counting the sequence length using the code below is not able to return the sequence information accurately. The P-match result that I need to process given the start site is based on the first alphabet but the start site that I actually need is the first capital letter that occur in every site. How can I retrieve the accurate start and end site? Can anyone help me?

Text file A.txt

Scanning sequence ID:   BEST1_HUMAN

          150 (-)  1.000  0.997  GGAAAggccc                                   R05891
          354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

          365 (+)  1.000  1.000  gggacCTACA                                   R05884
           789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F

Expected output:

Sequence ID start end

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 370
4F2_HUMAN   792 797

File B.txt

Scanning sequence ID: hg17_ct_ER_ER_142

              512 (-)  0.988  0.981  taTAGCTaagc                        Evi-1          R06227
V$EVI1_05

Scanning sequence ID: hg17_ct_ER_ER_1

              213 (-)  1.000  0.989  aggggcaggGGTCA                     COUP-TF, HNF-4 R07445
V$COUP_01

Expected output:

hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227

Example code:

output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith('              '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                         (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()
1

There are 1 answers

5
Freek Wiekmeijer On BEST ANSWER

This code will get the label and start values:

import re

p = "Scanning sequence ID\:\s*(?P<label>[A-Z0-9]+\_[A-Z0-9]+).*?(?P<start_value>\d+)"

with open("A.txt", "r") as f:
    s = f.read()

re.findall(p,s, re.DOTALL)

Sample output:

[('BEST1_HUMAN', '150'), ('4F2_HUMAN', '365')]

Then there's the calculation of the second number ("end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I would conclude thatyou want to increment the value start with the string length of the second last column (GGAAAggccc etc.).

I can capture that column as well, using the following modified regexp:

p = "Scanning sequence ID\:\s*(?P<label>[A-Z0-9]+\_[A-Z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
re.findall(p,s, re.DOTALL)

Sample output:

[('BEST1_HUMAN', '150', 'GGAAAggccc'), ('4F2_HUMAN', '365', 'gggacCTACA')]

Now we want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and go to an iteration:

import re
with open("A.txt", "r") as f:
    lines = f.readlines()

label_ptrn = re.compile("^Scanning sequence ID\\:\\s*(?P<label>[A-Z0-9]+\\_[A-Z0-9]+)$")
line_ptrn = re.compile("^\s+(?P<start_value>\\d+)\\s+\\S+\\s+\\S+\\s+\\S+\\s+(?P<sequence>\\S+).*$")
inner_ptrn = re.compile("[A-Z]+")

all_matches = []
for line in lines:
    m = label_ptrn.match(line)
    if m:
        label = m.groupdict().get("label")
        continue
    m = line_ptrn.match(line)
    if m:
        start = m.groupdict().get("start_value")
        sequence = m.groupdict().get("sequence")
        mi = inner_ptrn.search(sequence)
        if not mi:
            continue
        span = mi.span()
        all_matches.append((label, int(start)+span[0], int(start)+span[1]))

Then you can print the matches as follows:

with open("output.bed", "w+b") as f:
    for m in all_matches:
        f.write('%s\t%i\t%i\n' % m)

Sample output:

BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN   370 375
4F2_HUMAN   792 797

I think the problem is solved ;)