I would like to extract the start and end positions of the part of each site that is written in capital letters. Counting the sequence length with the code below does not return the positions accurately: the start position given in the P-match result refers to the first letter of the sequence, but the start I actually need is the first capital letter in each site. For example, gtgtAGACAtt is reported at 354 and has four leading lowercase letters, so the site I need starts at 358 and ends at 363. How can I retrieve the correct start and end positions? Can anyone help me?
Text file A.txt
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
Expected output:
Sequence ID start end
BEST1_HUMAN 150 155
BEST1_HUMAN 358 363
4F2_HUMAN 370 375
4F2_HUMAN 792 797
File B.txt
Scanning sequence ID: hg17_ct_ER_ER_142
512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227
V$EVI1_05
Scanning sequence ID: hg17_ct_ER_ER_1
213 (-) 1.000 0.989 aggggcaggGGTCA COUP-TF, HNF-4 R07445
V$COUP_01
Expected output:
hg17_ct_ER_ER_142 514 519
hg17_ct_ER_ER_1 222 227
Example code:
output_file = open('output.bed','w')
with open('A.txt') as f:
    text = f.read()
    chunks = text.split('Scanning sequence ID:')
    for chunk in chunks:
        if chunk:
            lines = chunk.split('\n')
            sequence_id = lines[0].strip()
            for line in lines:
                if line.startswith(' '):
                    start = int(line.split()[0].strip())
                    sequence = line.split()[-2].strip()
                    stop = start + len(sequence)
                    #print sequence_id, start, stop
                    seq='%s\t%i\t%i\n' % \
                        (sequence_id,start,stop)
                    output_file.write(seq)
output_file.close()
This code will get the label and start values:
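A minimal sketch of such an extraction, assuming a regular expression of roughly this shape. The pattern and the variable names `text`, `pattern` and `pairs` are illustrative, and the sample text is copied from A.txt in the question:

```python
import re

# Sample input copied from A.txt in the question.
text = """Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
"""

# Assumed pattern: the label after "Scanning sequence ID:", followed by
# the leading integer of the line directly below it.
pattern = re.compile(r"Scanning sequence ID:\s*(\S+)\s*\n\s*(\d+)")

# re.findall returns one (label, start) tuple per header line.
pairs = [(label, int(start)) for label, start in pattern.findall(text)]
print(pairs)  # [('BEST1_HUMAN', 150), ('4F2_HUMAN', 365)]
```

Note that this only picks up the first data line under each label.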
Sample output:
Then there is the calculation of the second number (the "end site"). The code in the opening post has sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I conclude that you want to increment the value of start by the string length of the second-to-last column (GGAAAggccc etc.). I can capture that column as well, using the following modified regexp:
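One way such a modified pattern could look is sketched below. It assumes the site is the fifth whitespace-separated column (after the position, the strand and two scores); the pattern and the names `text`, `pattern` and `rows` are illustrative only:

```python
import re

# Shortened sample copied from A.txt in the question.
text = """Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
"""

# Assumed layout of a data line: position, strand, two scores, site.
pattern = re.compile(
    r"Scanning sequence ID:\s*(\S+)\s*\n"   # label
    r"\s*(\d+)\s+\([+-]\)"                  # start position and strand
    r"\s+\S+\s+\S+"                         # the two score columns
    r"\s+([A-Za-z]+)"                       # the site itself
)

# stop = start + length of the whole site, as in the opening post.
rows = [(label, int(start), int(start) + len(site))
        for label, start, site in pattern.findall(text)]
print(rows)  # [('BEST1_HUMAN', 150, 160), ('4F2_HUMAN', 365, 375)]
```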
Sample output:
Now we want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and switch to an iteration:
Then you can print the matches as follows:
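A sketch of the iteration, assuming the file is processed line by line and the most recent label is remembered. The helper `iter_sites` and both patterns are hypothetical names chosen for illustration:

```python
import re

# Sample input copied from A.txt in the question.
text = """Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
"""

header_re = re.compile(r"^Scanning sequence ID:\s*(\S+)")
# Assumed data-line layout: position, strand, two scores, site.
site_re = re.compile(r"^\s*(\d+)\s+\([+-]\)\s+\S+\s+\S+\s+([A-Za-z]+)")

def iter_sites(text):
    """Yield (label, start, stop) for every data line under every label."""
    label = None
    for line in text.splitlines():
        header = header_re.match(line)
        if header:
            label = header.group(1)  # remember the current label
            continue
        site = site_re.match(line)
        if site and label is not None:
            start = int(site.group(1))
            yield label, start, start + len(site.group(2))

rows = list(iter_sites(text))
for row in rows:
    print("%s\t%d\t%d" % row)
```

Lines that are neither a header nor a data line, such as the V$... summary lines, match neither pattern and are skipped.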
Sample output:
I think the problem is solved ;)
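If the coordinates should cover only the capitalised part of each site, as the question asks, the offset of the first capital letter can be added to the reported start. The following is a minimal sketch under that assumption; `core_span` is a hypothetical helper, and the site is taken from the fifth column because in B.txt the second-to-last column holds a factor name (Evi-1) rather than the sequence:

```python
import re

def core_span(start, site):
    """Return (start, end) of the first run of capital letters in `site`.

    `start` is the reported position of the first letter of `site`.
    Returns None if the site contains no capital letters.
    """
    core = re.search(r"[A-Z]+", site)
    if core is None:
        return None
    # Shift the reported start by the offsets of the capitalised run.
    return start + core.start(), start + core.end()

# A data line copied from B.txt in the question.
line = "512 (-) 0.988 0.981 taTAGCTaagc Evi-1 R06227"
fields = line.split()
span = core_span(int(fields[0]), fields[4])
print(span)  # (514, 519)
```

With the first line of A.txt, core_span(150, "GGAAAggccc") gives (150, 155), and with the second, core_span(354, "gtgtAGACAtt") gives (358, 363).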