Get a table from a print output (pandas)


I ran a programme called codeml, as implemented in the Python package ete3.

Here is the printed output of the model generated by codeml:

>>> print(model)
 Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        number of parameters : 23
        sites inference      : None
        sites classes        : None
        branches             : 
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
        mark: #3  , omega: 0.0109    , node_ids: 6   , name: Seq1
        mark: #4  , omega: 0.0064    , node_ids: 5   , name: Sp2
        mark: #5  , omega: 865.5116  , node_ids: 10  , name: EDGE
        mark: #6  , omega: 0.005     , node_ids: 7   , name: Seq2
        mark: #7  , omega: 0.0038    , node_ids: 11  , name: EDGE
        mark: #8  , omega: 0.067     , node_ids: 2   , name: Sp3
        mark: #9  , omega: 999.0     , node_ids: 12  , name: EDGE
        mark: #10 , omega: 0.1165    , node_ids: 3   , name: Sp4
        mark: #11 , omega: 0.1178    , node_ids: 1   , name: Sp5

But since this is only printed output, I need to get this information into a table such as:

Omega       node_ids       name 
None        8              ROOT
789.5325    9              EDGE
0.005       4              Sp1
0.0109      6              Seq1
0.0064      5              Sp2
865.5116    10             EDGE
0.005       7              Seq2
0.0038      11             EDGE
0.067       2              Sp3
999.0       12             EDGE
0.1165      3              Sp4
0.1178      1              Sp5

I need the table because I have to process this information further.

Do you have any idea how to turn printed output into a table?

Thanks for your help.

There are 3 answers

Georg M. (best answer)

I took a look at the underlying code in model.py.

It seems that you can use s = str(model) (which calls model.__str__()) to obtain this print-out as a string. From there you can parse the string using standard string operations. I don't know the exact form of your string, but your code could look something like this:

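import pandas as pd

s = str(model)  # the same text that print(model) shows
lines = s.split('\n')

lst = []
first_idx = 6  # Skip the header lines that are not of interest.
# Column names are the text before each ':' in the first branch line.
names = [field[:field.index(':')].strip() for field in lines[first_idx].split(',')]

for line in lines[first_idx:]:
    if line.strip():  # ignore empty or whitespace-only lines
        # Keep the text after each ':' and drop the '#' of the mark.
        row = [field[field.index(':')+1:].strip().strip("#") for field in line.split(',')]
        lst.append(row)

df = pd.DataFrame(lst, columns=names)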

There are prettier ways to do this, but it gets the job done.
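
One such way could be a regular expression with named groups. This is only a sketch and not part of the code above: it assumes the branch lines keep exactly the layout shown in the question, it uses a shortened copy of that print-out (sample) in place of str(model), and the names BRANCH_RE and parse_branches are made up for illustration.

```python
import re

import pandas as pd

# Shortened copy of the print-out from the question; with a real model
# this would be str(model).
sample = """ Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        number of parameters : 23
        sites inference      : None
        sites classes        : None
        branches             :
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
"""

# One named group per field; assumes the "key: value , key: value" layout above
# and names without spaces or commas.
BRANCH_RE = re.compile(
    r"mark:\s*#(?P<mark>\d+)\s*,"
    r"\s*omega:\s*(?P<omega>[^\s,]+)\s*,"
    r"\s*node_ids:\s*(?P<node_ids>\d+)\s*,"
    r"\s*name:\s*(?P<name>[^\s,]+)"
)


def parse_branches(text):
    """Return a DataFrame with one row per branch line found in text."""
    rows = [m.groupdict() for m in BRANCH_RE.finditer(text)]
    return pd.DataFrame(rows, columns=["mark", "omega", "node_ids", "name"])


df = parse_branches(sample)
print(df)
```

Because the pattern only matches branch lines, the header lines need no skipping by index; lines that do not fit the pattern are silently ignored, which may or may not be what you want.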

Jonathan Scholbach

Your question rests on two implicit assumptions that are worth questioning:

Why print?

Why do you print the model in the first place? Printing is not a good way to access the internals of the model programmatically: the output is meant to be read by humans, and you cannot be sure that the __str__() method used for printing includes all of the model's information. I would find out how the Evolutionary Model is structured, turn that structure into a dictionary, and create a dataframe from it with pandas.DataFrame.from_dict.

Start by taking a look at model.__dict__ (or vars(model)) and repr(model). Note that __dict__ is an attribute, not a method, so it is not called.

If you can have a look at the code that defines Evolutionary Model, you can of course look up the structure of Evolutionary Model directly and turn it into a dictionary.
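
As a rough illustration of the dictionary route, the snippet below builds a dataframe from a hand-written dictionary. The dictionary branches is a stand-in invented for this sketch (keyed by node id, with mark, omega and name fields copied from the print-out in the question); the real attribute layout of the model may differ and should be checked with vars(model) first.

```python
import pandas as pd

# Hypothetical stand-in for the model's internal data: one entry per node id.
# Values are copied from the print-out in the question; the real structure
# inside the model object is an assumption and may look different.
branches = {
    8: {"mark": "#0", "omega": None, "name": "ROOT"},
    9: {"mark": "#1", "omega": 789.5325, "name": "EDGE"},
    4: {"mark": "#2", "omega": 0.005, "name": "Sp1"},
}

# orient="index" turns each outer key into a row label and each inner
# dictionary into that row's columns.
df = pd.DataFrame.from_dict(branches, orient="index")

# Move the node ids from the index into an ordinary column.
df.index.name = "node_ids"
df = df.reset_index()
print(df)
```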

Why dataframe?

If you just want to "parse" the model, that is, if you just want programmatic access to its attributes, putting everything into a dataframe is a lot of extra work. Just access the attributes directly, for instance model.branches if you want the value of the model's branches attribute.

Lambda

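You can use StringIO and applymap:

from io import StringIO
import pandas as pd

# str(model) is the text that print(model) shows; skip the 6 header lines.
df = pd.read_csv(StringIO(str(model)), skiprows=6, names=['mark', 'omega', 'node_ids', 'name'])
# Keep only the value after "key:" in every cell and trim the padding.
# (pandas 2.1+ deprecates applymap in favour of DataFrame.map.)
df = df.applymap(lambda x: x.split(":")[1].strip())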

Output:

    mark    omega       node_ids    name
0   #0      None        8           ROOT
1   #1      789.5325    9           EDGE
2   #2      0.005       4           Sp1
3   #3      0.0109      6           Seq1
4   #4      0.0064      5           Sp2
5   #5      865.5116    10          EDGE
6   #6      0.005       7           Seq2
7   #7      0.0038      11          EDGE
8   #8      0.067       2           Sp3
9   #9      999.0       12          EDGE
10  #10     0.1165      3           Sp4
11  #11     0.1178      1           Sp5
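
Whichever parsing route is used, the resulting columns hold strings. A possible follow-up step, sketched here with a small hand-typed frame standing in for the parsed result, converts omega and node_ids to numbers; the literal string "None" in the root row becomes NaN.

```python
import pandas as pd

# Hand-typed stand-in for the parsed table: every value is still a string.
df = pd.DataFrame(
    {
        "mark": ["#0", "#1", "#2"],
        "omega": ["None", "789.5325", "0.005"],
        "node_ids": ["8", "9", "4"],
        "name": ["ROOT", "EDGE", "Sp1"],
    }
)

# errors="coerce" replaces anything that is not a number ("None" here) with NaN.
df["omega"] = pd.to_numeric(df["omega"], errors="coerce")
df["node_ids"] = df["node_ids"].astype(int)

# Numeric columns can now be filtered, e.g. branches with omega above 1.
high_omega = df[df["omega"] > 1]
print(high_omega)
```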