Parse multicolumn string using python


I'm trying to extract data from the text output of a cheminformatics program called NWChem. I've already extracted the part of the output that I'm interested in (the vibrational modes). Here is the string that I have extracted:

s = '''                   1           2           3           4           5           6

 P.Frequency       -0.00        0.00        0.00        0.00        0.00        0.00

           1    -0.23581     0.00000     0.00000     0.00000     0.01800    -0.04639
           2     0.00000     0.25004     0.00000     0.00000     0.00000     0.00000
           3    -0.00000     0.00000     0.00000     0.00000    -0.21968    -0.08522
           4    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           5     0.00000     0.00000     0.99611     0.00000     0.00000     0.00000
           6     0.00192     0.00000     0.00000     0.00000    -0.42262     0.43789
           7    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           8     0.00000     0.00000     0.00000     0.99611     0.00000     0.00000
           9    -0.00193     0.00000     0.00000     0.00000    -0.01674    -0.60834

                    7           8           9

 P.Frequency     1583.30     3661.06     3772.30

           1    -0.00000    -0.00000     0.06664
           2     0.00000     0.00000     0.00000
           3    -0.06754     0.04934     0.00000
           4     0.41551     0.56874    -0.52878
           5     0.00000     0.00000     0.00000
           6     0.53597    -0.39157     0.42577
           7    -0.41551    -0.56874    -0.52878
           8     0.00000     0.00000     0.00000
           9     0.53597    -0.39157    -0.42577'''

First I split the data into blocks of rows with a regex.

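import re
# Split just before each column-number header (the line preceding a "P.Frequency" line).
# Raw string, a character class without the stray "|", and an escaped dot.
p = re.compile(r'\n +(?=[\d ]+\n\n P\.Frequency +)')
d = p.split(s)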
print(d[0])

                   1           2           3           4           5           6

 P.Frequency       -0.00        0.00        0.00        0.00        0.00        0.00

           1    -0.23581     0.00000     0.00000     0.00000     0.01800    -0.04639
           2     0.00000     0.25004     0.00000     0.00000     0.00000     0.00000
           3    -0.00000     0.00000     0.00000     0.00000    -0.21968    -0.08522
           4    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           5     0.00000     0.00000     0.99611     0.00000     0.00000     0.00000
           6     0.00192     0.00000     0.00000     0.00000    -0.42262     0.43789
           7    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           8     0.00000     0.00000     0.00000     0.99611     0.00000     0.00000
           9    -0.00193     0.00000     0.00000     0.00000    -0.01674    -0.60834

However, I can't figure out how to extract the vibrational modes, which are laid out vertically (one mode per column). I would like easy access to each vibrational mode as a list of lists, or maybe a numpy array, like this:

[[-0.00, -0.23581, 0.0000, ..., -0.00193],
 [0.00, 0.00000, ..., 0.00000],
  ...
 [3772.30, 0.06664, ..., 0.0000, -0.42577]]

There are 2 answers

Answer by hpaulj (best answer):

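With two np.genfromtxt reads (numpy imported as np) I can load your data file into two arrays and concatenate them into one 9x9 array. The P.Frequency rows are skipped here, so the array holds only the mode components: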

In [134]: rows1 = np.genfromtxt('stack30874236.txt',names=None,skip_header=4,skip_footer=10)

In [135]: rows2 = np.genfromtxt('stack30874236.txt',names=None,skip_header=17)

In [137]: rows=np.concatenate([rows1[:,1:],rows2[:,1:]],axis=1)

In [138]: rows
Out[138]: 
array([[-0.23581,  0.     ,  0.     ,  0.     ,  0.018  , -0.04639, -0.     , -0.     ,  0.06664],
       [ 0.     ,  0.25004,  0.     ,  0.     ,  0.     ,  0.     , 0.     ,  0.     ,  0.     ],
       ...
       [-0.00193,  0.     ,  0.     ,  0.     , -0.01674, -0.60834, 0.53597, -0.39157, -0.42577]])

In [139]: rows.T
Out[139]: 
array([[-0.23581,  0.     , -0.     , -0.23425,  0.     ,  0.00192,  -0.23425,  0.     , -0.00193],
       [ 0.     ,  0.25004,  0.     ,  0.     ,  0.     ,  0.     ,
       ...
       [ 0.06664,  0.     ,  0.     , -0.52878,  0.     ,  0.42577, -0.52878,  0.     , -0.42577]])

I chose the skip_header and skip_footer values by hand to fit this data file. Deducing them with code would take some more work.
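One way the hand-tuned skip values might be avoided is to walk the text line by line and start a new block at every P.Frequency line. The following is only a sketch, not the code used above: parse_modes is a hypothetical helper name, the sample string is a made-up miniature in the same layout, and it assumes that every block has a P.Frequency line and that column-number header lines contain nothing but integers.

```python
def parse_modes(text):
    """Return one list per mode: [frequency, component 1, ..., component n]."""
    modes = []   # modes collected from blocks that are already complete
    block = []   # one list per column of the block currently being read
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                          # blank separator line
        if fields[0] == "P.Frequency":
            modes.extend(block)               # the previous block is finished
            block = [[float(x)] for x in fields[1:]]
        elif all(f.isdigit() for f in fields):
            continue                          # column-number header line
        else:
            # data row: skip the leading row index, append one value per column
            for column, x in zip(block, fields[1:]):
                column.append(float(x))
    modes.extend(block)                       # flush the last block
    return modes


# Made-up miniature input in the same layout (two blocks of different width).
sample = """\
                   1           2

 P.Frequency       -0.00     1583.30

           1    -0.23581     0.00000
           2     0.00000     0.25004

                    3

 P.Frequency     3661.06

           1     0.06664
           2     0.00000
"""

modes = parse_modes(sample)
for mode in modes:
    print(mode)
```

Because each mode comes back as a plain list starting with its frequency, np.array(modes) should give the one-row-per-mode array asked for in the question, provided all blocks have the same number of data rows.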

Answer by Ben:

As hpaulj suggested, the numpy function genfromtxt is very handy for parsing such strings. However, as I'm using Python 3, I need to convert my string to a bytes stream before passing it to this function.

Here's the code that did the trick:

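import numpy as np
from io import BytesIO

# d is the list of text blocks produced by the regex split in the question
blocks = []
for block in d:
    # skip_header=1 drops the column-number line; "P.Frequency" is read as nan
    values = np.genfromtxt(BytesIO(block.encode('utf-8')), skip_header=1)
    # transpose so each row is one mode, then drop the label/row-index column
    blocks.append(values.transpose()[1:])
data = np.concatenate(blocks)  # one row per mode: frequency, then components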
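With more recent NumPy releases (1.14 and later, as far as I know) genfromtxt also accepts text input, so the encode step may no longer be needed. The sketch below reads a block through io.StringIO instead of BytesIO. The block string is a made-up miniature in the same layout as one element of d, and the version requirement is an assumption worth checking against your installation.

```python
from io import StringIO

import numpy as np

# Toy block in the same layout as one piece of `d`; the numbers are made up.
block = """\
        1        2

 P.Frequency    10.00    20.00

     1     0.10     0.20
     2     0.30     0.40
"""

# skip_header=1 drops the column-number line; the "P.Frequency" label
# cannot be converted to a float and is read as nan.
values = np.genfromtxt(StringIO(block), skip_header=1)

# Transpose so each row is one mode, then drop the label/row-index column.
modes = values.T[1:]
print(modes)
```

Each row of modes then starts with the frequency, followed by the components of that mode, which is the same layout the loop above produces for each block.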