Unable to extract necessary information from my .txt file


0 .17 .29 d ih
1 .29 .73 k l ay n d
1 .73 .84 g ih

This is a sample of the .txt file that I am working on.

I tried using np.loadtxt() to extract the last column:

    syl_array = []
    try:
        # The context manager closes the file even if loadtxt raises
        with open(syl_file, 'r') as fid:
            syl_array = np.loadtxt(fid, usecols=(0, 1, 2, 3), dtype={'names': ('a', 'b', 'c', 'd'), 'formats': ('i4', 'f4', 'f4', 'U10')})
    except OSError:
        print('File does not exist')
        return

    labels = syl_array['a']
    spurtStartTimes = syl_array['b']
    spurtEndTimes = syl_array['c']
    syllables = syl_array['d']

This code gives the following output:

--['d' 'k' 'g']--


But the output I want is:

--['d ih', 'k l ay n d', 'g ih']--


I want each group of syllables from the same row to be one element in the array. How do I achieve this?

Answered by JarbingleMan

If you have control over how the file itself is generated, what you are missing is a meaningful delimiter. The problem is that no standard parser can know that the space between 0 and .17 separates two columns, whereas the space between d and ih does NOT.
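To make the ambiguity concrete, here is a small sketch that applies plain `str.split` to the first line of the sample file. The variable names are only illustrative and are not part of the original code.

```python
line = "0 .17 .29 d ih"

# Splitting on every run of whitespace yields five tokens, so a parser
# that expects four columns only ever sees 'd' in the fourth position.
tokens = line.split()
print(tokens)   # ['0', '.17', '.29', 'd', 'ih']

# Limiting the number of splits keeps everything after the third
# column together as a single field.
limited = line.split(maxsplit=3)
print(limited)  # ['0', '.17', '.29', 'd ih']
```

A whitespace-delimited reader behaves like the first call, which is why only the first syllable of each row ends up in column 'd'.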

If you replace the spaces that separate columns with a delimiter other than a space (e.g. a comma or a tab), you can get numpy to do what you want.

"""
syl_file contents:
0\t.17\t.29\td ih
1\t.29\t.73\tk l ay n d
1\t.73\t.84\tg ih
"""
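import numpy as np

arr = np.loadtxt(
    syl_file,
    delimiter="\t",  # only tabs separate columns, so spaces stay inside a field
    dtype=dict(
        names=('a', 'b', 'c', 'd'),
        formats=('i4', 'f4', 'f4', 'U10')
    )
)
print(repr(arr))  # repr shows the dtype as well as the records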
"""
Output:
array(
    [
        (0, 0.17, 0.29, 'd ih'),
        (1, 0.29, 0.73, 'k l ay n d'),
        (1, 0.73, 0.84, 'g ih')
    ],
    dtype=[('a', '<i4'), ('b', '<f4'), ('c', '<f4'), ('d', '<U10')]
)
"""

However, if you truly have no control over how syl_file is generated, then you will need to write your own custom parser. Depending on how big the file is, you could write something as simple as:

import numpy as np

rows = []
with open("/tmp/tmp.txt") as f:
    for row in f:  # iterate line by line instead of loading the whole file
        if row.strip() == "":
            continue  # skip blank lines
        parsed = row.split()
        parsed[0] = int(parsed[0])
        parsed[1:3] = map(float, parsed[1:3])
        parsed[3] = " ".join(parsed[3:])  # Combine the remaining columns into a single value
        rows.append(parsed[:4])  # Our result is in the first 4 columns!

arr = np.array(rows, dtype=object)
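If you would rather keep the structured array from the question, so that `syl_array['d']` still works, a variation on the same idea is sketched below. It assumes every non-blank line has at least four whitespace-separated fields and that no syllable group is longer than 10 characters (longer strings would be truncated by `'U10'`). The `io.StringIO` object is only a stand-in for the real file.

```python
import io
import numpy as np

# Stand-in for the real file; in practice use open(syl_file) instead.
sample = io.StringIO("0 .17 .29 d ih\n1 .29 .73 k l ay n d\n1 .73 .84 g ih\n")

dtype = [('a', 'i4'), ('b', 'f4'), ('c', 'f4'), ('d', 'U10')]
records = []
for line in sample:
    if not line.strip():
        continue  # skip blank lines
    # Split on the first three runs of whitespace only, so the rest of
    # the line (the syllables) stays together as one field.
    label, start, end, syllables = line.split(maxsplit=3)
    records.append((int(label), float(start), float(end), syllables.strip()))

# A list of tuples converts directly into a structured array.
syl_array = np.array(records, dtype=dtype)
print(syl_array['d'])  # ['d ih' 'k l ay n d' 'g ih']
```

The `.strip()` call is needed because, when `maxsplit` is given, the last field keeps its trailing newline.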