Python Pandas filter rows based on the string value of an entry


I have an excel sheet (Bloomberg Data License output) I read in with

import pandas as pd
raw_data = pd.read_excel('my-file.xlsx')

There is one column (START-OF-FILE) and a varying number of rows, depending on the amount of data returned.

I am interested in the data between two rows, specifically START-OF-DATA and END-OF-DATA. The rows in the column look like

19                                      START-OF-DATA
20  WTS Equity|0|6|WTS|50545|54.440000|54.000000|5...
21  XOM Equity|0|6|XOM|6555175|84.950000|85.300000...
22  SUP Equity|0|6|SUP|27405|19.250000|19.200000|1...
23                                        END-OF-DATA

with a varying number of rows (not always 20 to 22). How can I filter the rows in the column to only the data between those cells, i.e. raw_data['START-OF-FILE']['START-OF-DATA' : 'END-OF-DATA'], and then use str.split('|') to separate the pipe-delimited data into separate columns of a new DataFrame?

There are 2 answers

Answer by gbrener (accepted):
>>> import pandas as pd
>>> df = pd.DataFrame(['abcdef', 'START-OF-DATA', 'g|h|i', 'j|k|l', 'm|n|o', 'END-OF-DATA', 'pqrstu'], columns=['A'])
>>> df
               A
0         abcdef
1  START-OF-DATA
2          g|h|i
3          j|k|l
4          m|n|o
5    END-OF-DATA
6         pqrstu
>>> start, end = df[df['A'].str.contains('(?:START|END)-OF-DATA')].index.tolist()
>>> pd.DataFrame(df[start+1:end]['A'].str.split('|').tolist(), columns=['A', 'B', 'C'])
   A  B  C
2  g  h  i
3  j  k  l
4  m  n  o
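In newer pandas versions, `str.split('|', expand=True)` returns a DataFrame directly, which avoids the `tolist()` round trip; a sketch of the same example, using `isin` to locate the marker rows:

```python
import pandas as pd

df = pd.DataFrame(['abcdef', 'START-OF-DATA', 'g|h|i', 'j|k|l', 'm|n|o',
                   'END-OF-DATA', 'pqrstu'], columns=['A'])

# Locate the two marker rows by exact match, then slice between them.
start, end = df.index[df['A'].isin(['START-OF-DATA', 'END-OF-DATA'])]

# expand=True splits each string into its own column in one step.
result = df['A'].iloc[start + 1:end].str.split('|', expand=True)
result.columns = ['A', 'B', 'C']
```

This assumes exactly one START-OF-DATA and one END-OF-DATA row; with the default RangeIndex, the index labels and `iloc` positions coincide.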

Answer by kennes:

I wrote a function to do this for you:

def select_data(data, column, start, stop):

    store = []

    for idx, item in enumerate(data[column]):
        if item in [start, stop]:
            store.append(idx)

    output = pd.DataFrame(list(data[column][store[0]+1:store[1]].str.split('|')))

    return output

Output:

In [231]: df = pd.DataFrame(['a|b|c|d']*20, columns=['test-col']);

In [232]: df.loc[5, 'test-col'] = 'START'

In [233]: df.loc[17, 'test-col'] = 'STOP'

In [234]: df
Out[234]: 
   test-col
0   a|b|c|d
1   a|b|c|d
2   a|b|c|d
3   a|b|c|d
4   a|b|c|d
5     START
6   a|b|c|d
7   a|b|c|d
8   a|b|c|d
9   a|b|c|d
10  a|b|c|d
11  a|b|c|d
12  a|b|c|d
13  a|b|c|d
14  a|b|c|d
15  a|b|c|d
16  a|b|c|d
17     STOP
18  a|b|c|d
19  a|b|c|d

In [235]: test = select_data(df, 'test-col', 'START', 'STOP')

In [236]: test
Out[236]: 
    0  1  2  3
0   a  b  c  d
1   a  b  c  d
2   a  b  c  d
3   a  b  c  d
4   a  b  c  d
5   a  b  c  d
6   a  b  c  d
7   a  b  c  d
8   a  b  c  d
9   a  b  c  d
10  a  b  c  d

To follow up here, I ran some time tests on both functions. Here are the results:

In [1]: import pandas as pd

In [2]: import timeit

In [3]: df = pd.DataFrame(['a|b|c|d']*10000, columns=['test-col'])

In [4]: df.loc[2654, 'test-col'] = 'START'

In [5]: df.loc[9000, 'test-col'] = 'STOP'

In [6]: %paste
def func_1(data, column, start, stop):

        store = []

        for idx, item in enumerate(data[column]):
                if item in [start, stop]:
                        store.append(idx)

        output = pd.DataFrame(list(data[column][store[0]+1:store[1]].str.split('|')))

        return output


def func_2(df):

        start, end = df[df['test-col'].str.contains('START|STOP')].index.tolist()
        output = pd.DataFrame(df[start+1:end]['test-col'].str.split('|').tolist())

        return output

## -- End pasted text --

In [7]: %timeit func_1(df, 'test-col', 'START','STOP')
100 loops, best of 3: 8.95 ms per loop

In [8]: %timeit func_2(df)
100 loops, best of 3: 13.3 ms per loop

My hypothesis for why func_1() is faster than func_2() is that the .contains() method uses regex pattern matching, which costs more than the plain equality checks in func_1().
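If the regex is indeed the cost, it can be avoided entirely with exact matching; a hypothetical func_3 sketching this with `isin`, which tests set membership without invoking the regex engine:

```python
import pandas as pd

def func_3(df):
    # Exact-match lookup via isin: no regex engine involved.
    start, end = df.index[df['test-col'].isin(['START', 'STOP'])]
    # expand=True splits the pipe-delimited strings straight into columns.
    return df['test-col'].iloc[start + 1:end].str.split('|', expand=True)

df = pd.DataFrame(['a|b|c|d'] * 20, columns=['test-col'])
df.loc[5, 'test-col'] = 'START'
df.loc[17, 'test-col'] = 'STOP'
out = func_3(df)
```

Like the answers above, this assumes exactly one START and one STOP marker in the column.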