Python Pandas filter rows based on the string value of an entry

Question

Python Pandas filter rows based on the string value of an entry

1.6k views Asked by idclark At 19 June 2015 at 22:41

I have an excel sheet (Bloomberg Data License output) I read in with

import pandas as pd
raw_data = pd.read_excel('my-file.xlsx')

There is one column (START-OF-FILE) and a varying number rows, depending on the amount of data returned.

I am interested in the data between two rows, specifically START-OF-DATA and END-OF-DATA. The rows in the column look like

19                                      START-OF-DATA
20  WTS Equity|0|6|WTS|50545|54.440000|54.000000|5...
21  XOM Equity|0|6|XOM|6555175|84.950000|85.300000...
22  SUP Equity|0|6|SUP|27405|19.250000|19.200000|1...
23                                        END-OF-DATA

with a varying number of rows, (not always 20 to 22). How can I filter the rows in the column to only the data between the cells, ie raw_data['START-OF-FILE']['START-OF-DATA' : 'END-OF-DATA']. and then use str.split('|') to seperate the pipe delimited data into seperate columns for a new dataframe?

Original Q&A

There are 2 answers

kennes On 20 June 2015 at 04:54

I wrote a function to do this for you:

def select_data(data, column, start, stop):

    store = []

    for idx, item in enumerate(data[column]):
        if item in [start, stop]:
            store.append(idx)

    output = pd.DataFrame(list(data[column][store[0]+1:store[1]].str.split('|')))

    return output

Output:

In [231]: df = pd.DataFrame(['a|b|c|d']*20, columns=['test-col']);

In [232]: df.ix[5, ['test-col']] = 'START'

In [233]: df.ix[17, ['test-col']] = 'STOP'

In [234]: df
Out[234]: 
   test-col
0   a|b|c|d
1   a|b|c|d
2   a|b|c|d
3   a|b|c|d
4   a|b|c|d
5     START
6   a|b|c|d
7   a|b|c|d
8   a|b|c|d
9   a|b|c|d
10  a|b|c|d
11  a|b|c|d
12  a|b|c|d
13  a|b|c|d
14  a|b|c|d
15  a|b|c|d
16  a|b|c|d
17     STOP
18  a|b|c|d
19  a|b|c|d

In [235]: test = filter_data(df, 'test-col', 'START','STOP')

In [236]: test
Out[236]: 
    0  1  2  3
0   a  b  c  d
1   a  b  c  d
2   a  b  c  d
3   a  b  c  d
4   a  b  c  d
5   a  b  c  d
6   a  b  c  d
7   a  b  c  d
8   a  b  c  d
9   a  b  c  d
10  a  b  c  d

To follow up here, I ran some time tests on both functions. Here are the results:

In [1]: import pandas as pd

In [2]: import timeit

In [3]: df = pd.DataFrame(['a|b|c|d']*10000, columns=['test-col'])

In [4]: df.ix[2654, ['test-col']] = 'START'

In [5]: df.ix[9000, ['test-col']] = 'STOP'

In [6]: %paste
def func_1(data, column, start, stop):

        store = []

        for idx, item in enumerate(data[column]):
                if item in [start, stop]:
                        store.append(idx)

        output = pd.DataFrame(list(data[column][store[0]+1:store[1]].str.split('|')))

        return output


def func_2(df):

        start, end = df[df['test-col'].str.contains('START|STOP')].index.tolist()
        output = pd.DataFrame(df[start+1:end]['test-col'].str.split('|').tolist())

        return output

## -- End pasted text --

In [7]: %timeit func_1(df, 'test-col', 'START','STOP')
100 loops, best of 3: 8.95 ms per loop

In [8]: %timeit func_2(df)
100 loops, best of 3: 13.3 ms per loop

My hypothesis why func_1() is faster than func_2() is because of the .contains() method which uses regex to pattern match.

**gbrener** · Accepted Answer · 2015-06-20T05:23:52+00:00

>>> import pandas as pd
>>> df = pd.DataFrame(['abcdef', 'START-OF-DATA', 'g|h|i', 'j|k|l', 'm|n|o', 'END-OF-DATA', 'pqrstu', columns=['A']])
>>> df
               A
0         abcdef
1  START-OF-DATA
2          g|h|i
3          j|k|l
4          m|n|o
5    END-OF-DATA
6         pqrstu
>>> start, end = df[df['A'].str.contains('(START|END)-OF-DATA')].index.tolist()
>>> pd.DataFrame(df[start+1:end]['A'].str.split('|').tolist(), columns=['A', 'B', 'C'])
   A  B  C
2  g  h  i
3  j  k  l
4  m  n  o

TechQA.

Python Pandas filter rows based on the string value of an entry

There are 2 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in BLOOMBERG

Popular Questions

Popular Tags

Trending Questions