Calculating based on row conditions (in pandas)

I encountered the following problem: I have a pandas dataframe that looks like this.

id_tranc  sum     bid
1         4000    2.3%
1         20000   3.5%
2         100000  if >=100 000 - 1.6%, if < 100 000 - 100$
3         30000   if >=100 000 - 1.6%, if < 100 000 - 100$
1         60000   500$

Code to create the dataset:

import pandas as pd

dataframe = pd.DataFrame({
    'id_tranc': [1, 1, 2, 3, 1],
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$', 
            'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})

I need to calculate the commission from the 'sum' and 'bid' columns. The final dataframe should look like this:

id_tranc  sum     bid                                       comission
1         4000    2.3%                                      92
1         20000   3.5%                                      700
2         100000  if >=100 000 - 1.6%, if < 100 000 - 100$  1600
3         30000   if >=100 000 - 1.6%, if < 100 000 - 100$  100
1         60000   500$                                      500

If I calculate it with df['commission'] = df['sum'] * df['bid'], I only get a result for the first 2 records. Please tell me how to do this correctly.

There are 5 answers

0
mozway (best answer)

I would write a small parser based on a regex and operator:

from operator import ge, lt, gt, le
import re

def logic(value, bid):
    # define operators, add other ones if needed
    ops = {'>=': ge, '>': gt, '<': lt, '<=': le}
    # remove spaces, split conditions on comma
    conditions = bid.replace(' ', '').split(',')
    # then loop over them, the first match will be used
    for cond in conditions:
        # extract operator, threshold, commission, unit
        # (raw string so that \W and \d are not treated as invalid string escapes)
        m = re.search(r'(?:if(\W+)(\d+)-)?(\d+\.?\d*)([%$])', cond)
        if not m:     # if no match, ignore
            continue
        op, thresh, com, unit = m.groups()
        # if no condition or condition is valid
        if (not op) or (op and ops[op](value, float(thresh))):
            if unit == '%':                     # handle % case
                return value * float(com)/100
            elif unit == '$':                   # handle fixed com
                return float(com)

df['comission'] = [logic(val, bid) for val, bid in zip(df['sum'], df['bid'])]

# or with apply, which is less efficient
# df['comission'] = df.apply(lambda row: logic(row['sum'], row['bid']), axis=1)

Output:

   id_tranc     sum                                       bid  comission
0         1    4000                                      2.3%       92.0
1         1   20000                                      3.5%      700.0
2         2  100000  if >=100 000 - 1.6%, if < 100 000 - 100$     1600.0
3         3   30000  if >=100 000 - 1.6%, if < 100 000 - 100$      100.0
4         1   60000                                      500$      500.0

Regex:

(?:if(\W+)(\d+)-)?  # optionally match a condition (operator and threshold)
(\d+\.?\d*)         # match the value of the commission
([%$])              # match type of commission (% or $)
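To make the capture groups concrete, here is a small sketch that runs the same pattern (in verbose mode) against condition strings as they look after spaces are removed and the bid is split on commas. The names `pattern`, `samples` and `parsed` are only illustrative and are not part of the answer's code.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE for readability
pattern = re.compile(r"""
    (?:if(\W+)(\d+)-)?   # optional condition: operator and threshold
    (\d+\.?\d*)          # value of the commission
    ([%$])               # type of commission (% or $)
""", re.VERBOSE)

# Hypothetical inputs: bids after removing spaces and splitting on ','
samples = ['2.3%', 'if>=100000-1.6%', 'if<100000-100$', '500$']

# Map each sample to its (operator, threshold, commission, unit) tuple
parsed = {s: pattern.search(s).groups() for s in samples}

for s, groups in parsed.items():
    print(s, groups)
# 2.3% (None, None, '2.3', '%')
# if>=100000-1.6% ('>=', '100000', '1.6', '%')
# if<100000-100$ ('<', '100000', '100', '$')
# 500$ (None, None, '500', '$')
```

When the condition part is absent, the first two groups are None, which is what the `not op` branch in `logic` relies on.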

Reproducible input:

import pandas as pd

df = pd.DataFrame({'id_tranc': [1, 1, 2, 3, 1], 'sum': [4000, 20000, 100000, 30000, 60000],
                   'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$',
                           'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})
3
Manuel

In order to calculate the value of a third column, the other two must contain numbers (integers or floats).

This value in your third data row is not a number, so you cannot multiply anything by it:

if >=100 000 - 1.6%, if < 100 000 - 100$

Your other "bid" values ("2.3%" etc.) are also not numbers. They should be formatted as 0.023 to be used in a multiplication. I'm unsure how you even got your first two rows right.
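As a rough illustration of that conversion, the sketch below turns a plain percentage string into a fraction before multiplying. The helper name `percent_to_fraction` is hypothetical, and it assumes the string contains only a number followed by '%'; the conditional and '$' bids would still need separate handling.

```python
def percent_to_fraction(text):
    """Convert a string such as '2.3%' to the fraction 0.023."""
    return float(text.strip().rstrip('%')) / 100

# Only the two plain-percentage rows from the question
bids = ['2.3%', '3.5%']
sums = [4000, 20000]

# Round to cents to hide floating-point noise
commissions = [round(s * percent_to_fraction(b), 2) for s, b in zip(sums, bids)]
print(commissions)  # [92.0, 700.0]
```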

1
Federicofkt

This function should do the trick:

import pandas as pd
import re

data = {
    'id_tranc': [1, 1, 2, 3, 1],
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$', 
            'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']
}
df = pd.DataFrame(data)

def calculate_commission(row):
    if 'if' in row['bid']:
        parts = row['bid'].split('-')
        threshold = re.findall(r'\d+', parts[0])
        inter_value= ''.join(threshold)
        threshold = float(inter_value)
        if row['sum'] >= threshold:
            commission = float(re.findall(r'\d+\.*\d*', parts[1])[0]) * row['sum'] / 100
        else:
            commission = float(re.findall(r'\d+\.*\d*', parts[2])[0])
    elif '$' in row['bid']:
        commission = float(re.findall(r'\d+\.*\d*', row['bid'])[0])
    else:
        commission = float(row['bid'].replace('%', '')) * row['sum'] / 100
    return round(commission)

df['commission'] = df.apply(calculate_commission, axis=1)

If the condition strings come in different formats, you can adjust the regexes to match what you have. This version works well with the example provided.

0
Loginus

If you control the input data, the clean solution would be to store a function/lambda in the "bid" column and then simply run apply:

dataframe = pd.DataFrame({
    'id_tranc': [1, 1, 2, 3, 1],
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid': [lambda x: x * 0.023,
            lambda x: x * 0.035,
            lambda x: 100 if x < 100000 else x * 0.016,
            lambda x: 100 if x < 100000 else x * 0.016,
            lambda x: 500]})

dataframe['comission'] = dataframe.apply(lambda row: row['bid'](row['sum']), axis=1)
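If many rows share the same kinds of rule, the lambdas could be produced by small factory functions instead of being written out by hand. This is only a sketch of that idea in plain Python: `percent`, `fixed` and `tiered` are hypothetical helper names, and the resulting `rules` list could be used as the "bid" column in the same way as the lambdas above.

```python
def percent(rate):
    """Rule: commission is `rate` percent of the transaction sum."""
    return lambda amount: amount * rate / 100

def fixed(fee):
    """Rule: commission is a flat fee, whatever the sum."""
    return lambda amount: fee

def tiered(threshold, rate, fee):
    """Rule: `rate` percent at or above `threshold`, flat `fee` below it."""
    return lambda amount: amount * rate / 100 if amount >= threshold else fee

# One rule per row, mirroring the bids in the question
rules = [percent(2.3), percent(3.5),
         tiered(100_000, 1.6, 100), tiered(100_000, 1.6, 100),
         fixed(500)]
sums = [4000, 20000, 100000, 30000, 60000]

# Apply each row's rule to that row's sum
commissions = [rule(s) for rule, s in zip(rules, sums)]
print(commissions)
```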
0
julianf19

I would structure the data that comes in:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_tranc': [1, 1, 2, 3, 1],
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid(%)': [2.3, 3.5, 1.6, 1.6, np.nan],
    'threshold': [0,0,100000,100000,np.inf],
    'min':[np.nan,np.nan,100,100,500]
    })

And then use this code to run your calculation:

df.loc[df["sum"]>=df["threshold"],"comission"]=df["sum"]*df["bid(%)"]/100
df.loc[df["sum"]<df["threshold"],"comission"]=df["min"]

Output:

   id_tranc     sum  bid(%)  threshold  min  comission
0         1    4000     2.3          0  nan         92
1         1   20000     3.5          0  nan        700
2         2  100000     1.6     100000  100       1600
3         3   30000     1.6     100000  100        100
4         1   60000     nan        inf  500        500
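The two .loc assignments can also be written as a single vectorised expression. The following is a minimal sketch using np.where, assuming the same structured columns as above; it is an alternative formulation, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid(%)': [2.3, 3.5, 1.6, 1.6, np.nan],
    'threshold': [0, 0, 100000, 100000, np.inf],
    'min': [np.nan, np.nan, 100, 100, 500],
})

# Percentage commission at or above the threshold, fixed amount below it
df['comission'] = np.where(df['sum'] >= df['threshold'],
                           df['sum'] * df['bid(%)'] / 100,
                           df['min'])
print(df['comission'].tolist())
```

np.where evaluates both branches for every row, so the NaN in 'bid(%)' for the last row is computed but never selected because that row falls below its infinite threshold.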