Merging and concatinating simultaneously

48 views Asked by At

I have multiple data frames, each representing monthly progress. My mission is to join them with two condition step by step. So here is three sample data frame, and I will try to explain what I want to achieve.

import pandas as pd

data_22_3 = {
    'id_number': ['A123', 'B456', 'C789'],
    'company': ['Insurance1', 'Insurance2', 'Insurance3'],
    'type': ['A', 'A', 'C'],
    'Income': [100, 200, 300]
}
df_22_3 = pd.DataFrame(data_22_3)

data_22_4 = {
    'id_number': ['A123', 'B456', 'D012'],
    'company': ['Insurance1', 'Insurance2', 'Insurance1'],
    'type': ['A', 'B', 'B'],
    'Income': [150, 250, 400]
}
df_22_4 = pd.DataFrame(data_22_4)

data_22_5 = {
    'id_number': ['A123', 'C789', 'E034'],
    'company': ['Insurance1', 'Insurance3', 'Insurance5'],
    'type': ['A', 'C', 'B'],
    'Income': [180, 320, 500]
}
df_22_5 = pd.DataFrame(data_22_5)

So, let's take the first data frame as the main data frame. The joining first and second data frame should be as following:

If id_number of second data frame exists on the first one, then the new column of Income_2 should be added to the first data frame as the next month's income. Since the next data frame has the same columns, all of the columns except Income should be disregarded.

If id_number of second data frame does not exist on the first one, then the whole row should be added to the first data frame. The only thing to consider is to place Income value to the Income_2 column, and putting 0 to the Income column, since the value belongs to the next month.

The resulting data frame then should be joined with the next dataframe in a similar method and so on.

Even if there is a difference on values of other columns such as Type or Company, as long as id_number is the same, the former data frame value should be taken.

I might be explaining inadequately, but the result is something like this:

data_all = {
    'id_number': ['A123', 'B456', 'C789', 'D012', 'E034'],
    'company': ['Insurance1', 'Insurance2', 'Insurance3', 'Insurance1', 'Insurance5'],
    'type': ['A', 'A', 'C', 'B', 'B'],
    'Income': [100, 200, 300, 0, 0],
    'Income_2': [150, 250, 0, 400, 0],
    'Income_3': [180, 0, 320, 0, 500]
}

all_df = pd.DataFrame(data_all)
3

There are 3 answers

0
wjandrea On BEST ANSWER

Instead of simultaneously, concat first (without Income) and drop duplicates, then merge each Income.

dfs = [df_22_3, df_22_4, df_22_5]
result = (
    pd.concat(dfs)
    .drop(columns='Income')
    .drop_duplicates('id_number')
    .reset_index(drop=True)
)

For the merge, manually reindexing is the cleanest way I found to fill missing values:

for i, df in enumerate(dfs, start=1):
    result[f'Income_{i}'] = (
        df
        .set_index('id_number')
        ['Income']
        .reindex(result['id_number'], fill_value=0)
        .reset_index(drop=True)
    )

At this point I have the result's Income column labelled as Income_1 for consistency, but it can be renamed:

result.rename(columns={'Income_1': 'Income'})
  id_number     company type  Income  Income_2  Income_3
0      A123  Insurance1    A     100       150       180
1      B456  Insurance2    A     200       250         0
2      C789  Insurance3    C     300         0       320
3      D012  Insurance1    B       0       400         0
4      E034  Insurance5    B       0         0       500
0
Andrej Kesely On

You can try:

out = pd.concat(
    [
        df_22_3.set_index("id_number"),
        df_22_4.set_index("id_number")[["company", "type", "Income"]].add_suffix("_2"),
        df_22_5.set_index("id_number")[["company", "type", "Income"]].add_suffix("_3"),
    ],
    axis=1,
).reset_index()

out["company"] = out[["company", "company_2", "company_3"]].apply(
    lambda x: x[x.first_valid_index()], axis=1
)

out["type"] = out[["type", "type_2", "type_3"]].apply(
    lambda x: x[x.first_valid_index()], axis=1
)

out = out.drop(columns=["type_2", "type_3", "company_2", "company_3"]).fillna(0)

print(out)

Prints:

  id_number     company type  Income  Income_2  Income_3
0      A123  Insurance1    A   100.0     150.0     180.0
1      B456  Insurance2    B   200.0     250.0       0.0
2      C789  Insurance3    C   300.0       0.0     320.0
3      D012  Insurance1    B     0.0     400.0       0.0
4      E034  Insurance5    B     0.0       0.0     500.0
0
wjandrea On

You could concat all your data together first (long format), add a column for the source df number, then pivot the Incomes accordingly. For the other columns, get the first values separately and join.

dfs = [df_22_3, df_22_4, df_22_5]
df = pd.concat([d.assign(n=n) for n, d in enumerate(dfs, start=1)])

# Columns
primary = 'id_number'
pivot = 'Income'
number = 'n'

income_pivot = (
    # Same as `df.pivot`, but with a `fill_value`
    df.set_index([primary, number])[pivot].unstack(number, fill_value=0)
    # Change labels
    .add_prefix(f'{pivot}_').rename_axis(columns=None)
)

other_cols_first = df.drop(columns=[pivot, number]).groupby(primary).first()

result = other_cols_first.join(income_pivot)
              company type  Income_1  Income_2  Income_3
id_number
A123       Insurance1    A       100       150       180
B456       Insurance2    A       200       250         0
C789       Insurance3    C       300         0       320
D012       Insurance1    B         0       400         0
E034       Insurance5    B         0         0       500

At this point, to get the exact output you showed, you can just rename the Income_1 column and reset the index.

result.rename(columns={'Income_1': 'Income'}).reset_index()
  id_number     company type  Income  Income_2  Income_3
0      A123  Insurance1    A     100       150       180
1      B456  Insurance2    A     200       250         0
2      C789  Insurance3    C     300         0       320
3      D012  Insurance1    B       0       400         0
4      E034  Insurance5    B       0         0       500