Replace NaN with Condition based on another column


I have the following dataset with NaNs:

County                                                 0
City                                                   0
State                                                  0
Postal Code                                            0
Model Year                                             0
Make                                                   0
Model                                                286
Electric Vehicle Type                                  0
Clean Alternative Fuel Vehicle (CAFV) Eligibility      0
Electric Range                                         0
Base MSRP                                              0
Legislative District                                 312
DOL Vehicle ID                                         0
Vehicle Location                                       0
Electric Utility                                       0
2020 Census Tract                                      0
dtype: int64

As shown, there are 286 missing Models and 312 missing Legislative Districts.

For the Models, I've already identified the 21 Makes that have missing values, shown in the array below:

array(['KIA', 'VOLVO', 'NISSAN', 'BMW', 'TESLA', 'HONDA', 'CHEVROLET',
       'TOYOTA', 'FORD', 'VOLKSWAGEN', 'JEEP', 'HYUNDAI', 'LAND ROVER',
       'AUDI', 'CHRYSLER', 'POLESTAR', 'SUBARU', 'CADILLAC', 'MITSUBISHI',
       'FISKER', 'RIVIAN'], dtype=object)

What I'm trying to do is fill each missing Model (NaN) with the mode of Model for the corresponding Make.

Example: Model (mode) = EV6 when Make = KIA

I've also retrieved a list of the Makes and Models with the Model column showing the Mode:

df1 = (df.groupby('Make')['Model']
         .apply(lambda x: x.mode().iat[0])
         .reset_index())
    Make                  Model
0   AUDI                  E-TRON
1   AZURE DYNAMICS        TRANSIT CONNECT ELECTRIC
2   BENTLEY               BENTAYGA
3   BMW                   X5
4   CADILLAC              ELR
5   CHEVROLET             BOLT EV
6   CHRYSLER              PACIFICA
7   FIAT                  500
8   FISKER                KARMA
9   FORD                  MUSTANG MACH-E
10  GENESIS               GV60
11  HONDA                 CLARITY
12  HYUNDAI               IONIQ 5
13  JAGUAR                I-PACE
14  JEEP                  WRANGLER
15  KIA                   NIRO
16  LAND ROVER            RANGE ROVER SPORT
17  LEXUS                 NX
18  LINCOLN               AVIATOR
19  LUCID                 AIR
20  MAZDA                 CX-90
21  MERCEDES-BENZ         GLC-CLASS
22  MINI                  HARDTOP
23  MITSUBISHI            OUTLANDER
24  NISSAN                LEAF
25  POLESTAR              PS2
26  PORSCHE               TAYCAN
27  RIVIAN                R1T
28  SMART                 FORTWO ELECTRIC DRIVE
29  SUBARU                SOLTERRA
30  TESLA                 MODEL 3
31  TH!NK                 CITY
32  TOYOTA                PRIUS PRIME
33  VOLKSWAGEN            ID.4
34  VOLVO                 XC90
35  WHEEGO ELECTRIC CARS  WHEEGO

How can I:

A) Replace each NaN in Model with the mode from the list above, based on its Make?

B) Combine the mode calculation, the list, and the NaN replacement into a single step instead of splitting the work (if that is possible)?

Thanks a lot, any input will be valuable.

There are 2 answers

0
Leo On

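I would say that the easiest way to do this is to mask out the NaN values and then fill them from a dictionary. Note that the toy example below fills a missing brand from the model; for your data, swap the roles of the two columns.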

import pandas as pd
import numpy as np

# Creating example dataset
df = pd.DataFrame({"brand": ["Audi", np.nan], "model": ["a4", "ELR"]})

# The dictionary with model to brand
model_to_brand = {"ELR": "CADILLAC", "E-TRON": "Audi"}

mask = df["brand"].isna() # Creating a boolean mask

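# Only filling values of the cars selected by the mask.
# map() is used instead of replace() so that a model missing from the
# dictionary stays NaN rather than being copied into the brand column.
df.loc[mask, "brand"] = df.loc[mask, "model"].map(model_to_brand)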

# Printing out the modified dataframe
print(df)
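
Applied in the direction the question asks for (Make to Model), the same masking idea might look roughly like the sketch below. The sample rows are made up for illustration and make_to_model is just a name I picked; only the Make and Model column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "Make": ["KIA", "KIA", "KIA", "TESLA", "TESLA", "TESLA"],
    "Model": ["NIRO", "NIRO", np.nan, "MODEL 3", np.nan, "MODEL 3"],
})

# Most frequent known Model per Make, as a plain dict.
make_to_model = (
    df.dropna(subset=["Model"])
      .groupby("Make")["Model"]
      .agg(lambda s: s.mode().iat[0])
      .to_dict()
)

mask = df["Model"].isna()  # rows that still need a Model

# Look up the Make of each masked row; a Make without an entry stays NaN.
df.loc[mask, "Model"] = df.loc[mask, "Make"].map(make_to_model)

print(df)
```

Dropping the NaN rows before grouping means a Make with no known Model at all simply gets no dictionary entry, so its rows are left missing instead of raising an error.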

Hope this helps!

0
Mushood Hanif On

I understand from your question that you want to replace all the NaN values in Model with the Mode (Most Common Value) based on the values in Make. This can be done using the pandas library in Python. The code is as follows:

import pandas as pd

# Assuming you already have your dataset loaded into a DataFrame called 'df'

# Create a dictionary to store the mode for each 'Make'
make_mode_dict = {}

# Iterate through unique 'Make' values
for make in df['Make'].unique():
    # Filter the DataFrame to rows with the current 'Make' value and 'Model' not NaN
    make_subset = df.loc[(df['Make'] == make) & df['Model'].notna(), 'Model']
    
    # Find the mode(s) of 'Model' for the current 'Make'
    modes = make_subset.mode()

    # Store the first mode with 'Make' as the key; skip makes with no known
    # 'Model', because .iloc[0] on an empty result would raise IndexError
    if not modes.empty:
        make_mode_dict[make] = modes.iloc[0]

# Function to replace NaN 'Model' values based on 'Make'
def replace_nan_model(row):
    if pd.isna(row['Model']):
        return make_mode_dict.get(row['Make'], None)
    return row['Model']

# Apply the function to fill NaN values in the 'Model' column
df['Model'] = df.apply(replace_nan_model, axis=1)

In the code above, we first create a dictionary make_mode_dict to store the mode of the Model column for each unique Make value. Then, we iterate through each unique Make value, filter the DataFrame to rows with that specific Make value and non-NaN Model values, find the mode of the Model column for that Make, and store it in the dictionary.
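
Two details of Series.mode() are worth keeping in mind when building that dictionary: it skips NaN by default, and it can return more than one value (or none). A small sketch with made-up values, not taken from your data:

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to show the edge cases.
tied = pd.Series(["NIRO", "EV6", np.nan])
all_missing = pd.Series([np.nan, np.nan], dtype=object)

# NaN is dropped by default, and tied values come back sorted,
# so taking the first element picks the alphabetically first one.
print(tied.mode().tolist())

# With nothing but NaN the result is empty, so taking the first
# element would raise IndexError.
print(all_missing.mode().empty)
```

So the notna() filter in the loop is harmless but not strictly required, and a Make whose Models are all missing needs separate handling.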

After that, we define a function replace_nan_model that takes a row from the DataFrame as input. If the Model value in the row is NaN, it looks up the mode from the make_mode_dict based on the corresponding Make value and returns the mode value. If the Model value is not NaN, it returns the original value.

Finally, we apply the replace_nan_model function to the DataFrame using the apply method along axis=1, which means we apply the function row-wise to fill in the NaN values in the Model column based on the Make value.

After running this code, the NaN values in the Model column are replaced with the mode for each Make. Hope this helps!
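
On part B of the question, one possible way to fold the mode calculation and the replacement into a single step is groupby(...).transform(...) combined with fillna(). This is only a sketch: the sample rows are invented, and first_mode is a helper name I made up, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; FISKER has no known Model on purpose.
df = pd.DataFrame({
    "Make": ["KIA", "KIA", "KIA", "FISKER", "TESLA", "TESLA"],
    "Model": ["NIRO", "NIRO", np.nan, np.nan, "MODEL 3", np.nan],
})

def first_mode(models):
    """Return the most frequent non-NaN value, or NaN if there is none."""
    modes = models.mode()
    return modes.iat[0] if not modes.empty else np.nan

# transform() broadcasts each group's mode back to every row of that group,
# so the result lines up with df and can be passed straight to fillna().
df["Model"] = df["Model"].fillna(df.groupby("Make")["Model"].transform(first_mode))

print(df)
```

This avoids the row-wise apply, which tends to be slow on large frames, and a Make with no known Model just stays NaN.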