Identify a Boolean in large datasets in Python

Is there a Python function to identify Boolean columns in a large dataset with 30+ columns?

The beneficiary summary file has several chronic illness columns for each member. These are Boolean fields.
1) Convert these columns into a single categorical variable, concatenating multiple True diagnoses.
2) If a member has 3 or more chronic conditions, categorise these as “Multiple”.

This is the link to the dataset:

https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/DE1_0_2009_Beneficiary_Summary_File_Sample_20.zip

These are the chronic illness columns:
SP_ALZHDMTA
SP_CHF
SP_CHRNKIDN
SP_CNCR
SP_COPD
SP_DEPRESSN
SP_DIABETES
SP_ISCHMCHT
SP_OSTEOPRS
SP_RA_OA
SP_STRKETIA

There is 1 answer

Rexy Gamaliel

I'm assuming that the value 2 corresponds to having the illness and 1 to not having it. The Boolean values of all illnesses can be combined into a single column by assigning a unique bit position to each illness. For each row, you set the bit of every illness that row has and combine the bits using the bitwise OR (|) operator. Meanwhile, you can keep a count of the number of illnesses for each row in a separate column.
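To make the bit idea concrete before applying it to the real columns, here is a tiny sketch with three made-up flags. The names FLAG_A, FLAG_B and FLAG_C are hypothetical and only serve to illustrate how OR-ing bits records a combination and how a single bit can be tested afterwards.

```python
# Hypothetical flags for illustration only; each one owns a single bit.
FLAG_A = 0b100
FLAG_B = 0b010
FLAG_C = 0b001

# A row that has conditions A and C: OR the corresponding bits together.
combined = 0
combined |= FLAG_A
combined |= FLAG_C

print(bin(combined))              # 0b101
print(bool(combined & FLAG_B))    # False, the B bit was never set
print(bin(combined).count("1"))   # 2, the number of bits that are set
```

The same integer therefore carries both which conditions are present and, via its number of set bits, how many there are.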

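import numpy as np
import pandas as pd

# `df` is assumed to be the beneficiary summary file, already loaded (e.g. with pd.read_csv)

# Define the relevant column names and a unique bit for each illness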
col_bits = {
    "SP_ALZHDMTA"   : 0b10000000000,
    "SP_CHF"        : 0b01000000000,
    "SP_CHRNKIDN"   : 0b00100000000,
    "SP_CNCR"       : 0b00010000000,
    "SP_COPD"       : 0b00001000000,
    "SP_DEPRESSN"   : 0b00000100000,
    "SP_DIABETES"   : 0b00000010000,
    "SP_ISCHMCHT"   : 0b00000001000,
    "SP_OSTEOPRS"   : 0b00000000100,
    "SP_RA_OA"      : 0b00000000010,
    "SP_STRKETIA"   : 0b00000000001,
}
col_names = col_bits.keys()

# Assume 2 means having the illness (check the dataset's codebook and
# change this to 1 if that turns out to be the "yes" code)
def has_illness(val):
    return int(val) == 2

def get_illness_bit(col_name, val):
    return col_bits[col_name] if val else 0b00000000000

# A pd Series containing the concatenation of bits representing relevant illnesses
# (indexed like df so the bitwise OR below aligns row by row)
illnesses_bits_col = pd.Series(0, index=df.index, dtype="int64")
# A pd Series containing the number of relevant illnesses for each row
illnesses_counts_col = pd.Series(0, index=df.index, dtype="int64")
for col_name in col_names:
    # pd Series containing bool value representations of the current illness `col_name`
    illness_col = df[col_name].apply(has_illness)

    # concatenate the bit representation of the current illness `col_name`
    illness_bit_col = illness_col.apply(lambda x: get_illness_bit(col_name, x))
    illnesses_bits_col |= illness_bit_col
    
    # add to counter the current illness `col_name`
    illness_count_col = illness_col.apply(lambda x: 1 if x else 0)
    illnesses_counts_col += illness_count_col
illnesses_counts_col = illnesses_counts_col.apply(lambda x: "Multiple" if x >= 3 else "-")

print(illnesses_bits_col)
print(illnesses_counts_col)

In total, there are 2^11 = 2048 possible combinations of the 11 illnesses, so each value is an integer from 0 to 2047.
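If you need a readable categorical label rather than an integer code, the following is a minimal sketch of one way to build it. It is not the only approach, and several things in it are my own assumptions: the helper name summarise_conditions, the "+" separator, the "None" label for members without any condition, and the same "2 means yes" coding as above. It runs on a small toy DataFrame so it can be tried without downloading the file.

```python
import pandas as pd

# Illness columns listed in the question.
ILLNESS_COLS = [
    "SP_ALZHDMTA", "SP_CHF", "SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN",
    "SP_DIABETES", "SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA",
]
YES_CODE = 2  # assumption: 2 means "has the illness"; verify against the codebook


def summarise_conditions(df, cols=ILLNESS_COLS, yes_code=YES_CODE):
    """Hypothetical helper: one categorical label per row."""
    flags = df[cols].eq(yes_code)      # boolean frame, True where the illness is present
    counts = flags.sum(axis=1)         # number of illnesses per row
    # Join the names of the True columns, e.g. "SP_CHF+SP_COPD".
    labels = flags.apply(lambda row: "+".join(row.index[row.to_numpy()]), axis=1)
    labels = labels.where(counts > 0, "None")      # assumed label for no conditions
    labels = labels.where(counts < 3, "Multiple")  # 3 or more conditions
    return labels.astype("category")


# Toy data: every flag starts as 1 ("no" under the assumption above).
toy = pd.DataFrame(1, index=range(4), columns=ILLNESS_COLS)
toy.loc[1, "SP_CHF"] = 2
toy.loc[2, ["SP_CHF", "SP_COPD"]] = 2
toy.loc[3, ["SP_CHF", "SP_COPD", "SP_CNCR"]] = 2

print(summarise_conditions(toy).tolist())
# ['None', 'SP_CHF', 'SP_CHF+SP_COPD', 'Multiple']
```

The row-wise apply is easy to read but not the fastest option; on a file of this size it should still be workable, and it can be swapped for a vectorised variant later if speed matters.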