I am working on a dataset with 5 columns (named 'Healthy', 'Growth', 'Refined', 'Reasoned', 'Accepted') and 50k rows, which I split into a training set (10k rows) and a validation set (the remaining 40k rows). I built a Bayesian belief network with the following edges: ('Healthy', 'Refined'), ('Healthy', 'Reasoned'), ('Refined', 'Accepted'), ('Reasoned', 'Accepted'), ('Growth', 'Accepted'). To evaluate the quality of the network, I would like to set evidence on the nodes 'Healthy', 'Growth', 'Refined' and 'Reasoned', predict the value of 'Accepted', and compare it with the actual value in the validation set. The for loop I wrote always stalls after 584 iterations without raising any error, and the kernel still looks busy.

Here is a simplified version of my code. I show only the variant of the network that estimates the parameters with maximum likelihood; the issue is the same with the other parameter-estimation methods.

import pandas as pd
from pgmpy.base import DAG 
from pgmpy.models import BayesianNetwork
from pgmpy.sampling import BayesianModelSampling
from pgmpy.factors.discrete import State

#import dataset
df = pd.read_csv("C:\\Users\\puddu\\Desktop\\Tools\\Dummy.BBN\\Dummy_data_set.csv")

#preliminary operation on dataset
df.rename(columns={'Q1.Healthy': 'Healthy', 'Q2.Growth': 'Growth',
                   'Q3.Refined': 'Refined', 'Q9.Accepted': 'Accepted',
                   'Q8.Reasoned': 'Reasoned'}, inplace=True)

nodes = ('Healthy', 'Growth', 'Refined', 'Reasoned', 'Accepted')

replies = ['E','D', 'C', 'B', 'A']

edges = [('Healthy', 'Refined'),
          ('Healthy', 'Reasoned'),
          ('Refined', 'Accepted'),
          ('Reasoned', 'Accepted'),
          ('Growth', 'Accepted')]

for nod in nodes:
    df[nod]=df[nod].astype('category')
    df[nod] = df[nod].cat.set_categories(replies, ordered=True) 

#training set definition
df_train = df.head(10000).copy().reset_index(drop= True)

#directed acyclic graph building
dag = DAG()

dag.add_edges_from(ebunch= edges)

#BBN building + estimating MLE parameters
model_mle = BayesianNetwork(dag)

model_mle.fit(df_train)

df_validation = df.iloc[10000:11000].copy().reset_index(drop=True)
inference_mle = BayesianModelSampling(model_mle)
mle_guesses = 0 
for i in range(1000):
    evidence = [State(var= 'Growth', state= df_validation['Growth'][i]),
                State(var= 'Healthy', state= df_validation['Healthy'][i]),
                State(var= 'Reasoned', state= df_validation['Reasoned'][i]),
                State(var= 'Refined', state = df_validation['Refined'][i])]
    mle_prediction = inference_mle.rejection_sample(size= 1,
                     evidence = evidence, show_progress= False)['Accepted'][0]
    result = df_validation['Accepted'][i]
    if mle_prediction == result:
        mle_guesses += 1
    print(f"Step {i}")

Thanks to everyone who will spend time helping me.

1 Answer

Ankur Ankan (accepted answer):

The way rejection sampling works is that it simulates data from the model and keeps only the samples that match the given evidence. My guess is that the probability of the evidence combination at row 585 is extremely low, so the algorithm is stuck in a loop trying to generate a sample that matches it.
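The effect is easy to see with a toy simulation (plain Python, no pgmpy; `draws_until_match` is a made-up helper): the expected number of draws before a sample matches the evidence is roughly 1 / P(evidence), so a very unlikely evidence combination makes the loop run effectively forever.

```python
import random

def draws_until_match(p_evidence, rng):
    """Count draws until a simulated sample matches the evidence."""
    draws = 1
    while rng.random() >= p_evidence:
        draws += 1
    return draws

rng = random.Random(0)

# Average draws needed when the evidence is common vs. rare.
avg_common = sum(draws_until_match(0.1, rng) for _ in range(2000)) / 2000
avg_rare = sum(draws_until_match(1e-4, rng) for _ in range(20)) / 20
# avg_common lands near 10 draws; avg_rare near 10,000 -- and with an
# evidence probability far below 1e-4, a single sample may never arrive.
```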

Some possible solutions:

If you want to keep the sampling-based inference approach, simulate some data once and estimate the probability of each evidence combination from it. In the case described above, the estimate simply comes out as 0 instead of hanging. This is also much faster, since the data only needs to be simulated once:

import numpy as np

n_samples = int(1e5)

# Simulate data from the fitted model once.
df_simulated = model_mle.simulate(n_samples)

for i in range(1000):
    e = df_validation.iloc[i, :].to_dict()
    e.pop('Accepted')  # the evidence must not contain the target variable
    # Keep the simulated rows that match the evidence, then estimate the
    # (joint) probability of each 'Accepted' value.
    matching = df_simulated.loc[np.all(df_simulated[list(e)] == pd.Series(e), axis=1)]
    result = matching['Accepted'].value_counts() / n_samples

The other way is to do exact inference:

from pgmpy.inference import VariableElimination

infer = VariableElimination(model_mle)
for i in range(1000):
    # Drop the target column so it is not passed as evidence.
    evidence = df_validation.iloc[i, :].drop('Accepted').to_dict()
    result = infer.query(['Accepted'], evidence=evidence)
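To score accuracy against the validation set, each query result still has to be turned into a hard prediction. In pgmpy the returned `DiscreteFactor` exposes the probabilities as `.values` and the labels as `.state_names`; the selection itself is just an argmax (the numbers below are made up for illustration):

```python
import numpy as np

# Made-up posterior over 'Accepted', standing in for infer.query(...):
states = ['E', 'D', 'C', 'B', 'A']                 # result.state_names['Accepted']
values = np.array([0.05, 0.10, 0.20, 0.40, 0.25])  # result.values

# Most probable state is the prediction; compare it with
# df_validation['Accepted'][i] and count matches as in the original loop.
predicted = states[int(np.argmax(values))]  # -> 'B'
```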