pgmpy: expectation maximization for Bayesian network parameter learning with missing data


I'm trying to use the pgmpy package for Python to learn the parameters of a Bayesian network. As I understand it, expectation maximization (EM) should be able to handle missing values. I am currently experimenting with a three-variable BN in which the first 500 data points are missing the value of one variable (Smoker). There are no latent variables. Although the description in pgmpy suggests that EM works with missing values, I get an error, and only when I call fit on data that contains missing values. Am I doing something wrong, or should I look into another package for EM with missing values?

# imports
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization, HillClimbSearch
from pgmpy.models import BayesianNetwork

# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]

# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)

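# create some missing values in the first 500 rows
# (copy first so the assignment does not go through a slice of `data`;
#  np.nan is used because the np.NaN alias was removed in NumPy 2.0)
new_data = new_data.copy()
new_data.iloc[:500, new_data.columns.get_loc("Smoker")] = np.nan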

# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)

The error I get is an indexing error:

  File "main.py", line 100, in <module>
    bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
  File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
    cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
    weighted_data = self._compute_weights(latent_card)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
    weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
  File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
    return op.apply().__finalize__(self, method="apply")
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
    return self.apply_standard()
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
    results, res_index = self.apply_series_generator()
  File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
    results[i] = self.f(v)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
    weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
  File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
    likelihood *= cpd.get_value(
  File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
    return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
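
The last frame of the traceback indexes a NumPy array with tuple(index). A minimal sketch, assuming the missing Smoker value reaches that tuple as a float NaN (an assumption about pgmpy's internals, not something confirmed here), reproduces the same message with NumPy alone. The array `values` and the tuple `index` are hypothetical stand-ins, not pgmpy objects:

```python
import numpy as np

# Stand-in for a CPD table over two binary variables (hypothetical, not pgmpy's object).
values = np.zeros((2, 2))

# Assumption: the missing state arrives in the index tuple as NaN instead of an integer.
index = (float("nan"), 0)

try:
    values[index]
    message = None
except IndexError as exc:
    # NumPy rejects any float (including NaN) used as an array index.
    message = str(exc)

print(message)
```

If this reading is right, the error does not come from the EM algorithm itself but from a NaN being used as a state index.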

Thanks!

There is 1 answer

Answer by Pierre-Henri Wuillemin

Since there is still no answer to your specific question, let me propose a solution with another module:

# imports
import pandas as pd
import numpy as np
import pyAgrum as gum

# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"]) 
test_data = data[:2000]
new_data = data[2000:].copy() 

# Learn structure of initial model from data
learner = gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model = learner.learnBN()

# create some missing values
# "?" marks a missing value here, instead of NaN
new_data.iloc[:500, new_data.columns.get_loc("smoking")] = "?"

# learn parameterization of BN
bn = gum.BayesNet(model)
learner2 = gum.BNLearner(new_data, model)
learner2.useEM(1e-10)  # 1e-10 is the EM stopping threshold (epsilon)
learner2.fitParameters(bn)

The same example in a notebook: EM in a notebook
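
To make concrete what EM does with a missing value when there are no latent variables, here is a minimal sketch in plain Python. It is not the implementation used by pgmpy or pyAgrum. It assumes a hypothetical two-node network Smoker -> LungCancer with binary states, where Smoker may be missing (None) and LungCancer is always observed. The function name em_two_node, the starting values and the toy data are all made up for illustration.

```python
def em_two_node(rows, n_iter=50):
    """EM for a hypothetical network S -> C with binary states.

    rows: list of (s, c) pairs, s in {0, 1, None} (None = missing), c in {0, 1}.
    Returns (P(S=1), [P(C=1|S=0), P(C=1|S=1)]).
    Assumes both states of S keep a non-zero expected count.
    """
    p_s = 0.5            # arbitrary starting value for P(S=1)
    p_c = [0.4, 0.6]     # arbitrary starting values for P(C=1 | S=s)
    for _ in range(n_iter):
        n_s = [0.0, 0.0]     # expected number of rows with S=s
        n_sc = [0.0, 0.0]    # expected number of rows with S=s and C=1
        for s, c in rows:
            if s is None:
                # E-step: posterior P(S | C=c) under the current parameters.
                like = [
                    (1.0 - p_s) * (p_c[0] if c else 1.0 - p_c[0]),
                    p_s * (p_c[1] if c else 1.0 - p_c[1]),
                ]
                total = like[0] + like[1]
                post = [like[0] / total, like[1] / total]
            else:
                # Observed rows count fully for the observed state.
                post = [1.0 - s, float(s)]
            for state in (0, 1):
                n_s[state] += post[state]
                n_sc[state] += post[state] * c
        # M-step: re-estimate the parameters from the expected counts.
        p_s = n_s[1] / len(rows)
        p_c = [n_sc[0] / n_s[0], n_sc[1] / n_s[1]]
    return p_s, p_c


# Toy data (made up): 100 complete rows plus 20 rows with Smoker missing.
complete = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 50
with_missing = complete + [(None, 1)] * 10 + [(None, 0)] * 10

print(em_two_node(complete))
print(em_two_node(with_missing))
```

With complete data the loop reduces to ordinary frequency counting. Rows with a missing Smoker value contribute fractional counts to both states, which is why they do not need to be dropped.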