Does a unique solution exist when optimizing the cross entropy in a binary logistic regression problem?


I tried to build a logistic regression model from scratch, using the Iris dataset and the example from Chapter 4 of Géron's ML book. I wanted to see whether three different fitting methods produce the same model parameters.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris["data"][:, 3:]                  # petal width only
y = (iris["target"] == 2).astype(int)    # 1 for Iris-Virginica, 0 otherwise
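
For reference, X here is just the petal width column and y flags the Iris-Virginica class, so the shapes are (150, 1) and (150,). A quick check:

print(X.shape, y.shape)   # (150, 1) (150,)
print(y.sum())            # 50 positive (Virginica) samples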

I first fit the model with scikit-learn's LogisticRegression.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

model.coef_.item(), model.intercept_.item()

showed the slope and intercept values (4.33, -7.19). Then I wrote the cross entropy objective function myself and used scipy's minimize to find the slope and intercept:

from scipy.optimize import minimize

XX = np.hstack((X, np.ones((150, 1))))   # append a column of ones for the intercept term

def obj(w, xa, ya):
    # Mean binary cross entropy; ya is reshaped to a column so it
    # broadcasts element-wise against pred, which has shape (n, 1).
    logit = xa.dot(w).reshape(-1, 1)
    pred = 1. / (1. + np.exp(-logit))
    ya = ya.reshape(-1, 1)
    return -np.mean(ya * np.log(pred) + (1. - ya) * np.log(1. - pred))

res = minimize(obj, args=(XX, y), x0=np.array([0.5, 0.5]), method='BFGS', options={'gtol': 1e-2})

res.x

This time the slope and intercept came out as (8.16, -13.34).
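
One thing I did not account for: scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), while my objective has no penalty term. A sketch of refitting with the penalty disabled, in case that explains part of the gap (this assumes a recent scikit-learn that accepts penalty=None; older releases spell it penalty='none'):

# Refit without the default L2 penalty for an apples-to-apples comparison.
model_unreg = LogisticRegression(penalty=None, max_iter=1000)
model_unreg.fit(X, y)
model_unreg.coef_.item(), model_unreg.intercept_.item()

I would expect this unregularized fit to land closer to the scipy result than to (4.33, -7.19).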

I also tried a third way, gradient descent (below), and got a different pair of values again.

total = 10000
theta = np.random.randn(2, 1)   # random initialization: [slope, intercept]
learning_rate = 0.15

for i in range(total):
    pred = 1. / (1. + np.exp(-XX.dot(theta)))
    deltaT = pred - y.reshape(-1, 1)          # reshape y to a column so deltaT stays (150, 1)
    theta = theta - learning_rate * XX.T.dot(deltaT) / 150.
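
In case 10,000 steps with this learning rate are simply not enough to converge, here is a sketch that reuses the obj function above to watch the cross entropy during training (same update rule, loss printed every 2,000 iterations):

theta = np.random.randn(2, 1)
for i in range(total):
    pred = 1. / (1. + np.exp(-XX.dot(theta)))
    theta = theta - learning_rate * XX.T.dot(pred - y.reshape(-1, 1)) / 150.
    if i % 2000 == 0:
        print(i, obj(theta.ravel(), XX, y))   # should keep decreasing

print(theta.ravel(), obj(theta.ravel(), XX, y))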

Some observations are in order.

  1. The three methods produce the same predicted classes (see the sketch after this list).
  2. The final values of the objective function are different.
  3. The intercept/slope ratio is close for all three fits, and in a binary classification problem with one-dimensional input this ratio (the decision boundary) seems to be the only number that matters for the predictions.
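
To make observations 1 and 3 concrete, here is a sketch comparing the predicted labels and the implied decision boundaries (the petal width where the probability crosses 0.5, i.e. -intercept/slope) for the three fits; model, res, and theta are the objects from above:

w_sk, b_sk = model.coef_.item(), model.intercept_.item()
w_bf, b_bf = res.x                  # scipy/BFGS solution
w_gd, b_gd = theta.ravel()          # gradient descent solution

def labels(w, b):
    # predicted class: probability >= 0.5 is the same as w*x + b >= 0
    return (X.ravel() * w + b >= 0).astype(int)

print((labels(w_sk, b_sk) == labels(w_bf, b_bf)).all())
print((labels(w_sk, b_sk) == labels(w_gd, b_gd)).all())
print(-b_sk / w_sk, -b_bf / w_bf, -b_gd / w_gd)   # decision boundaries in cm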

The objective function is convex (the Hessian of the cross entropy is positive semi-definite), so from a gradient descent point of view shouldn't we always reach the global minimum, no matter where the initial point is?
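
As a sanity check on the convexity claim: the Hessian of the (unpenalized) mean cross entropy is X^T diag(p(1-p)) X / n, which should be positive semi-definite at any parameter vector. A quick numerical check:

def hessian(w, xa):
    p = 1. / (1. + np.exp(-xa.dot(w)))            # predicted probabilities, shape (n,)
    s = p * (1. - p)                              # diagonal weights
    return xa.T.dot(xa * s[:, None]) / len(p)     # X^T diag(s) X / n

for w in (np.zeros(2), res.x, theta.ravel()):
    print(np.linalg.eigvalsh(hessian(w, XX)))     # eigenvalues should be >= 0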
