I'm working on a project that uses a Deep Q-Learning agent to learn how to select certain nodes of a 1000-node NetworkX graph. My observation_space is a (1000, 3) array, with each row holding the node label, its degree, and a variable/attribute (either 0, 1, or 2). The action_space has a (1000, 1) shape, with each element corresponding to taking an action on a specified node.
This is the code for my Deep Q Network:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

def nnmodel(observation_space, action_space):
    model = tf.keras.models.Sequential()
    # Input layer: this input_shape is the part I'm unsure about
    model.add(tf.keras.layers.Dense(128, input_shape = (None, observation_space.shape[0], observation_space.shape[1]), activation='relu'))
    model.add(tf.keras.layers.Dense(256, activation='relu'))
    model.add(tf.keras.layers.Dense(256, activation='relu'))
    # Output layer: one linear unit (Q-value) per action
    model.add(tf.keras.layers.Dense(len(action_space), activation='linear'))
    model.compile(optimizer=Adam(), loss='mse', metrics = ['accuracy'])
    return model
As I understand it, this should in theory give me the Q-values from:
q_values = model.predict(observation_space)
However, my q_values has the shape (1000, 1000) and I am unsure which "highest q-value" I should be considering that corresponds to the node that the agent should perform an action on. Is it the highest q-value entry, for which the row / column corresponds to the node the agent should be selecting? Or is it the largest row/column sum? Or is it something else entirely? Examples I've looked at online typically use np.argmax(q_values[0]), which I feel does not apply to my case.
Also, does my input_shape look correct for the problem I'm describing?
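One possible explanation for the (1000, 1000) shape, sketched in plain NumPy rather than Keras: a Dense layer acts on the last axis only, so a (1000, 3) input is read as 1000 separate samples with 3 features each, and every sample gets its own 1000 outputs. The `dense` helper below is a hypothetical stand-in for a linear layer (random weights, no training), and flattening the observation into one 3000-element sample is an assumption about what is intended, not a confirmed fix.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units, rng):
    """Hypothetical stand-in for a linear Dense layer: x @ W + b on the last axis."""
    w = rng.normal(size=(x.shape[-1], units))
    b = np.zeros(units)
    return x @ w + b

n_nodes, n_features = 1000, 3
observation = rng.random((n_nodes, n_features))  # placeholder for the real (1000, 3) state

# Case 1: the (1000, 3) array is treated as a batch of 1000 samples with 3 features.
hidden = np.maximum(dense(observation, 128, rng), 0)  # ReLU
q_per_row = dense(hidden, n_nodes, rng)
print(q_per_row.shape)  # (1000, 1000): one row of Q-values per input row

# Case 2 (assumption): the whole graph is one state, flattened to a single sample.
flat = observation.reshape(1, -1)  # shape (1, 3000)
hidden = np.maximum(dense(flat, 128, rng), 0)
q_single = dense(hidden, n_nodes, rng)
print(q_single.shape)  # (1, 1000): one Q-value per node

# With a single row of Q-values, the usual argmax pattern applies again.
action = int(np.argmax(q_single[0]))
```

If the second reading is the intended one, the Keras equivalent would presumably flatten the state before the first Dense layer and call predict on a batch of size 1, which is the setting where np.argmax(q_values[0]) makes sense.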
Any help is appreciated!
max_q = np.max(q_values)
position = np.where(q_values == max_q)
print(position)
This returns the position of the largest Q-value as a pair of index arrays (row indices, column indices). I'm unsure whether I should be selecting the node given by the row index i or the one given by the column index j.
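For reference, a minimal sketch of reading that position as a single (row, column) pair with np.unravel_index. The random array is only a placeholder for the real model.predict output, and the interpretation in the comments (rows are the input rows that were fed in, columns are the output units, one per action) is an assumption based on how a Dense output is laid out, not something verified against this model.

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = rng.normal(size=(1000, 1000))  # placeholder for model.predict(...)

# argmax on the flattened array, converted back to 2D coordinates
row, col = np.unravel_index(np.argmax(q_values), q_values.shape)

# Assumed meaning: row = which input row produced this value,
#                  col = which output unit (action / node index) it belongs to
print(int(row), int(col), q_values[row, col])
```

Under that assumption, the column index would be the candidate action, while the row index only says which of the 1000 input rows the network was looking at when it produced that value.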