I'm trying to build a DQN agent using code I found in an online paper on reinforcement learning. I'm using TensorFlow 1.14.0.
I'm running the following code:
for _ in range(NN_UPDATES_PER_WARM_START):
    print('.', end='')
    # Sample a batch from the replay buffer proportionally to the probability of sampling.
    minibatch = replay_buffer.sample_minibatch(BATCH_SIZE)
    # Use batch to train an agent. Keep track of temporal difference errors during training.
    td_error = agent.train(minibatch)
    # Update probabilities of sampling each datapoint proportionally to the error.
    replay_buffer.update_td_errors(td_error, minibatch.indices)
It raises an error on the line td_error = agent.train(minibatch).
The error is the following:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-867b2667a45e> in <module>()
      4     minibatch = replay_buffer.sample_minibatch(BATCH_SIZE)
      5     # Use batch to train an agent. Keep track of temporal difference errors during training.
----> 6     td_error = agent.train(minibatch)
      7     # Update probabilities of sampling each datapoint proportionally to the error.
      8     replay_buffer.update_td_errors(td_error, minibatch.indices)
/home/lazaioan/Desktop/RAL Codes/LAL-RL-Master-Original (copy 3)/dqn.py in train(self, minibatch, n_tensorboard)
   1009             feed_dict = {self.estimator.classifier_placeholder: minibatch.classifier_state,
   1010                 ######### self.estimator.action_placeholder: [minibatch.action_state[i] for i in range(len(minibatch.action_state))], #########
-> 1011                 self.estimator.action_placeholder: np.array(minibatch.action_state),
   1012                 self._next_best_prediction: max_prediction_batch,
   1013                 self._reward_placeholder: minibatch.reward,
ValueError: could not broadcast input array from shape (3,8) into shape (3)
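The same kind of failure seems reproducible without TensorFlow. Below is a minimal sketch, assuming only NumPy; the list action_state and the helper try_stack are hypothetical stand-ins for minibatch.action_state and the np.array call in the traceback. The exact error message depends on the NumPy version.

```python
import numpy as np

# Hypothetical stand-ins for three entries of minibatch.action_state:
# 3 features each, but a different number of columns per entry.
action_state = [np.zeros((3, 8)), np.zeros((3, 8)), np.zeros((3, 10))]

def try_stack(arrays):
    """Return the stacked array, or the ValueError message if stacking fails."""
    try:
        return np.array(arrays)
    except ValueError as exc:
        return str(exc)

# Entries with equal widths stack into one rectangular (batch, 3, k) array.
uniform = try_stack(action_state[:2])
print(uniform.shape)

# Entries with mixed widths cannot form a rectangular array, so NumPy raises.
ragged = try_stack(action_state)
print(ragged)
```

So the error appears to come from building one array out of entries with different widths, before the data ever reaches the placeholder.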
The train function is defined in another file (dqn.py). The lines where the problem occurs are the following (the failing argument is minibatch.action_state):
_, loss, summ, _td_error = self.session.run(
    [self._train_op, self._loss, self.estimator.summaries, self._td_error],
    feed_dict = {self.estimator.classifier_placeholder: minibatch.classifier_state,
        self.estimator.action_placeholder: minibatch.action_state,
        self._next_best_prediction: max_prediction_batch,
        self._reward_placeholder: minibatch.reward,
        self._terminal_placeholder: minibatch.terminal,
        self.target_estimator.classifier_placeholder: minibatch.next_classifier_state,
        self.target_estimator.action_placeholder: minibatch.action_state
        }) 
Each element of the minibatch has length 32 (the same in their code and in mine).
The problem is that in their code minibatch.action_state holds 32 sub-arrays of 3 features each, so every action is an array of 3 features, whereas I want every "action" to be a batch of such arrays. So while minibatch.action_state[0] has shape (3,) in their code, my minibatch.action_state[0] has shape (3, 10), or (3, 8) for another entry.
For example, minibatch.action_state in their code (a single array) is:
[[0.99286367, 0.9749906 , 0.97448924],
[0.36058693, 0.99352467, 0.9428495],
.....]
whereas minibatch.action_state in my code (a list of arrays) is:
[array([[0.04644732, 0.12056734, 0.36058693, 0.99352467, 0.9428495 ,
        0.01730469, 0.09757941, 0.32350904],
       [0.91286485, 0.90753378, 0.99761099, 1.0720428 , 1.07252544,
        0.90026116, 0.90931006, 0.95049693],
       [0.95847248, 0.99688911, 0.92369273, 1.16801097, 1.11937477,
        0.88151117, 0.96047654, 0.98909147]]), array([[0.53839238, 0.01215203, 0.93910424, 0.32661358, 0.14072223,
        0.04355487, 0.72415264, 0.75610243],
       [0.99404292, 1.02601031, 0.98186819, 1.00448436, 0.98644043,
        0.99286367, 0.9749906 , 0.97448924],
       [1.07852276, 0.87132646, 1.38006714, 0.70039993, 0.76713574,
        0.99388013, 1.35986748, 1.36584326]]), array([[0.21221239, 0.04032991, 0.99203309, 0.02040725, 0.83050092,
        0.41133832, 0.99203309, 0.16899745, 0.42542876, 0.8476126 ],
       [1.01898311, 0.91411494, 1.01929709, 0.95539219, 0.94656952,
        0.91259804, 1.01929709, 0.92619145, 0.91589772, 0.9621021 ],
       [1.14552466, 0.89210758, 0.99508556, 0.79885316, 0.86597793,
        0.8582233 , 0.99508556, 0.92626615, 0.84911241, 1.00648343]])
.....]
I want to pass a variable number of actions in each round (e.g. (3, 10), then (3, 8), then perhaps (3, 10) again, etc.).
I think the problem is in how the DQN agent is built (the initialization, in another .py file):
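One possible way to feed a variable number of actions is to pad every entry to a common width and keep a mask of the real columns. This is only a sketch under assumptions: the helper pad_action_states, the pad value 0.0 and the example data are hypothetical and not part of the paper's code.

```python
import numpy as np

def pad_action_states(action_states, pad_value=0.0):
    """Pad a list of (n_features, k_i) arrays to one (batch, n_features, max_k) array.

    Also returns a boolean mask of shape (batch, max_k) marking the real columns.
    """
    n_features = action_states[0].shape[0]
    max_k = max(a.shape[1] for a in action_states)
    padded = np.full((len(action_states), n_features, max_k), pad_value, dtype=np.float32)
    mask = np.zeros((len(action_states), max_k), dtype=bool)
    for i, a in enumerate(action_states):
        k = a.shape[1]
        padded[i, :, :k] = a   # copy the real columns
        mask[i, :k] = True     # remember which columns are real
    return padded, mask

# Hypothetical minibatch with 8, 10 and 8 actions per entry.
action_states = [np.ones((3, 8)), np.ones((3, 10)), np.ones((3, 8))]
padded, mask = pad_action_states(action_states)
print(padded.shape)                # (3, 3, 10)
print(mask.sum(axis=1).tolist())   # [8, 10, 8]
```

A rectangular array like this could be fed to a placeholder with a 3-D shape, but the network would still have to be adapted: tf.concat expects inputs of the same rank, and the padded columns would need to be ignored using the mask.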
class Estimator: 
  
   def __init__(self, classifier_state_length, action_state_length, is_target_dqn, var_scope_name, bias_average):
     
       self.classifier_placeholder = tf.compat.v1.placeholder(tf.float32, shape=[None, classifier_state_length], name="X_classifier")
       self.action_placeholder = tf.compat.v1.placeholder(tf.float32, shape=[None, action_state_length], name="X_datapoint")
       with tf.variable_scope(var_scope_name):
           # A fully connected layers with classifier_placeholder as input
           fc1 = tf.contrib.layers.fully_connected(
               inputs=self.classifier_placeholder,
               num_outputs=10,
               activation_fn=tf.nn.sigmoid,
               trainable=not is_target_dqn,
               variables_collections=[var_scope_name],
           )
           # Concatenate the output of first fully connected layer with action_placeholder
           fc2concat = tf.concat([fc1, self.action_placeholder], 1)
           # A fully connected layer with fc2concat as input
           fc3 = tf.contrib.layers.fully_connected(
               inputs=fc2concat,
               num_outputs=5,
               activation_fn=tf.nn.sigmoid,
               trainable=not is_target_dqn,
               variables_collections=[var_scope_name]
           )
           # The last linear fully connected layer.
           # The bias on the last layer is initialized to some value,
           # normally minus the average episode duration / 2;
           # this way the NN finds the optimum better even when the mean is not 0.
           self.predictions = tf.contrib.layers.fully_connected(
               inputs=fc3,
               num_outputs=1,
               biases_initializer=tf.constant_initializer(bias_average),
               activation_fn=None,
               trainable=not is_target_dqn,
               variables_collections=[var_scope_name],
           )
I think the problem is with action_placeholder: in their code it has shape=[None, action_state_length], while in my code I want something like shape=[None, action_state_length, None], because each entry should hold 3 features for a variable number of actions rather than a single array of 3 features.
Maybe I also need to change the way the fully connected layers are defined. I would appreciate any ideas or helpful links.
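An alternative that might avoid touching the placeholder or the layers is to treat every column as its own row of 3 features, so the data matches the existing shape=[None, action_state_length]. The sketch below is an assumption-laden illustration: flatten_action_states, the owner index and the classifier state length of 5 are hypothetical, and rewards, terminals and targets would need the same repetition.

```python
import numpy as np

def flatten_action_states(classifier_states, action_states):
    """Turn each column of every (n_features, k_i) array into its own row.

    Returns the classifier states repeated once per action, the actions as a
    (sum(k_i), n_features) array, and the minibatch index each row came from.
    """
    counts = np.array([a.shape[1] for a in action_states])
    # Transpose (n_features, k_i) -> (k_i, n_features) and stack all rows.
    flat_actions = np.concatenate([a.T for a in action_states], axis=0)
    # owner[j] is the index of the minibatch entry that row j belongs to.
    owner = np.repeat(np.arange(len(action_states)), counts)
    flat_classifier = np.asarray(classifier_states)[owner]
    return flat_classifier, flat_actions, owner

# Hypothetical minibatch: 3 entries with 8, 10 and 8 actions.
classifier_states = np.zeros((3, 5))  # 5 is an assumed classifier_state_length
action_states = [np.ones((3, 8)), np.ones((3, 10)), np.ones((3, 8))]
flat_cls, flat_act, owner = flatten_action_states(classifier_states, action_states)
print(flat_cls.shape, flat_act.shape)  # (26, 5) (26, 3)
```

The owner index can then be used to group the per-action predictions back into their original minibatch entries; how to combine them within an entry is a separate design decision.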
Thank you.