I am using DQN for a resource-allocation problem where the agent should assign arriving requests to the best virtual machine. I am modifying a CartPole code as follows:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import os

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)   # replay memory
        self.gamma = 0.95                  # discount factor
        self.epsilon = 1.0                 # initial exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Q-network: maps a state vector to one Q-value per action
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # store one transition in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        # sample a minibatch of stored transitions and train on it
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Q-learning target: r + gamma * max_a' Q(s', a')
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
The CartPole states that form the inputs of the Q-network are given by the environment:

Num   Observation            Min        Max
0     Cart Position          -2.4       2.4
1     Cart Velocity          -Inf       Inf
2     Pole Angle             ~ -41.8°   ~ 41.8°
3     Pole Velocity At Tip   -Inf       Inf
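For reference, this is how those four values reach the network in the original CartPole code (a minimal sketch, assuming the classic gym API where reset() returns only the observation; 'CartPole-v1' and the reshape to (1, 4) are the usual conventions from that example):

env = gym.make('CartPole-v1')
state = env.reset()                  # ndarray of 4 floats: position, velocity, angle, tip velocity
state = np.reshape(state, [1, 4])    # add a batch dimension of 1, as model.predict expects
# agent.act(state) then returns the index of the action with the highest predicted Q-value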
My question is: what are the inputs of the Q-network in my code? The agent should take the best possible action based on the size of the arriving request, but this value is not given by the environment. Should I feed the Q-network with this value, the request size, as an input?
The Deep Q-Network is fed from the replay memory, which is filled in this part of the code:

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
The dynamics of this system, as described in the original DeepMind paper, are that you interact with the environment, store each transition in the replay memory, and then sample from that memory for the training step. In the lines above you are storing these experiences.
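That interaction step is the usual CartPole training loop (a minimal sketch continuing from the class above, assuming the classic gym API; the episode count and batch_size are the usual conventions from the CartPole example, not taken from your code):

env = gym.make('CartPole-v1')
agent = DQNAgent(state_size=4, action_size=2)
batch_size = 32

for episode in range(1000):
    state = np.reshape(env.reset(), [1, 4])
    done = False
    while not done:
        action = agent.act(state)                    # interact with the environment
        next_state, reward, done, info = env.step(action)
        next_state = np.reshape(next_state, [1, 4])
        agent.remember(state, action, reward, next_state, done)  # store the transition
        state = next_state
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)                     # training step from the replay memory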
Basically, the network takes the state as input and outputs the Q-values for each action. In your code there is no interaction with an environment, and that interaction is where the transitions (experiences) that feed the replay memory come from. So the size of the arriving request has to be extracted from your environment and represented as part of the state; if you cannot represent some piece of information as a state, the network cannot base its decisions on it.
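Concretely, for your resource-allocation setting you could build the state yourself from whatever your simulator exposes (a minimal sketch; get_state, request_size, and the per-VM loads are hypothetical names for whatever your environment actually tracks):

import numpy as np

def get_state(request_size, vm_loads):
    # state = [size of the arriving request, current load of each VM]
    state = np.array([request_size] + list(vm_loads), dtype=np.float32)
    return state.reshape(1, -1)          # shape (1, state_size), as model.predict expects

# e.g. with 3 VMs: state_size = 4, action_size = 3 (one action per VM)
agent = DQNAgent(state_size=4, action_size=3)
state = get_state(request_size=128.0, vm_loads=[0.2, 0.7, 0.5])
action = agent.act(state)                # index of the VM to assign the request to

With a state like this, the request size is an input of the Q-network, and the action space is one "assign to VM i" action per virtual machine.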