I am trying to implement Q-learning with an action-value function approximator. I am using OpenAI Gym and the "MountainCar-v0" environment to test my algorithm. My problem is that it does not converge or reach the goal at all.
The approximator works as follows: you feed in the 2 state features, position and velocity, plus one of the 3 actions as a one-hot encoding: 0 -> [1,0,0], 1 -> [0,1,0] and 2 -> [0,0,1]. The output is the action-value approximation Q_approx(s,a) for that specific action.
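Concretely, picking a greedy action means building the 5-dimensional input once per action and calling predict three times; roughly like this (same idea as in the full code below, q_values is just an illustrative helper):

import numpy as np

def q_values(model, state):
    """Query the network once per action and return the 3 approximate Q(s,a) values."""
    one_hot = np.eye(3)          # rows: [1,0,0], [0,1,0], [0,0,1]
    x = np.zeros(5)
    x[0:2] = state               # position, velocity
    q = np.zeros(3)
    for a in range(3):
        x[2:5] = one_hot[a]      # one-hot encoded action
        q[a] = model.predict(x.reshape(1, 5))[0, 0]
    return q

# greedy action: a = np.argmax(q_values(model, obs))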
I know that usually the input is just the state (2 features) and the output layer contains one output per action. The big difference I see is that I have to run the feed-forward pass 3 times (once per action) and take the max, while in the standard implementation you run it once and take the max over the outputs.
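For reference, the standard variant I mean would look roughly like this (just a sketch for comparison, not the code I am running): the network only sees the 2 state features and has 3 outputs, one Q-value per action, so a single forward pass gives all action values:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# Sketch of the usual architecture: state in, one Q-value per action out
std_model = Sequential()
std_model.add(Dense(20, activation="relu", input_dim=2))   # only position and velocity
std_model.add(Dense(10, activation="relu"))
std_model.add(Dense(3))                                     # Q(s,0), Q(s,1), Q(s,2)
std_model.compile(optimizer="rmsprop", loss="mse")

# One forward pass returns all three action values:
# q = std_model.predict(np.asarray(obs).reshape(1, 2))[0]
# a = np.argmax(q)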
Maybe my implementation is just completely wrong and my reasoning is off. I'll paste the code here; it is a mess, but I am just experimenting a bit:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
env = gym.make('MountainCar-v0')
# The mean reward over 20 episodes
mean_rewards = np.zeros(20)
# Holder for the network input features (2 state features + 3-element one-hot action)
features = np.zeros(5)
# Holder for the 3 Q(s,a) values
qa_vals = np.zeros(3)
one_hot = {
    0 : np.asarray([1,0,0]),
    1 : np.asarray([0,1,0]),
    2 : np.asarray([0,0,1])
}
model = Sequential()
model.add(Dense(20, activation="relu", input_dim=5))
model.add(Dense(10, activation="relu"))
model.add(Dense(1))
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['accuracy'])

epsilon_greedy = 0.1
discount = 0.9
batch_size = 16

# Experience replay containing features and target
experience = np.ones((10*300, 5+1))
# Write position in the ring buffer and flag for "buffer has been filled at least once"
fill_index = 0
filled_once = False
# Ring buffer
def add_exp(features, target, index):
    # Wrap around once the end of the buffer is reached
    if index >= experience.shape[0]:
        index = 0
        global filled_once
        filled_once = True
    experience[index, 0:5] = features
    experience[index, 5] = target
    index += 1
    return index
for e in range(0, 100000):
    obs = env.reset()
    old_obs = None
    new_obs = obs
    rewards = 0
    loss = 0
    for t in range(0, 300):
        if old_obs is not None:
            # Find max_a Q(s_(t+1), a) by querying the network once per action
            features[0:2] = new_obs
            for i, pa in enumerate([0, 1, 2]):
                features[2:5] = one_hot[pa]
                qa_vals[i] = model.predict(features.reshape(-1, 5))
            rewards += reward
            target = reward + discount * np.max(qa_vals)
            # Store the transition features of (s_t, a_t) together with its TD target
            features[0:2] = old_obs
            features[2:5] = one_hot[a]
            fill_index = add_exp(features, target, fill_index)
            # Find new action (epsilon-greedy)
            if np.random.random() < epsilon_greedy:
                a = env.action_space.sample()
            else:
                a = np.argmax(qa_vals)
        else:
            a = env.action_space.sample()
        obs, reward, done, info = env.step(a)
        old_obs = new_obs
        new_obs = obs
        if done:
            break
        # Train on a random minibatch from the replay buffer once it has been filled
        if filled_once:
            samples_ids = np.random.choice(experience.shape[0], batch_size)
            loss += model.train_on_batch(experience[samples_ids, 0:5],
                                         experience[samples_ids, 5].reshape(-1))[0]
    mean_rewards[e % 20] = rewards
    print("e = {} and loss = {}".format(e, loss))
    if e % 50 == 0:
        print("e = {} and mean = {}".format(e, mean_rewards.mean()))
Thanks in advance!
There shouldn't be much difference between feeding the actions as inputs to your network and having them as separate outputs of your network. It does make a huge difference if your states are images, for example, because conv nets work very well with images and there would be no obvious way of integrating the actions into the input.
Have you tried the CartPole balancing environment? It is a better way to test whether your model is working correctly.
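If you do, note that CartPole's observation has 4 features and there are 2 discrete actions, so the network's input size changes; a quick sketch:

import gym

# CartPole-v0: 4 observation features, 2 discrete actions,
# and a reward of +1 per time step, so learning progress shows up quickly
env = gym.make('CartPole-v0')
print(env.observation_space.shape)   # (4,)
print(env.action_space.n)            # 2
# With the actions-as-input scheme above, this means input_dim = 4 + 2 = 6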
MountainCar is pretty hard. It gives essentially no reward signal until you reach the top, which often doesn't happen at all. The model will only start learning something useful once you have reached the top at least once. If you are never getting to the top, you should probably increase the time you spend on exploration; in other words, take more random actions, a lot more...
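One common way to do that (a sketch only, names are illustrative and not from your code): start fully random and decay epsilon slowly towards a floor, instead of keeping it fixed at 0.1:

epsilon_start = 1.0     # act completely at random at first
epsilon_min = 0.1
decay_episodes = 5000   # arbitrary choice, tune for your setup

def epsilon_for_episode(e):
    # Linear decay from epsilon_start to epsilon_min over decay_episodes episodes
    frac = min(1.0, e / float(decay_episodes))
    return epsilon_start + frac * (epsilon_min - epsilon_start)

# then inside the episode loop, replace the fixed epsilon_greedy check with:
# if np.random.random() < epsilon_for_episode(e):
#     a = env.action_space.sample()
# else:
#     a = np.argmax(qa_vals)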