Currently, I have a neural network, built in tensorflow that is used to classify time sequence data into one of 6 categories. The network is composed of:
2 fully connected layers -> LSTM unit -> softmax -> output
All layers have regularization in the form of dropout and or layer normalization. In order to speed up the training process, I am using mini-batching of the data, where the mini-batch size = # of categories = 6. Each mini-batch contains exactly one sample for each of the 6 categories, arranged randomly in the mini-batch. Below is the feed-forward code, where x is of shape [batch_size, number of time steps, number of features], and the various get commands are simple definitions for creating standard fully connected layers and LSTM units with regularization.
def getFullyConnected(input ,hidden ,dropout, layer, phase):
weight = tf.Variable(tf.random_normal([input.shape.dims[1].value,hidden]), name="weight_layer"+str(layer))
bias = tf.Variable(tf.random_normal([1]), name="bias_layer"+str(layer))
layer = tf.add(tf.matmul(input, weight), bias)
layer = tf.contrib.layers.batch_norm(layer,
center=True, scale=True,
is_training=phase)
layer = tf.minimum(tf.nn.relu(layer), FLAGS.relu_clip)
layer = tf.nn.dropout(layer, (1.0 - dropout))
return layer
def RNN(x, weights, biases, time_steps):
#shape the input as [batch_size*time_steps, input_depth]
x = tf.reshape(x, [-1,input_depth])
layer1 = getFullyConnected(input=x, hidden=16, dropout=full_drop, layer=1, phase=True)
layer2 = getFullyConnected(input=layer1, hidden=input_depth*3, dropout=full_drop, layer=2, phase=True)
rnn_input = tf.reshape(layer2, [-1,time_steps,input_depth*3])
# 1-layer LSTM with n_hidden units.
LSTM_cell = getLSTMcell(n_hidden)
#generate prediction
outputs, state = tf.nn.dynamic_rnn(LSTM_cell,
rnn_input,
dtype=tf.float32,
time_major=False)
#good old tensorboard saves
tf.summary.histogram('weight', weights['out'])
tf.summary.histogram('bias',biases['out'])
#there are time_steps outputs, but only grab the last output for the classification
return tf.sigmoid(tf.matmul(outputs[:,-1,:], weights['out']) + biases['out'])
Surprisingly, this network trained extremely well giving me about 99.75% accuracy on my test data (which the trained network had never seen). However, it only scored this high when I fed the training data into the network with a mini-batch size the same as during training, 6. If I only fed the training data one sample at a time (mini-batch size = 1), the network was scoring around 60%. What is weird is that, if I train the network with only single samples (mini-batch size = 1), the trained network works perfectly fine with high accuracy once the network is trained. This leads me to the weird conclusion that the network is almost learning to utilize the batch size in its learning, so much so that it becomes dependent on the mini-batch to classify correctly.
Is it a thing for a deep network to become dependent on the size of the mini-batch during training, so much that the final trained network will require input data to have the same mini-batch size just to perform correctly?
All ideas or thoughts would be loved!