Having problems while doing multiclass classification with tensorflow

https://colab.research.google.com/drive/1EdCL6YXCAvKqpEzgX8zCqWv51Yum2PLO?usp=sharing

Hello,

In the notebook above, I'm trying to identify 5 different types of restorations on dental X-rays with TensorFlow. I'm following the steps in the official documentation, but now I'm kind of stuck and need help. Here are my questions:

1- I have my data on my local disk. The TF example linked above downloads its data from a different repository. When I want to test my own images, is there any way other than the code below?

import numpy as np
from keras.preprocessing import image

from google.colab import files
# Upload images from the local disk into the Colab session
uploaded = files.upload()

# Predict each uploaded image (model is the trained model from the notebook)
for fn in uploaded.keys():
  path = fn
  img = image.load_img(path, target_size=(180, 180))
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)  # add a batch dimension: (1, 180, 180, 3)

  classes = model.predict(x)
  print(fn)
  print(classes)

I'm asking because the official documentation only shows how to test images one at a time, like this:

img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0)  # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)

2- I'm using the image_dataset_from_directory method, so I don't have a separate validation directory. Is that OK, or should I use ImageDataGenerator? For testing, I picked some images by hand at random from all 5 categories and put them in my test folder, which has 5 subfolders, one per category. Is this what I am supposed to do for prediction, i.e. also separate the test data into different folders? If yes, how can I load all 5 folders at once at test time?

3- I'm also supposed to create a confusion matrix, but I couldn't understand how to apply it to my code. Some people say to use scikit-learn's confusion matrix, but then I have to define y_true and y_pred values, which I cannot fit into this code. Am I supposed to evaluate 5 different confusion matrices for 5 different predictions, and how?

4- Sometimes I observe that the validation accuracy starts much higher than the training accuracy. Is this unusual? After 3-4 epochs, the training accuracy catches up with the validation accuracy and continues in a more balanced way. I thought this should not be happening. Is everything all right?

5- Final question: why does the first epoch take so much longer than the other epochs? In my setup, the first epoch takes about 30-40 minutes, and every other epoch takes only about a minute. Is there a way to fix this, or does it always happen this way?

Thanks.

1 answer

MichaelJanz (best answer)

I am no expert in image processing with TF, but let me try to answer as much as I can:

1

I don't really understand this question, because you are using image_dataset_from_directory, which should handle the file loading for you. What you are doing there looks good to me so far.
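For the part of the question about scoring many images at once: model.predict accepts a whole batch, or a whole tf.data.Dataset such as one built with image_dataset_from_directory on the test folder, and returns one row of raw scores per image. The following is a minimal sketch of turning such an array into labels and confidences, assuming the model outputs logits as in the tutorial. The class names and the logits array are made up for illustration; in the notebook the logits would come from the trained model.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, stabilised by subtracting each row's maximum."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def decode_predictions(logits, class_names):
    """Return (class_name, confidence_percent) for each row of logits."""
    probs = softmax(np.asarray(logits, dtype=float))
    best = probs.argmax(axis=1)
    return [(class_names[i], 100.0 * probs[row, i]) for row, i in enumerate(best)]

# Hypothetical class names and logits (3 images, 5 classes).
# In the notebook, something like model.predict(test_ds) would supply the logits.
class_names = ["class_a", "class_b", "class_c", "class_d", "class_e"]
logits = np.array([
    [2.0, 0.1, 0.0, -1.0, 0.3],
    [0.0, 0.0, 3.0, 0.0, 0.0],
    [0.5, 0.2, 0.1, 0.1, 4.0],
])

for name, confidence in decode_predictions(logits, class_names):
    print("{} ({:.2f} percent confidence)".format(name, confidence))
```

This is the same softmax and argmax step the documentation applies to a single image, just done for all rows at once.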

2

Let me cite tf.keras.preprocessing.image_dataset_from_directory:

Then calling image_dataset_from_directory(main_directory, labels='inferred') will return a tf.data.Dataset that yields batches of images from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).

And ImageDataGenerator:

Generate batches of tensor image data with real-time data augmentation. The data will be looped over (in batches).

As your data is handpicked, there is no need for ImageDataGenerator, as image_dataset_from_directory returns what you want. If you want test and validation data (which you should have), you can use the tf.data.Dataset functions to split the data into train, validation and test sets. This can be a bit clunky, but the time spent learning tf.data.Dataset is well spent.
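One way to do such a split with tf.data.Dataset functions is take() and skip(). The sketch below is only an illustration on a toy dataset; split_dataset and val_fraction are hypothetical names, and it assumes the dataset yields its batches in the same order on every pass (a dataset that reshuffles on each iteration would mix the two subsets). image_dataset_from_directory also has validation_split, subset and seed parameters, which avoid the manual split.

```python
import tensorflow as tf

def split_dataset(ds, val_fraction=0.2):
    """Split a finite dataset of batches into (train, validation) parts.

    Assumes a fixed iteration order; otherwise the subsets can overlap.
    """
    n_batches = int(ds.cardinality().numpy())
    n_val = max(1, int(n_batches * val_fraction))
    val_ds = ds.take(n_val)    # first n_val batches
    train_ds = ds.skip(n_val)  # everything after them
    return train_ds, val_ds

# Toy stand-in for an image dataset: 100 numbers in batches of 10.
toy_ds = tf.data.Dataset.range(100).batch(10)
train_ds, val_ds = split_dataset(toy_ds, val_fraction=0.2)

print("train batches:", int(train_ds.cardinality().numpy()))
print("validation batches:", int(val_ds.cardinality().numpy()))
```

The same function can be applied a second time to carve a test set out of the training part.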

3

The confusion matrix lets you derive the precision, recall and F1 score. It is most often shown for binary classification, but it is not limited to that: for your 5 classes you get a single 5x5 matrix (in scikit-learn's convention, rows are the true classes and columns are the predicted classes), not 5 separate matrices. You can also rely on the metrics TensorFlow provides. TensorFlow can calculate precision and recall for you as metrics (recent versions also include an F1 score metric), so if you ask me, use them.
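To make the multiclass case concrete, here is a minimal NumPy sketch of a 5-class confusion matrix and the per-class precision, recall and F1 derived from it. The labels are made up. In the notebook, y_true would be the integer labels collected from the test dataset and y_pred the argmax of the model's predictions, taken in the same order (so the test dataset should not be shuffled). scikit-learn's confusion_matrix and tf.math.confusion_matrix use the same row/column layout.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

def per_class_metrics(cm):
    """Precision, recall and F1 for every class (0 where undefined)."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class really occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Made-up labels for 10 test images and 5 classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3, 4, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=5)
precision, recall, f1 = per_class_metrics(cm)
print(cm)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```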

4

Depending on how the data is shuffled and structured, this can be normal. If the training data contains more special cases, the model will have a harder time predicting them correctly. If the validation data contains more simple cases, the model will do better there, which gives you a higher accuracy at that point. It is indeed an indicator that the classes in your training and validation data might not be equally distributed.
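A quick way to check the class distribution is to count the labels in each subset. This is a small sketch with made-up label arrays; class_fractions is a hypothetical helper, and in the notebook the labels would be collected from the training and validation datasets.

```python
import numpy as np

def class_fractions(labels, num_classes):
    """Fraction of samples belonging to each class."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()

# Made-up integer labels: an imbalanced training set and a balanced validation set.
train_labels = np.array([0] * 50 + [1] * 20 + [2] * 10 + [3] * 10 + [4] * 10)
val_labels = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5)

train_frac = class_fractions(train_labels, num_classes=5)
val_frac = class_fractions(val_labels, num_classes=5)
print("train:     ", train_frac)
print("validation:", val_frac)
```

Large differences between the two rows suggest the subsets are not drawn from the same distribution.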

5

tf.data.Dataset loads the data lazily, when it is needed. This means the files are not read into memory until training has started, which results in a very long first epoch (all images have to be loaded first) and much shorter later epochs (the images are already there, for example when the dataset is cached with .cache()). You can verify this by checking the GPU usage of your machine during the first epoch: it should often be idle or very low.

To fix this, you can use .prefetch(z) on your dataset variable. prefetch() makes the dataset prepare the next z elements while the GPU is already doing calculations. This might speed up the first epoch.
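As a rough sketch of how this looks in code, the helper below (configure_for_performance is a hypothetical name) chains cache() and prefetch() on a toy dataset. It assumes the data fits in memory; tf.data.AUTOTUNE lets TensorFlow choose the prefetch buffer size instead of a fixed z. Prefetching overlaps loading with computation but does not remove the cost of reading every file once, so a slow disk or network drive can still dominate the first epoch.

```python
import tensorflow as tf

def configure_for_performance(ds):
    """Cache elements after the first pass and prefetch upcoming batches."""
    ds = ds.cache()                      # keep loaded elements in memory
    return ds.prefetch(tf.data.AUTOTUNE)  # prepare next batches during training

# Toy stand-in for an image dataset: 6 numbers in batches of 2.
toy_ds = tf.data.Dataset.range(6).batch(2)
fast_ds = configure_for_performance(toy_ds)

for batch in fast_ds:
    print(batch.numpy())
```

The contents of the dataset are unchanged; only the way elements are produced differs.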