I am trying to do an image inpainting task using a transformer model. I randomly mask out some pixels in an image (setting the corresponding positions to 0), then divide it into patches as ViT does. I embed each patch (using a convolution or an FC layer) to get a 1D sequence of tokens and feed that into the decoder part of a transformer (depth = 6) to reconstruct an image of the original size. The loss is the MSE between the generated pixels and the ground-truth pixels, computed only at the masked positions.
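For concreteness, here is a minimal sketch of what I mean (PyTorch; all names and hyperparameters are illustrative, not my exact code, and I use `nn.TransformerEncoderLayer` since a decoder stack without cross-attention reduces to plain self-attention blocks):

```python
import torch
import torch.nn as nn

class MaskedInpaintingViT(nn.Module):
    """Sketch: patchify a corrupted image, run self-attention blocks, reconstruct pixels."""
    def __init__(self, img_size=28, patch_size=4, dim=128, depth=6, heads=4):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding via strided convolution (equivalent to an FC on flattened patches)
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        # Project each token back to a patch of pixels
        self.head = nn.Linear(dim, patch_size * patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos_embed
        tokens = self.blocks(tokens)
        patches = self.head(tokens)                              # (B, N, p*p)
        # Fold the predicted patches back into an image
        p = self.patch_size
        out = patches.reshape(B, H // p, W // p, p, p)
        out = out.permute(0, 1, 3, 2, 4).reshape(B, 1, H, W)
        return out

def masked_mse_step(model, imgs, mask_ratio=0.3):
    # Zero out random pixels, then compute MSE only at the masked positions
    mask = (torch.rand_like(imgs) < mask_ratio).float()  # 1 = masked pixel
    corrupted = imgs * (1.0 - mask)
    recon = model(corrupted)
    return ((recon - imgs) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```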
However, my model is not converging. I have tried varying the learning rate, batch size, and patch size with no success, and the output remains very noisy and full of artifacts. To avoid slow training on large images, I am using the 28x28 images from FashionMNIST (60k images); I thought this dataset would help me verify whether my model setup makes sense before moving to larger images.
It is currently unclear to me why it is not converging. I discussed this with a friend, who said it is because 28x28 is too small and some pixels are masked, so the model cannot learn. But I feel that the simpler distribution of smaller images should be easier to learn, and my mask ratio is not very high compared to what other CV papers use.
My questions:
1. Does image size really affect convergence that much in this case?
2. Apart from image size, what other model-configuration issues could lead to this lack of convergence?
3. Any other suggestions for my task?