How to train single-image depth estimation on the KITTI dataset with a masking method


I'm studying deep learning (supervised learning) to estimate depth maps from monocular images, and I'm currently using the KITTI dataset. The RGB input images come from the KITTI raw data, and the data from the following link is used as ground truth.

I designed a simple encoder-decoder network and trained a model with it, but the results are not good, so I have been trying various approaches.

While searching for various methods, I found that, because the ground truth contains many invalid areas (i.e., values that cannot be used, as shown in the image below), training should use a mask so that only the valid areas contribute to the loss.

[ground-truth image]
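For reference, I am assuming here that the ground truth is the KITTI annotated depth maps, which are 16-bit PNGs where the depth in metres is the pixel value divided by 256 and a value of 0 marks pixels without a measurement; a minimal loading sketch (the helper name is illustrative) would be:

import numpy as np
from PIL import Image

def load_kitti_depth(png_path):
    # 16-bit PNG: depth in metres = pixel value / 256.0, 0 = no LiDAR measurement
    depth_png = np.array(Image.open(png_path), dtype=np.uint16)
    depth = depth_png.astype(np.float32) / 256.0
    valid_mask = depth > 0
    return depth, valid_mask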

So I trained with masking, but I am curious why I keep getting the result below.

[training result image: training-result_2]

And this is the training part of my code. How can I fix this problem?

for epoch in range(num_epoch):
    model.train()  ### train ###
    for batch_idx, samples in enumerate(tqdm(train_loader)):
        x_train = samples['RGB'].to(device)
        y_train = samples['groundtruth'].to(device)

        pred_depth = model(x_train)

        # Keep only pixels where the ground truth is valid (non-zero).
        valid_mask = y_train != 0     #### Here is masking

        valid_gt_depth = y_train[valid_mask]
        valid_pred_depth = pred_depth[valid_mask]

        loss = loss_RMSE(valid_pred_depth, valid_gt_depth)

        # Backward pass and parameter update (assumed to follow in the full loop,
        # not shown in the original snippet).
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
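For completeness, loss_RMSE is not defined in the snippet above; a minimal version consistent with the already-masked, flattened tensors (this is an assumption about its definition) would be:

import torch

def loss_RMSE(pred, target):
    # Both inputs are 1-D tensors containing only the valid pixels,
    # so this is the RMSE over valid ground-truth values only.
    return torch.sqrt(torch.mean((pred - target) ** 2))

Because pred_depth and y_train are indexed with valid_mask before the loss is computed, the invalid (zero) ground-truth pixels contribute nothing to the gradient.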

1 Answer

Akbar Shah answered:

As far as I can understand, you are trying to estimate depth from an RGB image as input. This is an ill-posed problem, since the same input image can correspond to multiple plausible depth maps. You would need to integrate certain techniques to estimate accurate depth from RGB images instead of simply taking an L1 or L2 loss between the predicted depth and the ground-truth depth.

I would suggest going through some papers on estimating depth from single images, such as Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, where a first network estimates the global structure of the given image and a second network refines the local scene information. Instead of a simple RMSE loss, as you used, they use a scale-invariant error function that measures the relationships between pairs of points; a sketch of that loss is shown below.
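A minimal PyTorch sketch of that scale-invariant error, adapted to the valid-pixel masking used in your training loop (lambda = 0.5 as in the paper; the function name and the epsilon clamp are illustrative additions):

import torch

def scale_invariant_loss(pred, target, valid_mask, lam=0.5, eps=1e-6):
    # d_i = log(pred_i) - log(gt_i), computed over valid pixels only
    d = torch.log(pred[valid_mask].clamp(min=eps)) - torch.log(target[valid_mask].clamp(min=eps))
    n = d.numel()
    # (1/n) * sum(d_i^2) - (lambda/n^2) * (sum(d_i))^2
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)

The second term rewards predictions whose log-depth errors all point in the same direction, which makes the loss insensitive to a global scale offset between prediction and ground truth.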