Checking whether every image file (.JPG) in folder "A" has an annotation file (.XML) in folder "B"


I have a very large dataset of images and their annotations, saved in two separate folders. However, not all images have an annotation file. How can I write Python code that checks the image files (.JPG) in folder "A", deletes an image if there is no annotation file (.xml) with the same name, and does nothing if the annotation file exists?

I have written the following code, following @Gabip's comment below: (code screenshot not available)

How can I improve this code?


There are 2 answers

Gabio (best answer)

Try this. It collects the annotation names into a set and deletes every image whose name is not in that set:

from os import listdir, remove
from os.path import isfile, join, splitext

images_path = "full/path/to/folder_a"
annotations_path = "full/path/to/folder_b"


# return the names of all files in full_path whose extension is ext (case-insensitive)
def get_files_names_with_extension(full_path, ext):
    return [f for f in listdir(full_path) if isfile(join(full_path, f)) and f.lower().endswith(".{}".format(ext))]


images = get_files_names_with_extension(images_path, "jpg")
# splitext drops only the final extension, so "img.v2.jpg" still matches "img.v2.xml"
annotations = {splitext(f)[0] for f in get_files_names_with_extension(annotations_path, "xml")}

# delete every image that has no annotation with the same base name
for img in images:
    if splitext(img)[0] not in annotations:
        remove(join(images_path, img))
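
Because deleting files cannot be undone, it may be worth listing what would be removed before removing anything. The following is only a minimal sketch under the same assumptions as the code above (flat folders, matching by base name); `find_unannotated` and its `dry_run` flag are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def find_unannotated(images_dir, annotations_dir, dry_run=True):
    """Return the images in images_dir that have no same-named .xml file.

    Hypothetical helper: with dry_run=False the returned images are also deleted.
    """
    # Base names (final extension removed) of every annotation file.
    annotated = {
        p.stem
        for p in Path(annotations_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".xml"
    }
    # Images whose base name does not appear among the annotations.
    orphans = sorted(
        p
        for p in Path(images_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".jpg" and p.stem not in annotated
    )
    if not dry_run:
        for p in orphans:
            p.unlink()
    return orphans
```

Calling `find_unannotated(images_path, annotations_path)` first and inspecting the returned list, then repeating the call with `dry_run=False`, keeps the destructive step explicit.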

Diego R. Moraes

I was having the same problem. I made some adaptations to your suggestion. Now:

  • the number of IMAGEs and XMLs is displayed
  • images are compared with XMLs
  • XMLs are compared with images
  • instead of erasing inconsistencies, lists are actually created with the names of the missing files

CHECK: (IMGs x XMLs) and (XMLs x IMGs)

from os import listdir
from os.path import isfile, join

images_path = "full/path/to/folder_a"
annotations_path = "full/path/to/folder_b"


# return a list of all files in the "full_path" directory with the "ext" extension
def get_files_names_with_extension(full_path, ext):
    return [f for f in listdir(full_path) if isfile(join(full_path, f)) and f.lower().endswith(".{}".format(ext))]

# retrieve the names of the IMGs and XMLs WITHOUT EXTENSION (makes the comparison easier)
# rsplit(".", 1) removes only the final extension, so dots inside a name are kept
images = set([f.rsplit(".", 1)[0] for f in get_files_names_with_extension(images_path, "jpg")])
annotations = set([f.rsplit(".", 1)[0] for f in get_files_names_with_extension(annotations_path, "xml")])
print('='*30)
print(f'number of IMGs = {len(images)}')
print(f'number of XMLs = {len(annotations)}')

# build a list of all IMGs that do not have a corresponding XML
print('='*30)
list_error_img = []
for img in images:
    if img not in annotations:
        list_error_img.append(img)
if not list_error_img:
    print("OK, every IMG has its XML")
else:
    print("ERROR: IMGs that do not have XML")
    print(list_error_img)

# build a list of all XMLs that do not have a corresponding IMG
print('='*30)
list_error_xml = []
for ann in annotations:
    if ann not in images:
        list_error_xml.append(ann)
if not list_error_xml:
    print("OK, every XML has its IMG")
else:
    print("ERROR: XMLs that do not have IMG")
    print(list_error_xml)
print('='*30)
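
The two loops above amount to two set differences, so the same check can be written more compactly. This is only a sketch of that idea, not a replacement for the report above; `stems` and `compare_folders` are hypothetical helper names, and the folders are assumed to be flat, as in the code above.

```python
from pathlib import Path


def stems(folder, ext):
    """Base names (final extension removed) of the files in folder ending in ext."""
    return {
        p.stem
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == "." + ext
    }


def compare_folders(images_dir, annotations_dir):
    """Return (images without XML, XMLs without image) as two sorted lists."""
    images = stems(images_dir, "jpg")
    annotations = stems(annotations_dir, "xml")
    # Set difference keeps the names present on one side only.
    return sorted(images - annotations), sorted(annotations - images)
```

For example, `missing_xml, missing_img = compare_folders(images_path, annotations_path)` gives the same two lists as `list_error_img` and `list_error_xml`, in sorted order.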