Python program to produce dictionary of file extensions and sizes

I am trying to write a Python program that searches through a directory of files and builds a dictionary whose keys are the file extensions found in the directory. Each value should be a list containing the number of times that extension appears, the size of the largest file with that extension, the size of the smallest, and the average size of files with that extension.

I have written the following so far:

for root, dirs, files in os.walk('.'):
        contents={}
        for name in files:
            size=(os.path.getsize(name))
            title, extension=os.path.splitext(name)
            if extension not in contents:
                contents[extension]=[1, size, size, size]
            else:
                contents[extension][0]=contents[extension][0]+1
                contents[extension][3]=contents[extension][3]+size
                if size>=contents[extension][1]:
                    contents[extension][1]=size
                elif size<contents[extension][2]:
                    contents[extension][2]=size
        contents[extension][3]=contents[extension][3]/contents[extension][0]
        print(contents)

If I import os and use os.chdir() to enter the directory I want to explore, this script partly works: it returns a dictionary whose keys are the extensions in the directory, and whose values are lists that correctly give the number of times that extension appears, the size of the largest file with that extension, and the size of the smallest. The problem is the average: it is calculated correctly for one extension, but for the others it is wrong, and not in any consistent way.

Any advice for fixing this? I'd like the dictionary to show the proper averages in each case. I'm new to Python, and programming, and am clearly missing something!

Thanks in advance.

There are 3 answers

Answer by maxymoo (best answer)

In your last step,

contents[extension][3]=contents[extension][3]/contents[extension][0]

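you're only performing this for a single extension. That line runs after the inner loop has finished, so extension still holds whatever the last file's extension was, and every other entry keeps its running total instead of an average. You need to loop through all your extensions: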

for extension in contents:
    contents[extension][3]=contents[extension][3]/contents[extension][0]
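
To see the effect without touching the file system, here is a minimal sketch using made-up (extension, size) pairs. The summarize helper and the sample data are hypothetical, not part of the question; the point is only that the division happens once per extension after all totals are accumulated. It assumes Python 3, where / is true division.

```python
def summarize(files):
    """files: iterable of (extension, size) pairs.

    Returns {extension: [count, largest, smallest, average]}.
    """
    contents = {}
    for extension, size in files:
        if extension not in contents:
            # The fourth slot starts out as a running total.
            contents[extension] = [1, size, size, size]
        else:
            entry = contents[extension]
            entry[0] += 1
            entry[1] = max(entry[1], size)
            entry[2] = min(entry[2], size)
            entry[3] += size
    # Divide every running total by its count, not just the last one seen.
    for entry in contents.values():
        entry[3] = entry[3] / entry[0]
    return contents


# Hypothetical sample data: two .txt files and three .py files.
sample = [(".txt", 10), (".py", 40), (".txt", 30), (".py", 20), (".py", 60)]
result = summarize(sample)
print(result)
```

With the division left outside the final loop, only the extension of the last pair would be averaged, which matches the symptom described in the question.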

Answer by alexis

One thing that's certainly a problem is that to get the size of a file, you need to use the correct relative path. When os.walk() recurses into a subdirectory, the relative path is root+"/"+name -- not just name. So you should be getting the size like this:

size=os.path.getsize(root+"/"+name)

(Your variable root is not actually the "root" of the directory tree; it is each directory whose files are being listed in files.)

Will this fix the problem? Who knows. The way your code is now it should be raising an exception, so either you don't have any subdirectories or you are not showing us your complete code.
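
As a rough illustration of the path issue, the sketch below builds a throwaway directory tree and sizes every file in it. It uses os.path.join(root, name), which is a portable alternative to root+"/"+name; the file_sizes helper and the file names are made up for the demo.

```python
import os
import tempfile


def file_sizes(top):
    """Map each file's path (relative to top) to its size in bytes."""
    sizes = {}
    for root, dirs, files in os.walk(top):
        for name in files:
            # root is the directory currently being listed, so join it with name.
            path = os.path.join(root, name)
            sizes[os.path.relpath(path, top)] = os.path.getsize(path)
    return sizes


# Demo on a temporary tree with one file at the top and one in a subdirectory.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    with open(os.path.join(top, "a.txt"), "wb") as f:
        f.write(b"12345")
    with open(os.path.join(top, "sub", "b.txt"), "wb") as f:
        f.write(b"123")
    demo = file_sizes(top)
    print(demo)
```

Calling os.path.getsize(name) alone on the nested file would look for b.txt in the current working directory rather than in sub, which is why the bare name fails once subdirectories are involved.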

Answer by Pedro Muñoz

Try:

import os

for root, dirs, files in os.walk('.'):
    contents = {}  # one dictionary per directory visited
    for name in files:
        # join with root so files in subdirectories are found
        size = os.path.getsize(os.path.join(root, name))
        title, extension = os.path.splitext(name)
        if extension not in contents:
            # [count, largest, smallest, running total]
            contents[extension] = [1, size, size, size]
        else:
            contents[extension][0] += 1
            contents[extension][3] += size
            if size > contents[extension][1]:
                contents[extension][1] = size
            elif size < contents[extension][2]:
                contents[extension][2] = size

    # turn every running total into an average
    for k in contents:
        contents[k][3] = contents[k][3] / float(contents[k][0])

    print(contents)

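You are calculating the average for only one of the extensions: the last one processed.

Also use float. In Python 2, dividing one integer by another with / truncates the result, so the average would not be exact. In Python 3, / always performs true division, so the float() call is harmless but no longer required.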
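
A quick check of the division behaviour, assuming Python 3 (the numbers are arbitrary):

```python
total, count = 7, 2

true_avg = total / count             # true division in Python 3
floor_avg = total // count           # floor division, what Python 2's / did for ints
portable_avg = total / float(count)  # same result under Python 2 and Python 3

print(true_avg, floor_avg, portable_avg)
```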