Unexpected result with tempfiles and hashes


Why does this code produce the same hash for every file?

import asyncio
import wget
import hashlib
import os
import tempfile
import zipfile
from multiprocessing import Pool

async def main() -> None:
    url = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"
    if not os.path.isdir('tempdir'):
        os.makedirs('tempdir')
    os.chdir('tempdir')
    wget.download(url)
    zipfile.ZipFile("project-configuration-master.zip").extractall('.')
    os.chdir("project-configuration/nitpick")

async def result():
    for i in range(2):
        task = asyncio.create_task(main())
        await task
        os.chdir("../..")
        for f in os.listdir():
            print(hashlib.sha256(b'{f}'))
        os.remove("tempfile")

asyncio.run(result())

One pass of the loop hashes 8 files, but the script stops with an error after printing 22 file hashes, when it should print 24.

Answer by jsbueno:

While working on an example to clean up the directory handling, I found your main perceived problem: you are not hashing the file contents. b'{f}' is a plain bytes literal, not an f-string, so hashlib.sha256 receives the same four bytes on every iteration; and even with the name interpolated you would be hashing the file name, where you most likely want to hash the file contents. My example below underscores that and, since I switched to the more modern pathlib, also uses it to read each file's contents.
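The difference is easy to check in isolation. A minimal sketch, assuming two hypothetical sample files created in a throwaway directory (the file names and contents are made up for the demonstration):

```python
import hashlib
import tempfile
from pathlib import Path

def literal_digest(name: str) -> str:
    # b'{f}' is NOT an f-string: nothing is interpolated, so the argument
    # is ignored and the same four bytes are hashed on every call.
    return hashlib.sha256(b'{f}').hexdigest()

def content_digest(path: Path) -> str:
    # Hash the bytes stored in the file instead.
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files, created only for this demonstration.
    a = Path(tmp) / "a.toml"
    b = Path(tmp) / "b.toml"
    a.write_bytes(b"first file")
    b.write_bytes(b"second file")

    same = literal_digest(a.name) == literal_digest(b.name)
    different = content_digest(a) != content_digest(b)

print(same, different)  # True True
```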

Now, for my initial findings when first glancing at your script:

You are using os.chdir, which changes global application state. It looks like you expect each task to have its own independent working directory, so that os.chdir would change into the same directories on every call.

It does not: the first call switches the whole application into your zip-extracted subdirectories, and the second call to main takes place when the app is already inside that directory.

If it works at all (I'd have to run it, or track the results line by line), you will have all your results deeply nested in recursive tempdir/project-configuration/nitpick directories.
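The nesting can be reproduced without downloading anything. A minimal sketch, assuming a hypothetical helper enter_relative that mimics what main does (create a relative directory, then chdir into it), run inside a throwaway base directory:

```python
import os
import tempfile
from pathlib import Path

def enter_relative(subdir: str) -> Path:
    # Mimics the question's main(): create a directory relative to the
    # current working directory, then chdir into it.
    os.makedirs(subdir, exist_ok=True)
    os.chdir(subdir)
    return Path.cwd()

start = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    os.chdir(base)
    try:
        first = enter_relative("tempdir")
        second = enter_relative("tempdir")  # runs from inside the first "tempdir"
    finally:
        os.chdir(start)  # restore before the temporary directory is removed

print(first.relative_to(base))   # tempdir
print(second.relative_to(base))  # tempdir/tempdir on POSIX
```

The second call does not land in the same place as the first, because the working directory it starts from has already moved.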

Second, and perhaps unrelated to your question: this code is not concurrent at all. You declared main with async def, but nowhere inside it is there another asynchronous call (marked by the await keyword), so the function runs to completion before the asyncio loop can switch to another task.
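A minimal sketch of that behaviour, using two hypothetical coroutines that only record the order in which they run. The one without an await finishes entirely before the other task starts; the one that awaits asyncio.sleep(0) lets the loop alternate between tasks:

```python
import asyncio

events = []

async def no_await(name):
    # No await inside: once started, this runs to completion.
    for step in range(2):
        events.append(f"{name}:{step}")

async def with_await(name):
    for step in range(2):
        events.append(f"{name}:{step}")
        await asyncio.sleep(0)  # hands control back to the event loop

async def run(worker):
    events.clear()
    await asyncio.gather(worker("a"), worker("b"))
    return list(events)

blocking_order = asyncio.run(run(no_await))
cooperative_order = asyncio.run(run(with_await))
print(blocking_order)     # ['a:0', 'a:1', 'b:0', 'b:1']
print(cooperative_order)  # ['a:0', 'b:0', 'a:1', 'b:1']
```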

The wget call would be the natural one to make asynchronous here. Since it is a third-party library, check its docs for an equivalent async call. Otherwise, you can run it in another thread with asyncio's loop.run_in_executor, which also makes the code concurrent.
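A minimal sketch of the executor approach, assuming a hypothetical blocking_download function that only sleeps in place of a real blocking call such as wget.download (the URLs are placeholders and are never contacted):

```python
import asyncio
import time

def blocking_download(url: str) -> str:
    # Stand-in for a blocking call such as wget.download(); it only sleeps.
    time.sleep(0.3)
    return url

async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    jobs = [loop.run_in_executor(None, blocking_download, u) for u in urls]
    return await asyncio.gather(*jobs)

urls = ["https://example.invalid/a.zip", "https://example.invalid/b.zip"]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # both finish in roughly one sleep, not two
```

On Python 3.9 and later, asyncio.to_thread(blocking_download, url) is a shorter way to express the same idea.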

Given the structure here, I suppose you adapted code that once used multiprocessing: if main ran in a separate process for each task, each one would have its own working directory, and the wget calls would be parallelized by the OS scheduling the processes, so everything would indeed work. That is not the case for asyncio code.

So, touching only these two parts, here is how your code could look. First: never use os.chdir. It breaks in anything more complex than a ten-line script (and here it could break even earlier), because it changes a single piece of process-wide global state that every other part of the program shares. Always work with relative paths and join path segments instead. The traditional API for that, os.path.join, is verbose, but since Python 3.4 the standard library has pathlib.Path, which supports the / operator for joining paths properly.

import asyncio
import hashlib
import shutil
import zipfile
from functools import partial
from pathlib import Path

import wget  # third-party package


TEMPDIR = Path("tempdir")
CONFIGDIR = "project-configuration/nitpick"

async def main() -> None:
    url = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"
    if not TEMPDIR.is_dir():    # is_dir is a method from Path objects
        TEMPDIR.mkdir()
    # wget is a small utility that was never updated to understand pathlib
    # objects, but it accepts an output directory as a string, which avoids
    # `chdir`:
    #     wget.download(url, out=str(TEMPDIR))
    # That call blocks, so hand it to a worker thread to keep the event loop free:
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, partial(wget.download, url, out=str(TEMPDIR)))
    # Other tasks can now run concurrently while the download is in progress.

    with zipfile.ZipFile(TEMPDIR / "project-configuration-master.zip") as archive:
        archive.extractall(TEMPDIR)
    # os.chdir("project-configuration/nitpick")  # as discussed: do not do this.
    # Instead, build every path from this point on as
    # `TEMPDIR / CONFIGDIR / "filename"`. Legacy functions may need it converted
    # with `str(TEMPDIR / CONFIGDIR / "filename")`, but `open` and the other
    # Python file operations will work just fine.
    # ...

async def result():
    for i in range(2):
        task = asyncio.create_task(main())
        await task  # Awaiting each task right away runs everything in sequence,
                    # with no parallelism. If you do parallelize, parametrize the
                    # directory: as is, every download uses the same hardcoded
                    # "tempdir".
        # os.chdir("../..")  # No directory was changed in main(), so nothing to undo
        for f in TEMPDIR.iterdir():  # yields every entry in the directory
            if not f.is_file():
                continue  # extracted sub-directories cannot be read as bytes
            # print(hashlib.sha256(b'{f}'))  # Your main problem: b'{f}' is a constant bytes literal, never the file contents
            hash_ = hashlib.sha256(f.read_bytes())  # hash the file _contents_
            print(f.name, hash_.hexdigest())  # print the digest, not the repr of the hash object
        shutil.rmtree(TEMPDIR)

asyncio.run(result())
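If the downloads are ever run in parallel, each task needs its own directory, as the comment in result notes. A minimal sketch of that idea, assuming a hypothetical fake_download function that writes one file in place of the real download-and-extract step, and using tempfile.TemporaryDirectory so each task's directory is created and removed automatically:

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path

def fake_download(workdir: Path, payload: bytes) -> Path:
    # Hypothetical stand-in for the download-and-extract step: it just
    # writes one file into the directory it is given.
    target = workdir / "payload.bin"
    target.write_bytes(payload)
    return target

async def process(payload: bytes) -> dict:
    loop = asyncio.get_running_loop()
    # Each task gets its own directory, removed when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        await loop.run_in_executor(None, fake_download, workdir, payload)
        return {
            str(f.relative_to(workdir)): hashlib.sha256(f.read_bytes()).hexdigest()
            for f in sorted(workdir.rglob("*"))
            if f.is_file()
        }

async def main():
    return await asyncio.gather(process(b"one"), process(b"two"))

digests = asyncio.run(main())
print(digests)
```

Because no task touches the working directory or a shared hardcoded path, the two runs cannot interfere with each other.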