Why does this code produce the same hash for every file?
import asyncio
import wget
import hashlib
import os
import tempfile
import zipfile
from multiprocessing import Pool


async def main() -> None:
    url = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"
    if not os.path.isdir('tempdir'):
        os.makedirs('tempdir')
    os.chdir('tempdir')
    wget.download(url)
    zipfile.ZipFile("project-configuration-master.zip").extractall('.')
    os.chdir("project-configuration/nitpick")


async def result():
    for i in range(2):
        task = asyncio.create_task(main())
        await task
        os.chdir("../..")
        for f in os.listdir():
            print(hashlib.sha256(b'{f}'))
        os.remove("tempfile")


asyncio.run(result())
One pass of the loop hashes 8 files, but the script ends with an error after printing only 22 of the expected 24 file hashes.
While working on an example that would clean up the directory handling, I found the cause of your main problem: you are passing the file names to hashlib.sha256, when you most likely want to pass the file contents. In fact it is worse than that: b'{f}' has no f prefix (and bytes literals cannot be f-strings anyway), so you are hashing the same constant three bytes {f} on every iteration, which is exactly why every file gets the same hash. My example below underscores that; and since I switched to the more modern pathlib, it also uses pathlib's functionality to read each file's contents.
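To make the contrast concrete, here is a minimal sketch (some_file.txt is just a placeholder name):

import hashlib
from pathlib import Path

f = Path("some_file.txt")  # placeholder: any existing file

# what your loop does: hashes the constant bytes {f},
# so the result is identical for every file
print(hashlib.sha256(b'{f}').hexdigest())

# what you want: hash the actual file contents
# (note .hexdigest(): printing the hash object itself only shows its repr)
print(hashlib.sha256(f.read_bytes()).hexdigest())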
Now, for my initial findings from a first glance at your script:
You are using os.chdir, which changes global application state. It looks like you expect each task to have its own independent working directory, so that os.chdir would change into the same directories on every call. It does not: the first call switches the whole application into your zip-extracted subdirectory, and the second call to main takes place while the app is already inside that directory.

If it works at all (I'd have to run it, or track the results line by line), you will end up with all your results deeply nested in recursive tempdir/project-configuration/nitpick directories.
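For illustration, a minimal, self-contained sketch of that effect (the sub directory names are made up). Both coroutines run in the same process, so the second chdir resolves relative to wherever the first one left the process:

import asyncio
import os
import tempfile

async def worker(name: str) -> None:
    # os.chdir affects the whole process, not just this coroutine
    os.chdir("sub")
    print(name, "is now in", os.getcwd())

async def demo() -> None:
    root = tempfile.mkdtemp()
    os.chdir(root)
    os.makedirs("sub/sub")  # pre-created so both chdir calls succeed
    await worker("first")   # the process moves into <root>/sub
    await worker("second")  # resolved from <root>/sub, so we end in <root>/sub/sub

asyncio.run(demo())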
Second thing, maybe not related to your question, but this code is not concurrent at all: you declared main with async def, but at no point inside it is there another asynchronous call (marked by the await keyword), so the function simply runs to completion before the asyncio event loop can switch to another task.

The wget call would be the natural one to make asynchronous here. wget is a third-party library, so check its docs for an equivalent async call; that would do it. Otherwise, you can run it in another thread with asyncio's loop.run_in_executor, which also makes the code concurrent (see the sketch after the next paragraph).

Given the structure here (note the unused multiprocessing import), I suppose you adapted code that once used multiprocessing: if main ran in a separate process for each task, each process would have its own working directory, and the wget calls would be parallelized by the OS scheduling the processes, so everything would indeed work. That is not the case for asyncio code.
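A minimal sketch of the run_in_executor route, with a placeholder URL and file names; the same pattern works for any blocking call:

import asyncio

import wget

async def download(url: str, dest: str) -> str:
    loop = asyncio.get_running_loop()
    # wget.download blocks; delegating it to the default thread pool
    # lets the event loop run other tasks while the download is in flight
    return await loop.run_in_executor(None, wget.download, url, dest)

async def demo() -> None:
    url = "https://example.com/some-archive.zip"  # placeholder URL
    # the two downloads now genuinely overlap in time
    await asyncio.gather(download(url, "a.zip"), download(url, "b.zip"))

asyncio.run(demo())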
So, touching only these parts, here is how your code could look. First thing: never use os.chdir. It will break in anything more complex than a ten-line script (and in this case it breaks even earlier than that), because it mutates a single, global, hard-to-restore piece of process state. Always work with explicit paths and concatenate them instead. The traditional API for that, os.path.join, is verbose, but since Python 3.5 we have pathlib.Path, which lets you build paths with the / operator.
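Finally, here is a sketch of the whole script along those lines. It keeps your URL, archive name, and the nitpick directory; the per-pass tempdir/0 and tempdir/1 layout, the sorted() iteration, and the hexdigest printing are my own choices. wget.download accepts an output location as its second argument; if your version does not resolve a directory there, pass a full file path instead.

import asyncio
import hashlib
import zipfile
from pathlib import Path

import wget

URL = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"

async def fetch_and_extract(base: Path) -> Path:
    base.mkdir(parents=True, exist_ok=True)
    loop = asyncio.get_running_loop()
    # run the blocking download in a worker thread so the event loop stays free
    await loop.run_in_executor(None, wget.download, URL, str(base))
    zipfile.ZipFile(base / "project-configuration-master.zip").extractall(base)
    return base / "project-configuration" / "nitpick"

async def result() -> None:
    for i in range(2):
        # each pass gets its own directory: no os.chdir, no shared global state
        target = await fetch_and_extract(Path("tempdir") / str(i))
        for f in sorted(target.iterdir()):
            if f.is_file():
                # hash the file contents, not its name
                print(f.name, hashlib.sha256(f.read_bytes()).hexdigest())

asyncio.run(result())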