I am creating a database application that (among many other things) allows users to upload and download files. The files are stored on a file server and I have set up an Apache HTTP server with PHP scripts to process (i.e. upload and download) the files. The database only stores a link to the file and not the file itself. My question is this: How should I organize the files on my file server?
Currently, I am creating a directory structure based on the current date and I rename the files with an MD5 hash of the current date/time (including milliseconds) plus some random characters (i.e. I'm adding "salt"):
\\yyyy\mm\dd\debb40da158040e4f3b93f9576840c07
This (above) is the link that is stored in the database. (Of course, I also store the real file name in the database so that I can rename the file when the user goes to download it; the user never sees the actual link.)
I use yyyy\mm\dd for the directory structure to avoid performance issues (I'm told that a lot of files in the same directory can slow things down), and I rename the files with a unique string to avoid clashes when users upload files with the same name.
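In PHP terms, my upload script does something like the following (simplified; the storage root and form field name here are placeholders, and random_bytes needs PHP 7+):

    // Simplified sketch of the current scheme: yyyy/mm/dd dirs plus a salted-hash name.
    $storageRoot = '/var/files';                      // placeholder mount point
    $dir = $storageRoot . '/' . date('Y/m/d');
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);                      // create yyyy/mm/dd on demand
    }
    $salt = bin2hex(random_bytes(8));                 // the random characters ("salt")
    $name = md5(microtime(true) . $salt);             // hash of current date/time + salt
    $path = $dir . '/' . $name;
    move_uploaded_file($_FILES['upload']['tmp_name'], $path);
    // $path is the link stored in the database, alongside the real file name.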
I'd like to get other opinions on the best way to deal with storing files in this kind of situation. I've seen some developers keep the original file name but prefix it with the database ID of the corresponding row in the file-information table. I see some advantages to this approach: the file names stay "human readable", and you can still figure out what the files are if the database file-information table ever gets corrupted or deleted.
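For example, that prefix approach might look something like this (illustrative only; it assumes the file-info row is inserted first so its id is available, and that $pdo and $dir are already set up):

    // Sketch of the DB-id-prefix approach: store the file as "<row id>_<original name>".
    $rowId  = $pdo->lastInsertId();                   // id of the just-inserted file-info row
    $stored = $rowId . '_' . basename($_FILES['upload']['name']);
    move_uploaded_file($_FILES['upload']['tmp_name'], $dir . '/' . $stored);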
How about a structure that uses the upload date as the first-level directory, an MD5 hash of the file contents as the second level (hashing the contents makes the stored file unique and independent of its name), the upload timestamp as the third (which lets you keep different versions of the same file uploaded at different times), and the file itself, under its actual filename, at the fourth level?
e.g. <date timestamp>/<md5 of file contents>/<timestamp>/<filename>
This way your dir structure will have information on: the date of upload, the identity of the file contents (regardless of name), the version (upload time), and the original filename.
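As a rough sketch in PHP (the storage root and form field name are assumptions, not something from your question):

    // Build <date>/<md5 of contents>/<timestamp>/<filename> on upload.
    $storageRoot = '/var/files';
    $tmp      = $_FILES['upload']['tmp_name'];
    $original = basename($_FILES['upload']['name']);  // keep the real filename
    $hash     = md5_file($tmp);                       // hash of the file CONTENTS
    $now      = time();
    $dir = sprintf('%s/%s/%s/%d', $storageRoot, date('Y-m-d', $now), $hash, $now);
    mkdir($dir, 0755, true);
    move_uploaded_file($tmp, $dir . '/' . $original);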
The handicap with the file-contents MD5 hash is that for significantly large files there is going to be a slight overhead in generating the hash.
FURTHER IDEAS
You can break the date up as in your previous format (yyyy/mm/dd, or even just yyyy/date) if this is a system where many users upload files every day. A single date-level dir will create 365 dirs per year, and performance degrades once a dir holds more than roughly 10k entries (on a server-based OS, more than 100k, up to a few million), so at 365 new dirs a year that should give you about 25-30 years (10,000 / 365 ≈ 27) before noticing any degradation if you go with just a single date dir.
The hash of the file contents is the way to go to guarantee filename independence, I believe. Although it adds a small overhead to calculate the MD5 of the contents, that overhead is trivial compared to the upload time itself. E.g. a 100 MB file will take some amount of time to upload depending on connection speed; once it has been uploaded, you calculate the content hash on the fly using md5sum, which adds just a few seconds (5-6 for a 100 MB file) to the upload time as the user perceives it.
You can further use the MD5 of the file contents (assuming you store it in your db as well) as a checksum safeguarding the integrity of the originally uploaded file.
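A minimal check along those lines (the function name is just an illustration):

    // Compare a stored file against the MD5 recorded in the db.
    function fileIsIntact(string $path, string $storedMd5): bool {
        $actual = md5_file($path);
        return $actual !== false && hash_equals($storedMd5, $actual);
    }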
You don't actually need a timestamp (+salt) in the dir structure in the present scenario unless you want versioning of the file OR a guarantee that the same file hasn't been uploaded under a different name (otherwise you'll end up with different filenames under the same content-MD5 dir for a given day).
Not sure why you'd mind about the length of the MD5 string. It's not going to affect performance, and MD5 is widespread and well supported for other purposes too (e.g. verifying files). But if you really want to cut down on length, have a look at http://en.wikipedia.org/wiki/List_of_hash_functions and pick a 16-, 8-, or even 4-bit CRC to experiment with; just bear in mind that shorter checksums collide far more often (again, it depends on how you are going to use it, file contents or file names, and how large these are).
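For comparison, both are one-liners in PHP, so you can easily measure the difference yourself (the path is a placeholder):

    $md5 = md5_file('/var/files/example.bin');            // 32 hex chars
    $crc = hash_file('crc32b', '/var/files/example.bin'); // 8 hex chars, far more collision-prone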
Lastly, another alternative would be to do
<group>/<user_id>/<filehash>/<timestamp>/<filename>
where <group> covers user ids 1 to <acceptable number of entries in a dir>, say 10000 or less (the exact limit is something you can find by experimenting on your server to see how many entries degrade performance), and when the limit is reached a script creates a new group with the same structure. This way you avoid repeating similar info (dates, years, months, timestamps, etc.), you control the acceptable limits yourself, the same file is allowed to be uploaded by different users, the filehash tells you whether a file has already been uploaded regardless of its filename, the timestamp gives you versioning, and you end up with only one file in the final dir, under its original (or specified) name. If you are FaceBook and have a billion users, you can keep this structure and host clusters of groups of directories across different servers. If you have a small website with, say, 1000 users, you don't even need the group bit.
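A sketch of mapping a user id to a group and building that path, reusing the variables from the sketch above (GROUP_SIZE is the per-dir limit you'd find by experimenting):

    // <group>/<user_id>/<filehash>/<timestamp>/<filename>
    const GROUP_SIZE = 10000;                         // tune on your own server
    $group = intdiv($userId - 1, GROUP_SIZE) + 1;     // user ids 1..10000 -> group 1
    $dir   = sprintf('%s/%d/%d/%s/%d',
                     $storageRoot, $group, $userId, md5_file($tmp), time());
    mkdir($dir, 0755, true);
    move_uploaded_file($tmp, $dir . '/' . $original);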