Storing & accessing up to 10 million files in Linux


I'm writing an app that needs to store lots of files up to approx 10 million.

They are presently named with a UUID and will be around 4 MB each, always the same size. Reading from and writing to these files will always be sequential.

Two main questions I am seeking answers for:

1) Which filesystem would be best for this: XFS or ext4?
2) Would it be necessary to store the files beneath subdirectories in order to reduce the number of files within a single directory?

For question 2, I note that people have tried to find the limit on the number of files XFS can store in a single directory and haven't hit one, even with millions of files, and they reported no performance problems. What about under ext4?

Googling around for people doing similar things, I found suggestions to store the inode number in the database index (which I'm also using) as the link to the file, instead of the filename, for performance. However, I don't see a usable API for opening a file by inode number. That seemed to be more of a suggestion for improving performance under ext3, which I am not intending to use anyway.

What are the ext4 and XFS limits? What performance benefits are there from one over the other and could you see a reason to use ext4 over XFS in my case?


There are 2 answers

Zan Lynx (accepted answer)

You should definitely store the files in subdirectories.

EXT4 and XFS both use efficient lookup methods for file names, but if you ever need to run tools such as ls or find over the directories, you will be very glad to have the files in manageable chunks of 1,000 to 10,000 files.

The inode number thing is to improve the sequential access performance of the EXT filesystems. The metadata is stored in inodes and if you access these inodes out of order then the metadata accesses are randomized. By reading your files in inode order you make the metadata access sequential too.
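To make the inode-order idea concrete, here is a minimal sketch: list a directory, sort the entries by inode number, then read them in that order so metadata access stays roughly sequential. The function name and generator shape are my own choices, not from the answer.

```python
import os

def iter_in_inode_order(dirpath):
    """Yield (path, contents) for every regular file in dirpath,
    ordered by inode number, so that inode (metadata) reads on
    ext4 happen in roughly on-disk order instead of randomly."""
    entries = [(entry.inode(), entry.path)
               for entry in os.scandir(dirpath)
               if entry.is_file()]
    for _inode, path in sorted(entries):
        with open(path, "rb") as f:
            yield path, f.read()
```

In practice you would batch this per subdirectory rather than holding 10 million entries in memory at once.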

5
MarkR On

Modern filesystems will let you store 10 million files all in the same directory if you like. But tools (ls and its friends) will not work well.

I'd recommend a single level of subdirectories, a fixed number of them, perhaps 1,000, with the files distributed among them (10,000 files per directory is tolerable to the shell and to ls).
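Since the files are already named with UUIDs, a fixed fan-out can be derived from the name itself. This sketch maps a UUID to one of 1,000 buckets; the helper name and the choice of hashing the first hex digits are assumptions for illustration, not part of the answer.

```python
import os

def shard_path(base, uuid_str, fanout=1000):
    """Map a UUID filename to one of `fanout` fixed subdirectories,
    e.g. /data/0042/123e4567-... . Deterministic: the same UUID
    always lands in the same bucket."""
    # Use the first 8 hex digits of the UUID as a cheap, stable hash.
    bucket = int(uuid_str.replace("-", "")[:8], 16) % fanout
    return os.path.join(base, f"{bucket:04d}", uuid_str)
```

Create the 1,000 directories once up front; after that, both the writer and the database only need the UUID to reconstruct the full path.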

I've seen systems that create many levels of directories; this is truly unnecessary, increases inode consumption, and makes traversal slower.

10M files should not really be a problem either, unless you need to do bulk operations on them.

I expect you will need to prune old files, but something like "tmpwatch" will probably work just fine with 10M files.
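If tmpwatch isn't available, the same pruning can be sketched in a few lines: walk the tree and delete files whose modification time is older than a cutoff. The function name and the mtime-based policy are assumptions standing in for whatever retention rule the app actually needs.

```python
import os
import time

def prune_older_than(base, days):
    """Delete regular files under `base` not modified in `days` days,
    roughly what tmpwatch does. Returns the number of files removed."""
    cutoff = time.time() - days * 86400
    removed = 0
    for dirpath, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed
```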