GNU sort - default buffer size

I have read the full documentation for GNU sort and searched online, but I cannot find the default for the --buffer-size option (which determines how much system memory the program uses when it runs). I am guessing it is somehow determined from total system memory (or perhaps from the memory available when the program begins execution)? How can I determine this?

Update: I've experimented a bit, and it seems that when I don't specify a --buffer-size value, sort ends up using very little RAM and therefore runs very slowly. It would be nice to better understand what exactly determines this behavior.

There are 2 answers

Mr. Llama On BEST ANSWER

I went digging through the coreutils sort source code and found these functions: default_sort_size and sort_buffer_size.

It turns out that --buffer-size (sort_size in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size value is specified, the default_sort_size function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:

size = MIN(SIZE_MAX, resource_limit) / 2;
mem  = MAX(available_memory, total_memory / 8);

if ( size > total_memory * 0.75 )
    size = total_memory * 0.75;

buffer_max = MIN(mem, size);
buffer_max = MAX(buffer_max, MIN_SORT_SIZE);
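
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the same arithmetic. It is only a restatement of the summary above, not the coreutils code: the function signature, the use of sys.maxsize as a stand-in for SIZE_MAX, and the MIN_SORT_SIZE value are all assumptions made for illustration.

```python
import sys

# Placeholder floor. The real MIN_SORT_SIZE in sort.c is derived from other
# constants, so this value is an assumption for illustration only.
MIN_SORT_SIZE = 1 << 20


def default_sort_size(resource_limit, available_memory, total_memory,
                      size_max=sys.maxsize, min_sort_size=MIN_SORT_SIZE):
    """Return the maximum buffer size in bytes, following the summary above."""
    # Half of whichever is tighter: the largest object size or the resource limit.
    size = min(size_max, resource_limit) // 2
    # Available memory, but never less than 1/8 of total memory.
    mem = max(available_memory, total_memory // 8)
    # Never more than 3/4 of total memory.
    size = min(size, total_memory * 3 // 4)
    # The smaller of the two candidates, but never below the floor.
    return max(min(mem, size), min_sort_size)


GiB = 1 << 30
NO_LIMIT = sys.maxsize  # stands in for "no resource limit set"

# 16 GiB machine with 10 GiB free: the buffer may grow to the free memory.
print(default_sort_size(NO_LIMIT, 10 * GiB, 16 * GiB) // GiB)
# Same machine with only 1 GiB free: the 1/8-of-total floor applies.
print(default_sort_size(NO_LIMIT, 1 * GiB, 16 * GiB) // GiB)
```

With a 4 GiB resource limit on the same machine, the result drops to 2 GiB, because only half of the limit is used.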

The other function, sort_buffer_size, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:

if (sort_size is set)
    size_bound = sort_size;
else
    size_bound = default_sort_size();

size = line_bytes + 2;

for each input_file
    if (input_file is regular)
        file_size = input_file_size;
    else
        if (sort_size is set)
            return sort_size;
        else
            file_size = guess;

    worst_case = file_size * worst_case_per_input_byte + 1;

    if (worst_case overflows || size + worst_case >= size_bound)
        return size_bound;
    else
        size += worst_case;

return size;

Possibly the most important point about sort_buffer_size is that if you're sorting data from STDIN or a pipe, it returns sort_size (i.e. the --buffer-size value) directly if one was provided, and otherwise falls back to a guessed input size. For regular files it makes a rough worst-case calculation from the file sizes and uses sort_size only as an upper limit.
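
The same logic can be sketched in Python to see how the pipe and regular-file cases diverge. Treat this as an illustration under stated assumptions rather than the real implementation: input files are modelled as a list of sizes with None meaning "not a regular file", worst_case_per_input_byte is assumed to be line_bytes + 1, size_guess is a placeholder for the constant in sort.c, and the overflow check is omitted because Python integers do not overflow.

```python
def sort_buffer_size(file_sizes, line_bytes, default_bound,
                     sort_size=None, size_guess=1 << 20):
    """Sketch of the buffer-size decision summarized above.

    file_sizes:    one entry per input; bytes for a regular file,
                   None for a pipe or STDIN (hypothetical encoding).
    default_bound: what default_sort_size() would return.
    sort_size:     the --buffer-size value, or None if not given.
    size_guess:    assumed stand-in for the input size guess in sort.c.
    """
    size_bound = sort_size if sort_size is not None else default_bound
    # Assumption: in the worst case every input byte is its own line.
    worst_case_per_input_byte = line_bytes + 1
    size = line_bytes + 2

    for file_size in file_sizes:
        if file_size is None:
            # Unknown size: an explicit --buffer-size wins outright.
            if sort_size is not None:
                return sort_size
            file_size = size_guess

        worst_case = file_size * worst_case_per_input_byte + 1
        if size + worst_case >= size_bound:
            return size_bound
        size += worst_case

    return size


MiB = 1 << 20
# Pipe input with -S 64M: the requested size is used as-is.
print(sort_buffer_size([None], 32, 8192 * MiB, sort_size=64 * MiB))
# Pipe input without -S: the result depends only on the guess.
print(sort_buffer_size([None], 32, 8192 * MiB, size_guess=1000))
```

The second call shows why piped input gets a small buffer by default: without a known file size, the estimate is driven entirely by the guess rather than by the memory available.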

Doctor J On

To summarize in English, the defaults are:

Reading from a real file: Use all free memory, up to 3/4 and not less than 1/8 of total memory.

(If there is a process (rusage) memory limit in effect, sort will not use more than half of that.)

Reading from a pipe: Use a small, fixed amount (tens of MB).
You will probably want to set -S (the short form of --buffer-size) yourself.

Current for GNU coreutils 8.29, Jan 2018.
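
If you are feeding sort through a pipe from a script, passing the buffer size explicitly might look like the Python sketch below. The helper names build_sort_command and sort_text are hypothetical, and the 256M default is an arbitrary example value, not a recommendation; only the --buffer-size option itself comes from GNU sort.

```python
import subprocess


def build_sort_command(buffer_size=None, files=()):
    """Build a sort command line, optionally with an explicit buffer size.

    buffer_size is passed through verbatim, e.g. "256M" or "1G".
    """
    cmd = ["sort"]
    if buffer_size is not None:
        cmd.append(f"--buffer-size={buffer_size}")
    cmd.extend(files)
    return cmd


def sort_text(text, buffer_size="256M"):
    """Pipe text through sort on STDIN and return the sorted output.

    Assumes a sort that accepts --buffer-size (GNU sort) is on PATH.
    """
    result = subprocess.run(
        build_sort_command(buffer_size),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_sort_command("256M"))
```

Calling sort_text("b\na\n") would then run the equivalent of piping that text into sort --buffer-size=256M, which is the case where the default would otherwise be small.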