Understanding /proc/sys/vm/lowmem_reserve_ratio

I cannot work out what the variable "lowmem_reserve_ratio" means from the explanation in Documentation/sysctl/vm.txt. I have also searched online, but every explanation I found just repeats the text in vm.txt.

It would be really helpful if somebody could explain it or point me to a link about it. Here is the original explanation:

The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32
-
Note: # of this elements is one fewer than number of zones. Because the highest
      zone's value is not necessary for following calculation.

But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
in /proc/zoneinfo like followings. (This is an example of x86-64 box).
Each zone has an array of protection pages like this.

-
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
        :
        :
    numa_other   0
        protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
        :
-
These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are required to this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
zone[i]'s protection[j] is calculated by following expression.

(i < j):
  zone[i]->protection[j]
  = (total sums of present_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
   (should not be protected. = 0;
(i > j):
   (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others).
As above expression, they are reciprocal number of ratio.
256 means 1/256. # of protection pages becomes about "0.39%" of total present
pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
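
To make the check described in the quoted text concrete, here is a minimal userspace sketch in C. It is not kernel code: the function name `zone_passes_watermark` and its parameters are made up for illustration, and the real allocator check takes more factors into account than this.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function: a zone is treated as usable
 * for a request only if its free pages exceed the chosen watermark plus
 * the protection entry indexed by the zone type the request was aimed at.
 */
bool zone_passes_watermark(unsigned long free_pages,
                           unsigned long watermark,
                           unsigned long protection)
{
    return free_pages > watermark + protection;
}
```

With the numbers from the quoted example, `zone_passes_watermark(1355, 4, 2004)` is false, so the DMA zone is skipped for a normal-page request, while `zone_passes_watermark(1355, 4, 0)` is true, which is the DMA-request case where protection[0] is 0.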

There are 3 answers

Victor Choy On

I found that the kernel source code explains it very clearly.

    /*
     * setup_per_zone_lowmem_reserve - called whenever
     *  sysctl_lowmem_reserve_ratio changes.  Ensures that each zone
     *  has a correct pages reserved value, so an adequate number of
     *  pages are left in the zone after a successful __alloc_pages().
     */
    static void setup_per_zone_lowmem_reserve(void)
    {
        struct pglist_data *pgdat;
        enum zone_type j, idx;

        for_each_online_pgdat(pgdat) {
            for (j = 0; j < MAX_NR_ZONES; j++) {
                struct zone *zone = pgdat->node_zones + j;
                unsigned long managed_pages = zone->managed_pages;

                zone->lowmem_reserve[j] = 0;

                idx = j;
                while (idx) {
                    struct zone *lower_zone;

                    idx--;

                    if (sysctl_lowmem_reserve_ratio[idx] < 1)
                        sysctl_lowmem_reserve_ratio[idx] = 1;

                    lower_zone = pgdat->node_zones + idx;
                    lower_zone->lowmem_reserve[j] = managed_pages /
                        sysctl_lowmem_reserve_ratio[idx];
                    managed_pages += lower_zone->managed_pages;
                }
            }
        }

        /* update totalreserve_pages */
        calculate_totalreserve_pages();
    }

The source even includes a worked example:

    /*
     * results with 256, 32 in the lowmem_reserve sysctl:
     *  1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
     *  1G machine -> (16M dma, 784M normal, 224M high)
     *  NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
     *  HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
     *  HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
     *
     * TBD: should special case ZONE_DMA32 machines here - in those we normally
     * don't need any ZONE_NORMAL reservation
     */
    int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
    #ifdef CONFIG_ZONE_DMA
         256,
    #endif
    #ifdef CONFIG_ZONE_DMA32
         256,
    #endif
    #ifdef CONFIG_HIGHMEM
         32,
    #endif
         32,
    };

In short, the calculation looks like this:

zone[1]->lowmem_reserve[2] = zone[2]->managed_pages / sysctl_lowmem_reserve_ratio[1]
zone[0]->lowmem_reserve[2] = (zone[1]->managed_pages + zone[2]->managed_pages) / sysctl_lowmem_reserve_ratio[0]
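
As a rough userspace sketch of the same loop (not the kernel code itself), the function below works on plain arrays instead of `struct zone`. The names `NR_ZONES` and `compute_lowmem_reserve` are invented for this example, and a three-zone layout (DMA, Normal, HighMem) is assumed.

```c
#include <stddef.h>

#define NR_ZONES 3 /* assumed layout: 0 = DMA, 1 = Normal, 2 = HighMem */

/*
 * Userspace imitation of setup_per_zone_lowmem_reserve() for one node.
 * managed[j]    : pages managed by zone j
 * ratio[i]      : lowmem_reserve_ratio entry for zone i (one fewer than zones)
 * reserve[i][j] : pages held back in zone i from requests aimed at zone j
 */
void compute_lowmem_reserve(const unsigned long managed[NR_ZONES],
                            const unsigned long ratio[NR_ZONES - 1],
                            unsigned long reserve[NR_ZONES][NR_ZONES])
{
    /* Entries with i >= j stay 0, as in the documentation. */
    for (size_t i = 0; i < NR_ZONES; i++)
        for (size_t j = 0; j < NR_ZONES; j++)
            reserve[i][j] = 0;

    for (size_t j = 0; j < NR_ZONES; j++) {
        unsigned long sum = managed[j]; /* pages in zones idx+1 .. j */
        size_t idx = j;

        while (idx > 0) {
            idx--;
            /* Clamp locally; the kernel instead rewrites the sysctl value. */
            unsigned long r = ratio[idx] < 1 ? 1 : ratio[idx];
            reserve[idx][j] = sum / r;
            sum += managed[idx];
        }
    }
}
```

Assuming 4 KiB pages, the 1G machine from the comment has managed = {4096, 200704, 57344} pages. With ratio = {256, 32}, this gives reserve[0][1] = 784 pages (784M/256), reserve[1][2] = 1792 pages (224M/32) and reserve[0][2] = 1008 pages ((224M+784M)/256).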
cha5on On

I found the wording in that document really confusing too. Looking at the source in mm/page_alloc.c helped to clear it up, so let me try my hand at a more straightforward explanation:

As the page you quoted says, these numbers "are reciprocal number of ratio". Worded differently: these numbers are divisors. So to calculate the reserve for a given zone in a node, you take the sum of the pages in that node's zones higher than that one and divide it by that zone's divisor. The result is the number of pages held back in that zone from allocations that could have used the higher zones.

Example: let's assume a 1 GiB node with 768 MiB in zone Normal and 256 MiB in zone HighMem (assume no zone DMA). Let's assume the default highmem reserve "ratio" (divisor) of 32. And let's assume the typical 4 KiB page size. Now we can calculate the reserve area for zone Normal:

  1. Sum of "higher" zones than zone Normal (just HighMem): 256 MiB * (1024 KiB / 1 MiB) * (1 page / 4 KiB) = 65536 pages
  2. Area reserved in zone Normal for this node: 65536 pages / 32 = 2048 pages = 8 MiB.

The concept stays the same when you add more zones and nodes. Just remember that the reserved size is counted in whole pages: the division truncates, so you never reserve a fraction of a page.
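
The arithmetic in this example can be checked with a few lines of C. This is only a sketch: `mib_to_pages` and `reserve_pages` are hypothetical helper names, and the 4 KiB page size is the assumption stated above.

```c
#define PAGE_SIZE_KIB 4UL /* assumed page size in KiB */

/* Convert a size in MiB to a page count. */
unsigned long mib_to_pages(unsigned long mib)
{
    return mib * 1024UL / PAGE_SIZE_KIB;
}

/*
 * Pages reserved in a zone: the pages in all higher zones divided by the
 * zone's divisor. Integer division drops any fractional page.
 */
unsigned long reserve_pages(unsigned long higher_zone_pages,
                            unsigned long divisor)
{
    return higher_zone_pages / divisor;
}
```

Here `mib_to_pages(256)` is 65536 and `reserve_pages(mib_to_pages(256), 32)` is 2048 pages, which is 8 MiB at 4 KiB per page.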

JoKoT3 On

Having the same problem as you, I googled (a lot) and stumbled upon this page, which might (or might not) be more understandable than the kernel doc.

(I am not quoting it here because it would be unreadable.)