Slow guest performance after live snapshot via virsh (QEMU/KVM)

I came across a weird problem for which I cannot find a solution elsewhere. Maybe you can help me.

I have a system running Ubuntu 20 LTS which is the host of six guests (four Ubuntu 20 LTS and two Windows Server 2019), and they run quite fast until I take live snapshots. The guests run on QEMU/KVM with QCOW2 disk files, and I use virsh to manage them.

I take the live snapshots (without the RAM state) of the guests with the following command:

virsh snapshot-create-as $VM --no-metadata $timestamp --disk-only --atomic

This almost immediately snapshots all the virtual disks of a particular guest and creates new delta files to which the differences are written. Afterwards, every disk of every guest has the following structure:

base <- snapshot <- live_delta_file
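For reference, the single-guest command above can be wrapped in a loop over all running domains. This is only a sketch: the timestamp format and the guard around virsh are my additions, not from the original setup.

```shell
# Sketch: snapshot the disks of every running guest in one pass.
# Assumptions: virsh is installed and the timestamp format is arbitrary.
timestamp=$(date +%Y%m%d-%H%M%S)
if command -v virsh >/dev/null 2>&1; then
    for VM in $(virsh list --name); do
        virsh snapshot-create-as "$VM" --no-metadata "$timestamp" \
            --disk-only --atomic
    done
else
    echo "virsh not found; skipping snapshot loop"
fi
```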

After copying away the snapshots, I commit them to their base files with the following command:

virsh blockcommit $currentVM $disk --base $path_to_base --top $path_to_snapshot --verbose --wait
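Since each guest can have several disks, the blockcommit step can also be looped per disk. A sketch under my own assumptions: the domain name is a placeholder, virsh must be installed, and $path_to_base / $path_to_snapshot would in practice differ per disk as in the question.

```shell
# Sketch: commit the snapshot layer of each disk of one guest back into
# its base. 'myguest' is a placeholder domain name (my assumption).
currentVM=myguest
if command -v virsh >/dev/null 2>&1; then
    # In 'virsh domblklist --details' output, column 2 is the device type
    # (disk/cdrom) and column 3 is the target (vda, sda, ...).
    for disk in $(virsh domblklist "$currentVM" --details | awk '$2=="disk" {print $3}'); do
        virsh blockcommit "$currentVM" "$disk" --base "$path_to_base" \
            --top "$path_to_snapshot" --verbose --wait
    done
else
    echo "virsh not found; skipping blockcommit loop"
fi
```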

After that, I delete the snapshots. All of this works without producing any errors. However, after taking the snapshots, and while all the guests are still running without errors, each VM is horribly slow for any command in the shell. Furthermore, I can see via top on the host that the RAM usage of each guest has dropped dramatically (e.g., for the Windows Server 2019 with GUI, from 25 GB to 2.5 GB).

It seems that all the cached data was removed from RAM, which of course strongly reduces performance. However, taking the snapshots (without the --quiesce parameter) should not lead to this behavior, should it? After a reboot of all the guests, everything is fast again (while nothing changed in the snapshot structure).

Do you have an idea which configuration or situation can lead to such a behavior?

Thank you in advance!

----- EDIT -----

It seems that the actual problem is copying away the files via scp/rsync after the snapshots were taken: one of these programs (rsync?) eats up all the memory on the host, which leads to parts of the guests' RAM being swapped out to disk.

Even after the copy process has finished, the copied data seems to remain in the host's page cache, and the guests keep using parts of the host's swap space.

This of course explains the bad performance of the guests. It can be fixed by clearing the page cache and the swap space by using the following commands:

sync; echo 1 > /proc/sys/vm/drop_caches
swapoff -a; swapon -a
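Before cycling swap, it can be worth checking whether anything is actually swapped out, because swapoff -a has to page everything back into RAM. This check is my addition, not part of the original fix:

```shell
# Sketch (my addition): compute swap in use from /proc/meminfo before
# deciding to run 'swapoff -a; swapon -a'.
swap_used_kb=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
echo "swap in use: ${swap_used_kb} kB"
if [ "${swap_used_kb:-0}" -gt 0 ]; then
    echo "worth running: swapoff -a && swapon -a"
fi
```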

But be careful: clearing the swap space can take several hours, during which the guests are effectively paused. Either do it at night when they are not in use, or solve the problem at its root, i.e., in the rsync/scp part.
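One way to attack it at the root is to copy the snapshot files with direct I/O, so they never enter the host's page cache in the first place (alternatives would be throttling rsync with --bwlimit, or running it under the nocache wrapper). The following is only a local sketch with demo files of my choosing; direct I/O needs filesystem support, hence the plain-copy fallback:

```shell
# Sketch: copy a (demo) image with O_DIRECT so the copy does not pollute
# the host page cache. Falls back to a normal copy where direct I/O is
# unsupported (e.g. tmpfs). The file names here are demo placeholders.
src=$(mktemp)
dst="${src}.copy"
dd if=/dev/zero of="$src" bs=1M count=4 2>/dev/null    # stand-in for a snapshot file
dd if="$src" of="$dst" bs=1M iflag=direct oflag=direct 2>/dev/null \
    || dd if="$src" of="$dst" bs=1M 2>/dev/null
cmp -s "$src" "$dst" && copy_ok=yes && echo "copy ok"
```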

1 answer

Answered by Toon:

I recognize your experience. I solved it by making the caching and swapping less aggressive, like so. Maybe it can help you too.

(from /etc/sysctl.conf)

# Make the kernel less swappy
vm.swappiness = 5

# Make the kernel free cached dentries and inodes sooner
vm.vfs_cache_pressure = 200
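To see what a host is currently using before adopting these values, the live settings can be read from /proc (this check is my addition; applying new values needs root, e.g. via sysctl -w, or sysctl -p after editing /etc/sysctl.conf):

```shell
# Read the current values; on Ubuntu the defaults are typically
# vm.swappiness=60 and vm.vfs_cache_pressure=100.
swappiness=$(cat /proc/sys/vm/swappiness)
cache_pressure=$(cat /proc/sys/vm/vfs_cache_pressure)
echo "vm.swappiness=${swappiness} vm.vfs_cache_pressure=${cache_pressure}"
```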