rocksdb write stall with many writes of the same data

1.1k views Asked by At

First of all I have to explain our setup, as it is quite different to the normal database server with raidgroups and so on. Usually our customers buy a server like a HP DL 380 Gen10 with two 300 GB HDDs (not SSDs) which are configured as a RAID 1 and running with Windows.

We are only managing the meta data of other storages here, so that a client can ask us and find its information on those large storages.

As our old database was always corrupt, we searched for a new more stable database which also has not that much overhead and found rocksdb, currently with version 6.12.0.

Unfortunately after it ran for some hours, it seems to block my program for many minutes due to a write stall:

2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default]
 Stopping writes because we have 2 immutable memtables (waiting for flush), 
max_write_buffer_number is set to 2

Am I right that the write workload is too much for the machine?

Our software is a service which retrieves at least one update per second from up to 2000 different servers (limit might be increased in the future if possible). Most of the time, the same database entries are only updated/written again, as one of the informations inside the entry is the current time of the respective server. Of course I could try to write the data less often to hdd, but if then a client requests this data from us, our information would not be up to date.

So my questions are:

  1. I assume that currently every writerequest is really written to the disk, is there a way to enable some kind of cache (or maybe increase its size, if not sufficient?) so that the data is not written less often to the hdd, but read requests return the correct data from the memory?

  2. I also see that there is a merge operator, but I'm not sure when this merge would take place? Is there already a cache like mentioned under 1. and the data is collected for some time, then merged and then written to hdd?

  3. Are there any other optimizations which could help me in this situation?

Any help is welcome. Thanks in advance.

Here some more loglines if they might be interesting:

** File Read Latency Histogram By Level [default] **

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp 
Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
  L0      2/0   17.12 MB   0.5      0.0     0.0      0.0       0.4      0.4       0.0   1.0      0.0     81.5      5.15              0.00        52    0.099       0      0
  L1      3/0   192.76 MB   0.8      3.3     0.4      2.9       3.2      0.3       0.0   8.0    333.3    327.6     10.11              0.00        13    0.778   4733K   119K
  L2     20/0    1.02 GB   0.4      1.6     0.4      1.2       1.2     -0.0       0.0   3.1    387.5    290.0      4.30              0.00         7    0.614   2331K   581K
 Sum     25/0    1.22 GB   0.0      4.9     0.8      4.1       4.9      0.7       0.0  11.8    257.4    254.5     19.56              0.00        72    0.272   7064K   700K
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
 Low      0/0    0.00 KB   0.0      4.9     0.8      4.1       4.5      0.3       0.0   0.0    349.5    316.4     14.40              0.00        20    0.720   7064K   700K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.4      0.4       0.0   0.0      0.0     81.7      5.12              0.00        51    0.100       0      0
User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     50.9      0.03              0.00         1    0.030       0      0
Uptime(secs): 16170.9 total, 0.0 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.31 MB/s write, 4.92 GB read, 0.31 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **
2020/10/01-15:53:21.248110 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10465. Immutable memtables: 0.
2020/10/01-15:58:44.678596 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10466. Immutable memtables: 1.
2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2020/10/01-16:02:57.448977 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/01-16:02:57.449034 2328 [db\db_impl\db_impl.cc:901] 
** DB Stats **
Uptime(secs): 16836.8 total, 665.9 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.18 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.18 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 517K writes, 517K keys, 517K commit groups, 1.0 writes per commit group, ingest: 73.63 MB, 0.11 MB/s
Interval WAL: 517K writes, 0 syncs, 517059.00 writes per sync, written: 0.07 MB, 0.11 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

Our watchdog caused a breakpoint after a mutex was blocked for several minutes, after that this showed up in the logfile:

2020/10/02-17:44:18.602776 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/02-17:44:18.602990 2328 [db\db_impl\db_impl.cc:901] 
** DB Stats **
Uptime(secs): 109318.0 total, 92481.2 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.03 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.03 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
  L0      2/0   17.12 MB   0.5      0.0     0.0      0.0       0.4      0.4       0.0   1.0      0.0     81.5      5.15              0.00        52    0.099       0      0
  L1      3/0   192.76 MB   0.8      3.3     0.4      2.9       3.2      0.3       0.0   8.0    333.3    327.6     10.11              0.00        13    0.778   4733K   119K
  L2     20/0    1.02 GB   0.4      1.6     0.4      1.2       1.2     -0.0       0.0   3.1    387.5    290.0      4.30              0.00         7    0.614   2331K   581K
 Sum     25/0    1.22 GB   0.0      4.9     0.8      4.1       4.9      0.7       0.0  11.8    257.4    254.5     19.56              0.00        72    0.272   7064K   700K
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
 Low      0/0    0.00 KB   0.0      4.9     0.8      4.1       4.5      0.3       0.0   0.0    349.5    316.4     14.40              0.00        20    0.720   7064K   700K
High      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.4      0.4       0.0   0.0      0.0     81.7      5.12              0.00        51    0.100       0      0
User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     50.9      0.03              0.00         1    0.030       0      0
Uptime(secs): 109318.0 total, 92481.2 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.05 MB/s write, 4.92 GB read, 0.05 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 1 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **
1

There are 1 answers

1
Jose Fernando Lopez Fernandez On

RocksDB does have a built-in cache, which you can definitely explore, but I suspect your write stalls can be handled by increasing either the max_write_buffer_number or max_background_flushes configuration parameters.

The former will reduce the size of the memory table table to flush, while the latter will incerase the number of background threads available for flushing operations.

It's hard to say whether the workload is too high for the machine just from this information, because there are a lot of moving parts, but I doubt it. For starters, two background flush threads is pretty low. Also, I'm obviously not familiar with the workload you're testing for, but since you're not using the caching mechanism, the performance can only improve from here.

Disk-writes are tricky. Any linux application "writing to disk" is really calling the write system call and thus writing to an intermediate kernel buffer which is then written to its intended destination once the kernel deems fit and proper. The process of actually updating the data on the disk is known as writeback (I highly recommend reading the file I/O chapter from Robert Love's Linux System Programming for more on this).

You can call use the fsync(2) system call to manually commit a file's system cache to disk, but it's not recomended, since the kernel tends to know when the best time to do this is.

Before rolling any bespoke optimizations, I would checkout the tuning guide. If tuning the configuration for your workload doesn't solve the problem, I would probably look into a RAID configuration. Write stalls are a disk I/O throughput limitation, so data striping might give you enough of a boost, even giving you the option to add some redundancy into your storage solution, depending on the RAID configuration you opt for.

Either way, my gut feeling is that 2000 connections per second seems way too low for the problem to be the current machine's CPU or even memory. Unless each request requires a significant amount of work to process, I doubt this is it. Of course, it never hurts to make sure your endpoint's server configuration is also optimally tuned.