perf: pacing user writes, flushes and compactions #7

Closed
petermattis opened this issue Aug 3, 2018 · 24 comments · Fixed by #179

@petermattis
Collaborator

Add a mechanism to pace (rate limit) user writes, flushes and compactions. Some level of rate limiting of user writes is necessary to prevent user writes from blowing up memory (if flushes can't keep up) or creating too many L0 tables (if compactions can't keep up). The existing control mechanisms mirror those present in RocksDB: Options.MemTableStopWritesThreshold, Options.L0CompactionThreshold and Options.L0SlowdownWritesThreshold. These control mechanisms are blunt, resulting in undesirable hiccups in write performance. The controller object was an initial attempt at providing smoother write throughput and it achieved some success, but it is too fragile.

The problem here is akin to the problem of pacing a concurrent garbage collector. The Go GC pacer design should serve as inspiration. We want to balance the rate of dirty data arriving in memtables with the rate of flushes and compactions. Flushes and compactions should also be throttled to run at the rate necessary to keep up with incoming user writes and no faster, so as to leave CPU available for user reads and other operations. A challenge here is to adjust quickly to changing user load.

@petermattis
Collaborator Author

Note that a goal here should be the removal of Options.MemTableStopWritesThreshold, Options.L0CompactionThreshold and Options.L0SlowdownWritesThreshold. The system should dynamically adjust to the load. Additionally, there should be no need for a knob to control the number of background compactions.

@petermattis petermattis changed the title Pacing user writes, flushes and compactions perf: pacing user writes, flushes and compactions Aug 11, 2018
@tbg
Member

tbg commented Jan 25, 2019

I ran pebble sync -c 24 -d 10m bench on my gceworker with the following diff today to see the write stalls in action. I needed to switch to db.NoSync because otherwise syncing would push back on the writes enough to never really let the performance dip. For comparison, with syncing I would see 5-6k ops/sec. Without syncing, it's... well, anywhere from 0 to 100k and varying.

I think I'm seeing the bad behavior that results from not pacing user writes gracefully pretty clearly. L0->L1 compactions are starting to take a long time (>2min) and user writes completely stall for extended periods of time (tens of seconds). L1 seems to be growing out of proportion because there's never any time to move something into L2.

Is this a good experiment against which to measure any steps we take?

diff --git a/cmd/pebble/sync.go b/cmd/pebble/sync.go
index be6e992..e4d1642 100644
--- a/cmd/pebble/sync.go
+++ b/cmd/pebble/sync.go
@@ -60,7 +60,7 @@ func runSync(cmd *cobra.Command, args []string) {
 								log.Fatal(err)
 							}
 						}
-						if err := b.Commit(db.Sync); err != nil {
+						if err := b.Commit(db.NoSync); err != nil {
 							log.Fatal(err)
 						}
 						latency.Record(time.Since(start))
diff --git a/compaction.go b/compaction.go
index 8083835..aa41427 100644
--- a/compaction.go
+++ b/compaction.go
@@ -8,9 +8,11 @@ import (
 	"bytes"
 	"errors"
 	"fmt"
+	"log"
 	"os"
 	"path/filepath"
 	"sort"
+	"time"
 	"unsafe"
 
 	"github.com/petermattis/pebble/db"
@@ -299,7 +301,7 @@ func (c *compaction) String() string {
 	for i := range c.inputs {
 		fmt.Fprintf(&buf, "%d:", i+c.level)
 		for _, f := range c.inputs[i] {
-			fmt.Fprintf(&buf, " %d:%s-%s", f.fileNum, f.smallest, f.largest)
+			fmt.Fprintf(&buf, " %d:%q-%q", f.fileNum, f.smallest, f.largest)
 		}
 		fmt.Fprintf(&buf, "\n")
 	}
@@ -643,6 +645,11 @@ func (d *DB) compact1() (err error) {
 	if c == nil {
 		return nil
 	}
+	tBegin := time.Now()
+	defer func() {
+		elapsed := time.Since(tBegin)
+		log.Printf("compaction %s took %.2fs", c, elapsed.Seconds())
+	}()
 
 	jobID := d.mu.nextJobID
 	d.mu.nextJobID++

@petermattis
Collaborator Author

Is this a good experiment against which to measure any steps we take?

Yes, it is a great place to start, though it isn't sufficient to fully test a throttling mechanism. It is also interesting to test a workload that has some sort of periodic behavior with respect to writes vs reads, and a workload that experiences a phase change (e.g. 100% writes -> 50% writes -> 0% writes).

@petermattis
Collaborator Author

Regarding pacing of user writes and flushing: I had some time to think about this while waiting at the doctor's office this morning and wanted to capture my thoughts. First some background:

  • Flushing is the act of taking a memtable and writing it to an L0 sstable.
  • Memtables are append-only and occupy a fixed, contiguous chunk of memory. When a memtable fills up it is written to disk as an sstable to free up memory and to allow the associated portion of the WAL to be released / reused.

The current Pebble code flushes a memtable as fast as possible. This isn't entirely true, because Options.BytesPerSync forces an fsync periodically, but the rate at which a memtable is flushed is not tied to the rate at which user writes are coming in. Why is this problematic? Because flushing consumes CPU and disk IO. Those resources are scarce, and a short burst in which flushing consumes them excessively can have an impact on the performance of user writes. But we need to use some of that CPU and disk IO to perform flushing. The question is how much? How do we pace flushing vs user writes and vice versa?

We can view the collection of memtables as a bucket that is being filled and drained at the same time. User writes are filling the bucket, while flushing drains the bucket. Draining needs to at least keep up with filling or else we'll overfill our bucket (i.e. run out of memory), but we don't want draining to go significantly faster than filling or the system will experience hiccups in fill performance. Ideal steady state is for the fill rate to equal the drain rate. One wrinkle in this analogy is that we can only drain when there is a full memtable. We want to measure our flush progress below the level of a memtable. This can be accomplished by tracking how much of the memtable has been flushed. There are several ways to estimate this, but one is to keep track of how much of the memtable has been iterated over by tracking the number of bytes in the skiplist nodes, keys and values. The total of all of those bytes seen during full iteration should equal the allocated bytes in the memtable's arena. (It looks like an arenaskl.node.allocSize field will need to be added to track this. Perhaps there is another way.)

Back to the bucket draining analogy. We're limiting our draining rate to the fill rate. But if the drain rate is already at maximum and is not keeping up, we need to limit the fill rate. What control mechanism do we use here? Using some sort of rate.Limiter on the fill rate is problematic because something could have caused the draining to completely block, and we don't want to overfill our bucket; dynamically adjusting the rate.Limiter to cope with this seems fraught. Instead, I propose using some sort of quota mechanism. Filling reserves quota and draining (flushing) releases it. Rather than thinking about this as write quota, I think it will be easier to think about this as "dirty bytes" and having a maximum number of dirty bytes allowed in memtables. When the dirty bytes count hits a threshold, user writes have to wait for the flusher to "clean" a portion of a memtable via flushing. As mentioned above, the dirty bytes counts are logical, as memtables are physically allocated in a single block of memory and flushing occurs on an entire memtable. Cleaning of a portion of a memtable during flushing happens as the memtable is iterated over and transformed into an sstable.

The dirty bytes threshold would be set to 110% of the memtable size. When the first memtable is filled it is handed off to flushing and writes can proceed at full speed until the remaining 10% of the dirty bytes quota is exhausted. At that point, user writes are paced to the speed of flushing. As soon as dirty bytes are cleaned by the flusher, blocked user writes would be signaled to proceed.

What about the draining speed? Recall that we want to limit the drain rate to the fill rate. Note that the drain rate is not naturally constrained by the fill rate: without any limiting, flushing should usually be able to "drain" a memtable much faster than user writes can fill a new one. Draining should target a dirty bytes value that is ~100% of a memtable's size. That is, the steady state for user writes and flushing should be one memtable's worth of dirty data. If the flusher falls behind this target it will speed up until it is going full throttle, and eventually user writes will be stopped, though note how this happens on a granular basis. If the flusher is going too fast (e.g. dirty bytes falls below 90% of a memtable's size) it will slow itself to some minimum flush rate (e.g. 2 MB / sec).

There are some additional complexities, such as the handling of flushableBatches which occur when a batch is too large to fit in a memtable. It is likely that we'll want to allow the dirty bytes threshold to be configurable, and sometimes larger than 110% of a memtable's size. For example, there is an interesting configuration where the memtable size is small so that large batches are always transformed to flushableBatches via a sorting step which is faster than adding to a memtable.

To summarize:

  • Flushing targets a dirty bytes threshold of X (at least the size of a memtable) and slows down when the dirty bytes count falls below 90% * X.
  • User writes are paused when dirty bytes exceed 110% * X.
  • A user write to the memtable increases the dirty byte count.
  • The dirty byte count is decreased during flushing when an entry is added to the sstable.
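
To make the bookkeeping concrete, here is a minimal sketch of the dirty-bytes accounting described above. The names (dirtyTracker, fill, drain) and the use of a sync.Cond are illustrative assumptions, not actual Pebble code:

package pacing

import "sync"

// dirtyTracker is a hypothetical sketch of the "dirty bytes" quota described
// above. Writers add dirty bytes and block once the count reaches 110% of the
// memtable size; the flusher subtracts bytes as entries are written to the
// sstable and wakes blocked writers.
type dirtyTracker struct {
	mu        sync.Mutex
	cond      sync.Cond
	dirty     int64 // logical bytes in memtables not yet flushed
	threshold int64 // e.g. 1.1 * memtableSize
}

func newDirtyTracker(memtableSize int64) *dirtyTracker {
	t := &dirtyTracker{threshold: memtableSize + memtableSize/10}
	t.cond.L = &t.mu
	return t
}

// fill is called on the user-write path before applying a batch of n bytes.
// It blocks while the dirty byte count is at or above the threshold.
func (t *dirtyTracker) fill(n int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.dirty >= t.threshold {
		t.cond.Wait() // paced to the speed of flushing
	}
	t.dirty += n
}

// drain is called by the flusher as entries are written to the L0 sstable,
// crediting back the bytes that have been "cleaned".
func (t *dirtyTracker) drain(n int64) {
	t.mu.Lock()
	t.dirty -= n
	t.mu.Unlock()
	t.cond.Broadcast()
}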

@ajkr Would appreciate your thoughts on this. I haven't fully thought through how this would interact with pacing of compactions, but the hazy outline I have in mind is that compaction pacing can be tied to the flush speed. So we'd have compactions pacing flushes which in turn pace user writes.

@petermattis
Collaborator Author

I wrote a little simulator for the above and the interactions seem to work well. Here is a 30s snippet:

filling at 10 MB/sec
_elapsed___chunks____dirty_____fill____drain
      1s        1     20.0     20.0      0.0
      2s        1     30.0     10.0      0.0
      3s        1     40.0     10.0      0.0
      4s        1     50.0     10.0      0.0
      5s        1     60.0     10.0      0.0
      6s        2     63.6     10.0      6.4
      7s        2     64.0     10.0      9.6
filling at 8 MB/sec
      8s        2     64.0      8.0      8.0
      9s        2     64.0      8.0      8.0
     10s        2     64.0      8.0      8.0
     11s        2     64.0      8.0      8.0
     12s        2     64.0      8.0      8.0
     13s        2     64.0      8.0      8.0
     14s        2     64.0      8.0      8.0
filling at 4 MB/sec
     15s        2     64.0      4.0      4.0
     16s        2     64.0      4.0      4.0
     17s        2     64.0      4.0      4.0
     18s        2     64.0      4.0      4.0
     19s        2     64.0      4.0      4.0
     20s        2     64.0      4.0      4.0
_elapsed___chunks____dirty_____fill____drain
     21s        2     64.0      4.0      4.0
filling at 1 MB/sec
     22s        2     61.0      1.0      4.0
     23s        2     58.0      1.0      4.0
     24s        2     55.0      1.0      4.0
     25s        2     52.0      1.0      4.0
     26s        2     49.0      1.0      4.0
     27s        2     46.0      1.0      4.0
     28s        2     43.0      1.0      4.0
     29s        1     44.0      1.0      0.0
     30s        1     45.0      1.0      0.0

There is one fill and one drain goroutine. Every 5-10s the fill goroutine changes the rate at which it is filling (in the range 1-20 MB/sec). The drain goroutine targets 64 MB of dirty data, slowing itself down to a minimum drain rate of 4 MB/sec if it is going too fast.

The code for the above is here.
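
The actual simulator code is the one linked above; purely as illustration, the drain-side rule it describes (target ~64 MB of dirty data, never dropping below a minimum drain rate) might be sketched like this, with all names and constants hypothetical:

package pacingsim

// Illustrative sketch of the drain-side pacing rule; the constants are
// assumptions chosen to match the run shown above.
const (
	targetDirty = 64 << 20  // ~one memtable's worth of dirty data (64 MB)
	minRate     = 4 << 20   // 4 MB/sec floor so draining never stops entirely
	maxRate     = 128 << 20 // full-throttle drain rate (assumed)
)

// drainRate picks how fast to drain given the current dirty byte count and
// the fill rate observed over the last tick.
func drainRate(dirty, fillRate int64) int64 {
	switch {
	case dirty > targetDirty:
		return maxRate // behind the target: drain as fast as possible
	case dirty < targetDirty*9/10:
		return minRate // ahead of the target: slow to the minimum rate
	default:
		if fillRate < minRate {
			return minRate
		}
		return fillRate // steady state: match the fill rate
	}
}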

@petermattis
Collaborator Author

For comparison, here is what the simulation does if you don't pace draining to filling:

filling at 10 MB/sec
_elapsed___chunks____dirty_____fill____drain
      1s        1     20.0     20.0      0.0
      2s        1     30.0     10.0      0.0
      3s        1     40.0     10.0      0.0
      4s        1     50.0     10.0      0.0
      5s        1     60.0     10.0      0.0
      6s        1      6.0     10.0     64.0
      7s        1     16.0     10.0      0.0
filling at 8 MB/sec
      8s        1     24.0      8.0      0.0
      9s        1     32.0      8.0      0.0
     10s        1     40.0      8.0      0.0
     11s        1     48.0      8.0      0.0
     12s        1     56.0      8.0      0.0
     13s        2     64.0      8.0      0.0
filling at 4 MB/sec
     14s        1      8.0      8.0     64.0
     15s        1     12.0      4.0      0.0
     16s        1     16.0      4.0      0.0
     17s        1     20.0      4.0      0.0
     18s        1     24.0      4.0      0.0
     19s        1     28.0      4.0      0.0
     20s        1     32.0      4.0      0.0

petermattis added a commit that referenced this issue Apr 15, 2019
This code was not actively being used and the mechanisms were
fragile. See #7 for the proposed direction for pacing user writes vs
flushes.
@ajkr
Contributor

ajkr commented Apr 16, 2019

I had some rough thoughts when thinking through this, though no concrete suggestions. Overall I like the idea of backpressure making its way from the background to the foreground. It reminds me of the gear-and-spring scheduler you mentioned a while ago where backpressure went bottom-up in the LSM eventually making its way to the memtable writes.

  • Number of L0 files and pending compaction bytes are usually related but treated separately by RocksDB. For example, a case where they aren't related is when a write burst just ended and an L0->base compaction also just finished, at which point pending compaction bytes is high but L0 file count is not. In that situation space-amp could be very high while there's no backpressure on incoming writes. So, I wonder if we should also be estimating pending compaction bytes and backpressuring based on it.
  • I like the idea of L0 file count backpressuring memtable drain, which in turn backpressures memtable fill. It seems like it should naturally accommodate write bursts. Perhaps pending compaction bytes can also backpressure memtable drain - we could drain memtable at the min rate determined by these two mechanisms.
  • Will writers see stable latencies in slowdown mode if they are waiting on bytes to free up from the flush thread? My understanding is flush thread will periodically report 512KB is drained (according to BytesPerSync) at which point 512KB of writes can proceed. But it seems like only the first write of those 512KB is delayed. I wonder if we can spread it over all of them.
  • I wonder if there will still be knobs to turn. For example, one might wish to sacrifice memory for write burst ability by tuning the dirty bytes threshold from 110% to something higher.

@petermattis
Collaborator Author

@ajkr All good points. I've been thinking about how to pace memtable draining to compactions, but still don't have a clear picture of how that will work.

Will writers see stable latencies in slowdown mode if they are waiting on bytes to free up from the flush thread? My understanding is flush thread will periodically report 512KB is drained (according to BytesPerSync) at which point 512KB of writes can proceed. But it seems like only the first write of those 512KB is delayed. I wonder if we can spread it over all of them.

I was imagining that the flush thread increments the drained memtable bytes as each entry is added to the sstable. Yes, there will be a small blip whenever a sync threshold is passed (though the use of sync_file_range on Linux might eliminate that).

I wonder if there will still be knobs to turn. For example, one might wish to sacrifice memory for write burst ability by tuning the dirty bytes threshold from 110% to something higher.

Yeah, the 110% was just something pulled from the air. This would likely be a tunable.

@petermattis
Collaborator Author

The code for the spring&gear scheduler described in the bLSM paper is on github. It is a bit of a mess (IMO). I think the mergeManager::tick is where most of the work happens. Hard to imagine trying to adapt this code as-is. Re-reading the description of the bLSM spring&gear scheduler, it is essentially what I describe above:

We modify the gear scheduler based on the observation that snowshoveling (and more sophisticated partition-based schemes) expose a more natural progress indicator: the fraction of C0 currently in use. The spring and gear scheduler attempts to keep the size of C0 between a low and high water mark. It pauses downstream merges if C0 begins to empty and applies backpressure to the application as C0 fills.

In the bLSM terminology, C0 is the memtable. Hard to tell what the bLSM code is actually doing. It has wonderful stuff like double slp = 0.001 + 5.0 * delta; //0.0015 < slp < 1.112111.., with absolutely no explanation of the constants. I'm sure we could reverse engineer it, though I doubt it is worth the effort.

petermattis added a commit that referenced this issue May 7, 2019
Simulation of filling and draining a bucket in chunks with
bi-directional flow control. If draining is proceeding faster than
filling, then draining is slowed down (but kept at a minimum rate so it
doesn't stop completely). If filling is proceeding faster than draining,
it is slowed down.

The simulation is meant to represent filling and draining of the
memtable. Each memtable is a chunk and draining (flushing) of a memtable
cannot proceed until the memtable is full. The simulation periodically
adjusts the rate of filling and demonstrates the nimble reaction of the
flow control mechanism.

See #7
@luochen01

Hello,

I'm a PhD student from UC Irvine working on LSM-trees. I found this very interesting project via Mark's Small Datum blog. In particular, this issue is very related to what I'm currently working on. Thus, I would like to share some thoughts, which I hope can be mutually helpful.

I have been working on minimizing write stalls of LSM-trees (with various designs) under a given disk bandwidth budget for I/O operations (flushes and merges). It seems that this issue tries to address the other side of my problem: how should the disk bandwidth budget be set for a given user workload? Despite the difference, I believe the two have much in common, and my thoughts on this topic are as follows:

  1. I think bLSM is attacking the wrong problem. The write latency in a real system usually contains two parts: queuing time and processing time. The processing time measures how long it takes to write data into the LSM-tree (memtable), but the queuing time measures how long it takes before a write is served. In most practical systems where latency matters (or open systems as in queueing theory), since the LSM-tree has no control over the write arrival rate, new writes must be queued if they cannot be processed immediately. Similarly, the statement "But it seems like only the first write of those 512KB is delayed. I wonder if we can spread it over all of them." (IMO) is incorrect because all subsequent writes will be delayed (implicitly via queuing) as well. When looking at the write latency and write stalls, it would be important to consider queueing time as well. Under this setting, I found that delaying writes gracefully is actually not helpful for reducing write latencies. Jeff Dean's paper "The Tail at Scale" may also be good material to look at.

  2. For estimating the disk bandwidth budget for a given user workload (a.k.a. rate limiting or I/O throttling), I think a reasonable approach is to use some closed-loop feedback control mechanism. The key idea is that the disk bandwidth budget directly controls the maximum write throughput of an LSM-tree (which is the output of the system). Given a data arrival rate (which is the reference input of the system), we want the maximum write throughput to be slightly (10%-15%?) larger than the data arrival rate so that the background writes can be smooth. The challenge here is that it is not straightforward to measure/estimate the maximum write throughput of an LSM-tree (there are some research papers on this topic, but they may be hard to apply to real workloads due to skew). Thus, one possible solution is to look at the number of SSTables in L0 so that its number is controlled within a reasonable range. I think looking at the memtable level may be fragile since flushes only account for a negligible amount of total disk I/Os and background merges are slightly delayed after flushes. Also, RocksDB has a feature called auto-tuned rate limiter (https://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html), which may be related to this issue.

I hope these thoughts can be helpful, and please kindly correct me if my thoughts are wrong, especially in an industrial system setting such as CockroachDB.

@petermattis
Collaborator Author

Hi @luochen01. Thanks for the thoughts. Glad to hear more attention is being given to avoiding write stalls in LSMs. I think this is an interesting problem. Do you have anything additional you can share about the directions you are exploring?

Similarly, the statement "But it seems like only the first write of those 512KB is delayed. I wonder if we can spread it over all of them." (IMO) is incorrect because all subsequent writes will be delayed (implicitly via queuing) as well.

Yes, all subsequent writes will be delayed. The queueing is actually fairly explicit when committing a write to the LSM (via the write to the WAL), but note that writes to the WAL are batched, while writes to the memtable are concurrent. I see two effects of concurrent flushes and compactions on user write performance:

  1. The flush/compaction consumes IO resources which slows down the WAL write.
  2. The flush/compaction consumes CPU resources which slows down the memtable insert. Interestingly, the memtable insert CPU usage is actually a limiter at high throughput. This can be demonstrated by disabling the WAL and then running a periodic background process that consumes CPU while writing to the DB.

I think a reasonable approach is to use some closed-loop feedback control mechanism

Do you have more specifics on how to structure this? The available disk bandwidth can change unexpectedly. On cloud VMs this can be due to hitting write quota limits. Or it could be due to another process on the machine suddenly slurping up disk bandwidth.

I found that delaying writes gracefully is actually not helpful for reducing write latencies.

The goal isn't to reduce average write latencies, only tail write latencies. I did an experiment that found that rate limiting user writes significantly reduced tail latencies without affecting average latencies. The problem is how to configure that rate limit.

The challenge here is that it is not straightforward to measure/estimate the maximum write throughput of an LSM-tree (there are some research papers on this topic, but they may be hard to apply to real workloads due to skew).

I'm not aware of papers which attempt to estimate maximum write throughput of LSMs. Can you point me towards them? I am aware of papers that attempt to estimate write-amp, but that is a different metric.

Thus, one possible solution is to look at the number of SSTables in L0 so that its number is controlled within a reasonable range.

This is what RocksDB does with its l0_slowdown_writes_threshold and l0_stop_writes_threshold options. Those options are exactly what I'm looking to improve upon. Or am I misunderstanding you?

I think looking at the memtable level may be fragile since flushes only account for a negligible amount of total disk I/Os and background merges are slightly delayed after flushes.

It might not have been clear from my earlier messages on this issue: a full solution here would involve pacing compactions as well. https://github.com/petermattis/pebble/issues/7#issuecomment-480106971 only talks about pacing flushes, which I agree would be insufficient on its own.

@luochen01

luochen01 commented May 17, 2019

Hi @petermattis , thanks for your replies. I'm currently preparing a paper to summarize my results of the write stall problem. I could share it when it's ready.

The flush/compaction consumes IO resources which slows down the WAL write.

This seems to be a wrong configuration. From my experiments (and as a standard industrial practice), WAL should be configured using a dedicated disk. Its periodic disk forces are very harmful to disk throughput (or you may have turned off disk forces at a risk of data loss)

Do you have more specifics on how to structure this? The available disk bandwidth can change unexpectedly. On cloud VMs this can be due to hitting write quota limits. Or it could be due to another process on the machine suddenly slurping up disk bandwidth.

The feedback loop monitors the write rate and estimates the disk bandwidth budget for background I/Os. It can use L0 SSTables as an estimator of the write pressure. For example, it can increase the disk bandwidth budget if L0 SSTables start to pile up and decrease it otherwise. This budget should be applied over a relatively long term. There could be some variance in the instantaneous disk bandwidth, but that could be absorbed over time. Also, one good property of control theory (e.g., an integral control law) is that it can reject disturbances from the environment. I'm not an expert in this area, but it seems that this could be a reasonable approach.
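
For what it's worth, such a feedback loop could be sketched as an integral-style controller that nudges the background I/O budget based on how far the L0 file count is from a target; every name, gain, and bound below is a made-up placeholder, not a concrete proposal:

package pacing

// ioBudgetController is a hypothetical integral-style controller: each tick
// it accumulates the L0-count error into the background I/O budget, so a
// persistent pile-up keeps raising the budget and persistent slack lowers it.
type ioBudgetController struct {
	budget   float64 // bytes/sec allowed for flushes + compactions
	targetL0 float64 // desired number of L0 sstables
	gain     float64 // bytes/sec added per excess L0 file per tick
	min, max float64 // clamp so the budget stays sane
}

func (c *ioBudgetController) tick(l0Files int) float64 {
	err := float64(l0Files) - c.targetL0
	c.budget += c.gain * err
	if c.budget < c.min {
		c.budget = c.min
	}
	if c.budget > c.max {
		c.budget = c.max
	}
	return c.budget
}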

The goal isn't to reduce average write latencies, only tail write latencies. I did an experiment that found that rate limiting user writes significantly reduced tail latencies without affecting average latencies. The problem is how to configure that rate limit.

I doubt this finding. It would be good if you can share your experimental results, including the workload. I'm not sure how you did it, but it is important to decouple the data arrival rate from the processing rate (e.g., clients put data into queue while LSM processes it; or YCSB uses timestamps to implement this feature without explicit queuing, i.e., the intended update time).

I'm not aware of papers which attempt to estimate maximum write throughput of LSMs. Can you point me towards them? I am aware of papers that attempt to estimate write-amp, but that is a different metric.

I'm actually talking about estimating write-amp, and sorry for the confusion! (Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs). Actually, given write-amp and some statistics of records, it should be straightforward to estimate the maximum write throughput (by dividing write throughput by write-amp).

This is what RocksDB does with its l0_slowdown_writes_threshold and l0_stop_writes_threshold options. Those options are exactly what I'm looking to improve upon. Or am I misunderstanding you?
It might not have been clear from my earlier messages on this issue: a full solution here would involve pacing compactions as well. #7 (comment) only talks about pacing flushes, which I agree would be insufficient on its own.

When I read through this issue, I saw many descriptions about how to control flush speed based on dirty bytes, and I was misled a bit on that. Actually, I'm not sure controlling the flush speed is needed. It might be just enough to control the overall disk bandwidth of all I/O operations, which in turn controls the maximum write throughput.

Thanks again for your clarification!

@petermattis
Collaborator Author

This seems to be a wrong configuration. From my experiments (and as a standard industrial practice), WAL should be configured using a dedicated disk. Its periodic disk forces are very harmful to disk throughput (or you may have turned off disk forces at a risk of data loss)

By disk forces I assume you mean disk syncs. No, we do not disable them. Storing the WAL on a separate disk is a good performance practice, yet it isn't always possible. We have to design for situations in which there is only a single disk available. Even if a separate disk is available, we still have to design for when bandwidth to that disk varies unexpectedly.

The feedback loop monitors the write rate and estimates the disk bandwidth budget for background I/Os. It can use L0 SSTables as an estimator of the write pressure. For example, it can increase the disk bandwidth budget if L0 SSTables start to pile up and decrease it otherwise. This budget should be applied over a relatively long term. There could be some variance in the instantaneous disk bandwidth, but that could be absorbed over time. Also, one good property of control theory (e.g., an integral control law) is that it can reject disturbances from the environment. I'm not an expert in this area, but it seems that this could be a reasonable approach.

The problem I see with using L0 sstable count as a signal is that the L0 sstable count jumps around quickly. You could try to smooth this signal, but why not use a signal that varies smoothly in the first place? Using a feedback loop that controls bandwidth feels like trying to perform surgery with mittens on. Perhaps that is an extreme analogy, but my point is that we can have much more granular control.

I doubt this finding. It would be good if you can share your experimental results, including the workload. I'm not sure how you did it, but it is important to decouple the data arrival rate from the processing rate (e.g., clients put data into queue while LSM processes it; or YCSB uses timestamps to implement this feature without explicit queuing, i.e., the intended update time).

It is good to doubt something which doesn't match up with your internal understanding. I do that all the time. Unfortunately, I don't have the results from that experiment around any longer. Yes, the workload I used coordinated arrival and processing times. Clearly rate limiting in such an environment can reduce the tail latencies for the processing times, but it might do so at the expense of throughput. My contention is that the rate limiting actually allowed more overall throughput to be achieved in steady state (though peak throughput was limited). I should note that I was also rate limiting the flush and compaction rate.

When I read through this issue, I saw many descriptions about how to control flush speed based on dirty bytes, and I was misled a bit on that. Actually, I'm not sure controlling the flush speed is needed. It might be just enough to control the overall disk bandwidth of all I/O operations, which in turn controls the maximum write throughput.

If you're imagining that flush speed is controlled by some overall disk bandwidth limit, then I think you are controlling flush speed, just by an indirect mechanism.

@luochen01

The problem I see with using L0 sstable count as a signal is that the L0 sstable count jumps around quickly. You could try to smooth this signal, but why not use a signal that varies smoothly in the first place? Using a feedback loop that controls bandwidth feels like trying to perform surgery with mittens on. Perhaps that is an extreme analogy, but my point is that we can have much more granular control.

To be more accurate, L0 sstables reflect how fast data arrive and how fast data are merged down to the bottom level. I'm not sure why the L0 sstable count jumps around quickly. In a steady workload, the arrival of new L0 sstables should be relatively low, as most of the background I/Os are spent on merges. It is true that the feedback loop approach treats the LSM-tree as a black box and thus better control may be performed by examining the internal structures and data flows of an LSM-tree. However, the advantage of the feedback loop approach is that it is simple to implement and it usually works very well (this approach may have been first introduced to the DBMS community by DB2: http://www.vldb.org/conf/2006/p1081-storm.pdf). I have also pursued the direction of estimating the write rate limit using analytical/simulation approaches for a given disk bandwidth budget to avoid write stalls, but eventually I gave up the effort due to its high complexity. I'm very interested in how this problem can be addressed eventually in CockroachDB.

@ajkr
Contributor

ajkr commented May 17, 2019

I think bLSM is attacking the wrong problem. The write latency in a real system usually contains two parts: queuing time and processing time. The processing time measures how long it takes to write data into the LSM-tree (memtable), but the queuing time measures how long it takes before a write is served. In most practical systems where latency matters (or open systems as in queueing theory), since the LSM-tree has no control over the write arrival rate, new writes must be queued if they cannot be processed immediately. Similarly, the statement "But it seems like only the first write of those 512KB is delayed. I wonder if we can spread it over all of them." (IMO) is incorrect because all subsequent writes will be delayed (implicitly via queuing) as well. When looking at the write latency and write stalls, it would be important to consider queueing time as well.

This makes sense to me. Thanks a lot for the explanation. I guess we should separate the query client logic out of the threads that execute storage engine operations (or use the YCSB trick). This is different from the tools I've worked with before, mainly RocksDB's db_bench.

Under this setting, I found that delaying writes gracefully is actually not helpful for reducing write latencies. Jeff Dean's paper "The Tail at Scale" may also be good material to look at.

Even if it doesn't help write latencies, it may still be worthwhile to delay writes for limiting read-amp and space-amp. Let's say we want some limit on number of sorted runs (number of memtables plus number of L0 files plus number of L1+ levels) to bound read-amp. Would it be just as good to have a hard limit (i.e., stop writes) when we hit the sorted run limit, rather than gradually slowing down as we approach it? I am still thinking about it, but am starting to believe the hard limit alone is a fine choice.

Given a data arrival rate (which is the reference input of the system), we want the maximum write throughput to be slightly (10%-15%?) larger than the data arrival rate so that the background writes can be smooth.

Maybe I misunderstood the terminology here, but I found this surprising. My understanding is data arrival rate measures how fast we are inserting, and maximum write throughput measures total writes to WAL, L0, L1, etc. Write-amp is typically 10+ so I am not sure how an LSM can be sustained by writing at only 1.1-1.15x the insertion rate.

BTW, I enjoyed your LSM survey paper, particularly the part asking authors to evaluate against a well-tuned LSM :). Looking forward to what's next.

@ajkr
Contributor

ajkr commented May 17, 2019

Actually, given write-amp and some statistics of records, it should be straightforward to estimate the maximum write throughput (by dividing write throughput by write-amp)

OK, I think I see what you mean now about maximum write throughput 10-15% higher than data arrival rate. The ideal disk bandwidth budget would then be W-amp * x * data arrival rate for some x in [1.1, 1.15]. I will think a bit more about the choice of using L0 file count to budget disk bandwidth.
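
As a made-up worked example of that formula: with a data arrival rate of 10 MB/sec, an estimated write-amp of 10, and x = 1.1, the budget would be 10 * 1.1 * 10 MB/sec = 110 MB/sec of background write bandwidth.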

@petermattis
Collaborator Author

To be more accurate, L0 sstables reflect how fast data arrive and how fast data are merged down to the bottom level. I'm not sure why the L0 sstable count jumps around quickly.

My proposed memtable "dirty bytes" metric is also an indication of how fast data arrives and how quickly it is getting flushed, but it is a much smoother metric as it gets updated on every user write and incrementally as a memtable gets flushed. The L0 sstables metric is chunky in comparison. It grows and shrinks in units of sstables, which are much larger than user writes. With the default RocksDB compaction options, the L0 sstable count grows from 0->4 sstables, then a merge occurs, and while that is occurring additional flushes happen, so that the number of L0 sstables tends to jump around randomly between 0 and 8.

In a steady workload, the arrival of new L0 sstables should be relatively low, as most of the background I/Os are spent on merges.

This statement makes me wonder if we're on the same page. The arrival of new sstables occurs at a rate proportional to incoming write traffic. For a write heavy workload, that can happen fairly frequently (on the order of a few seconds). I do agree that a majority of background I/Os are spent on compactions/merges.

@petermattis
Collaborator Author

I spent a little time understanding the RocksDB user-write pacing mechanism and thought it would be useful to jot down my notes.

When a batch is committed to RocksDB, part of the "preprocessing" that is performed is to delay the write. This is done in DBImpl::DelayWrite via a call to WriteController::GetDelay. Note that a common case is that no delay is necessary (i.e. no delay token is active).

WriteController::GetDelay uses a token bucket approach to limit user writes to a target bytes-per-sec. The target rate is specified by calling WriteController::GetDelayToken. That method is called from exactly one function: SetupDelay.

SetupDelay is in turn called from RecalculateWriteStallConditions. SetupDelay itself is fairly straightforward. If a delay is needed, there is a max-write-rate (16MB/sec in CockroachDB's configuration) and the current write-rate. SetupDelay sees if progress on compactions is being made by tracking "compaction bytes needed" which is an estimate of the number of bytes that need to be compacted. If compactions are falling behind, the write-rate is decreased by multiplying the current write-rate by 0.8 (this is hardcoded). If compactions are keeping pace (the "compaction bytes needed" is decreasing), the write-rate is increased by 1/0.8 == 1.25.

RecalculateWriteStallConditions is a beast of conditionals checking for various reasons why user-writes should be slowed down. For example:

      // L0 is the last two files from stopping.
      bool near_stop = vstorage->l0_delay_trigger_count() >=
                       mutable_cf_options.level0_stop_writes_trigger - 2;

This is checking to see if the number of L0 files is within 2 of the stop-writes threshold. There are a lot of conditions checked in this method: too many memtables, L0 stop-writes threshold, L0 slowdown-writes threshold, pending compaction bytes threshold, etc.

RecalculateWriteStallConditions is called whenever a new SuperVersion is installed. A new SuperVersion is installed whenever a flush or compaction completes. That means that the pacing of user-writes occurs at the granularity of flush and compaction operations.

The "compaction bytes needed" metric is computed whenever a new Version is created (which occurs when a flush or compaction completes): VersionStorageInfo::EstimateCompactionBytesNeeded. That method makes an estimate of how many bytes will be in a compaction. Interestingly, it also takes into consideration that a compaction from Ln->Ln+1 will trigger a compaction from Ln+1->Ln+2 and includes that compaction in the estimate of the "compaction bytes needed". L0 is handled specially as it is recognized that it will overlap entirely with Lbase.

To summarize all of the above:

  • RocksDB paces user-writes in its commit pipeline.
  • After a flush / compaction, RocksDB updates a "compaction bytes needed" estimate, then recomputes the stall condition.
  • If any of the stall conditions are active, RocksDB computes the bandwidth to allow for user-writes.
  • The bandwidth for user-writes is adjusted up and down on future calls to update the stall condition depending on whether compactions are falling behind or are keeping up.
  • If there are no stall conditions, the write delay is removed.
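
For reference, the rate adjustment described above can be paraphrased in Go roughly as follows. RocksDB's actual implementation is the C++ WriteController/SetupDelay code; only the 0.8, 1.25 and 16 MB/sec figures come from the notes above, and the rest is illustrative:

package pacing

const maxDelayedWriteRate = 16 << 20 // 16 MB/sec, per CockroachDB's configuration

// adjustWriteRate paraphrases the described SetupDelay behavior: after each
// flush or compaction, compare the new "compaction bytes needed" estimate
// against the previous one and scale the delayed user-write rate accordingly.
func adjustWriteRate(rate, prevDebt, curDebt int64) int64 {
	if curDebt > prevDebt {
		// Compactions are falling behind: slow user writes by 0.8x.
		rate = int64(float64(rate) * 0.8)
	} else if curDebt < prevDebt {
		// Compactions are keeping up: speed user writes up by 1/0.8 == 1.25x.
		rate = int64(float64(rate) * 1.25)
	}
	if rate > maxDelayedWriteRate {
		rate = maxDelayedWriteRate
	}
	return rate
}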

@Ryanfsdf
Contributor

After playing around with the simulator, thinking about the RocksDB code, and talking to Andrew, I have a few notes and ideas that I think are worth sharing.

First off, I tried to modify the simulator to model RocksDB. In particular, I tried to implement the same logic as SetupDelay. If pending compaction bytes (compaction debt) is increasing, then the write-rate is decreased by multiplying the current write rate by 0.8. If the compaction debt is decreasing, then the write-rate is increased by multiplying the current write rate by 1.25. However, this mechanism seems to have a flaw. If the compaction debt increases and decreases in a sawtooth-like pattern,

    /|     /|     /|     /|
  /  |   /  |   /  |   /  | 
/    | /    | /    | /    |

then the system would think that compaction debt is increasing even though it is in a steady state. This is because the write rates are adjusted simply on the existence of a compaction debt delta. This does not take into account the actual size of the delta. I spoke to Andrew about this and he confirmed that this issue has been brought up a few times before. Thus, it doesn't seem like we can use this approach in Pebble.

This means that we need a new mechanism for detecting changes in compaction debt, which also takes the actual size of the delta into account. One idea which came up was assigning a "budget" for memtable flushing based on number of bytes and not the rate (bytes/second). If we can estimate the write-amp of the system, then we can allow x bytes to be flushed whenever (w-amp-1) * x bytes are compacted. This approach seems reasonable and makes more sense than the approach RocksDB takes, even without taking the sawtooth pattern into account. It also doesn't require any knobs to turn, and we can assign the budget on a very granular level. One drawback is how to accurately determine the write-amp of the system. Relying on historic write-amp seems reasonable, and we can possibly also determine the worst case write-amp as well.
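
A sketch of that budget idea, with hypothetical names and the write-amp estimate left as an input, could look like the following; it is only meant to pin down the arithmetic, not to propose an API:

package pacing

// flushBudget sketches the byte-budget idea: compactions earn flush credit,
// and the flusher only proceeds while it has credit. With an estimated
// write-amp w, compacting (w-1)*x bytes allows x bytes to be flushed.
type flushBudget struct {
	writeAmp    float64 // estimated write amplification (historic or worst case)
	creditBytes float64 // bytes the flusher is currently allowed to write
}

// compacted is called as compaction output is written.
func (b *flushBudget) compacted(n int64) {
	if b.writeAmp > 1 {
		b.creditBytes += float64(n) / (b.writeAmp - 1)
	}
}

// tryFlush consumes credit for n bytes of flush output, reporting whether the
// flush may proceed now or must wait for more compaction progress.
func (b *flushBudget) tryFlush(n int64) bool {
	if b.creditBytes < float64(n) {
		return false
	}
	b.creditBytes -= float64(n)
	return true
}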

@petermattis Do you have any thoughts on this?

@petermattis
Collaborator Author

Thus, it doesn't seem like we can use this approach in Pebble.

I'm assuming you mean that the RocksDB approach is problematic and we should do better. We could implement this approach and see exactly the same behavior as RocksDB, right?

This means that we need a new mechanism for detecting changes in compaction debt, which also takes the actual size of the delta into account. One idea which came up was assigning a "budget" for memtable flushing based on number of bytes and not the rate (bytes/second). If we can estimate the write-amp of the system, then we can allow x bytes to be flushed whenever (w-amp-1) * x bytes are compacted. This approach seems reasonable and makes more sense than the approach RocksDB takes, even without taking the sawtooth pattern into account. It also doesn't require any knobs to turn, and we can assign the budget on a very granular level. One drawback is how to accurately determine the write-amp of the system. Relying on historic write-amp seems reasonable, and we can possibly also determine the worst case write-amp as well.

Relying on either historic or worst case write-amp seems feasible, though I haven't thought about it thoroughly. (I do my best thinking about these heuristics in the car). Rather than historic or worst case estimates of write-amp, we might be able to make a direct estimate based on the current sizes of the levels. Maybe. If I squint that seems possible, though I don't have a concrete proposal for how to do that.

Something also to document here is what I've mentioned to both @Ryanfsdf and @ajkr in person: we should be able to have compaction debt adjust smoothly rather than the chunky adjustment that is done now via updates whenever a flush or compaction finishes. What I'm imagining is that we track how far we have iterated through the input compaction tables and subtract that size from the input levels, and add the size of the new output tables to the output level. If compaction debt is adjusted at a fine granularity like that I'm imagining we could tie flush rate to it so that the flush rate adjustments are all smooth.
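
A minimal sketch of that incremental accounting, with hypothetical names (this is not how Pebble tracks level sizes today):

package pacing

// debtEstimator sketches smooth compaction-debt accounting: instead of
// updating level sizes only when a compaction finishes, adjust the estimates
// incrementally as the compaction iterates its inputs and emits outputs. The
// compaction-debt estimate can then be recomputed from these level sizes at a
// fine granularity.
type debtEstimator struct {
	levelBytes []int64 // estimated bytes per level; sized by the caller
}

// compactionProgress is invoked periodically during a compaction from
// level -> level+1: inputConsumed is how many additional input bytes have been
// iterated over, outputWritten how many output bytes were produced.
func (d *debtEstimator) compactionProgress(level int, inputConsumed, outputWritten int64) {
	d.levelBytes[level] -= inputConsumed
	d.levelBytes[level+1] += outputWritten
}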

@Ryanfsdf
Contributor

I'm assuming you mean that the RocksDB approach is problematic and we should do better. We could implement this approach and see exactly the same behavior as RocksDB, right?

Yes, we should expect to see the same behavior. However, I'm skeptical about the correctness of this approach. There are cases when overall compaction debt is increasing but the system would think it's decreasing, and vice versa. That means user writes may be throttled when compaction debt is decreasing and sped up when compaction debt is increasing, contrary to what we want.

Relying on either historic or worst case write-amp seems feasible, though I haven't thought about it thoroughly. (I do my best thinking about these heuristics in the car). Rather than historic or worst case estimates of write-amp, we might be able to make a direct estimate based on the current sizes of the levels. Maybe. If I squint that seems possible, though I don't have a concrete proposal for how to do that.

Something also to document here is what I've mentioned to both @Ryanfsdf and @ajkr in person: we should be able to have compaction debt adjust smoothly rather than the chunky adjustment that is done now via updates whenever a flush or compaction finishes. What I'm imagining is that we track how far we have iterated through the input compaction tables and subtract that size from the input levels, and add the size of the new output tables to the output level. If compaction debt is adjusted at a fine granularity like that I'm imagining we could tie flush rate to it so that the flush rate adjustments are all smooth.

With the smooth compaction debt adjustment, we can set the "target" compaction debt to be between a low and high watermark (i.e., size_of_memtable/2 <= compaction_debt < size_of_memtable). Any time we cross the high watermark, we would throttle flushing. And any time we cross the low watermark, we would throttle compactions. This is essentially the same as how we'll handle user writes with memtable flushing. This seems like the ideal approach. Is this similar to what you've had in mind before?
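
As a sketch, the watermark check could look like the following, mirroring the dirty-bytes scheme; the watermarks come from the parenthetical above and everything else is illustrative:

package pacing

// pace sketches the low/high watermark idea: the smoothly-updated compaction
// debt decides whether to throttle flushing or compactions.
func pace(compactionDebt, memtableSize int64) (throttleFlush, throttleCompaction bool) {
	low, high := memtableSize/2, memtableSize
	switch {
	case compactionDebt >= high:
		throttleFlush = true // debt too large: slow the producer (flushes)
	case compactionDebt < low:
		throttleCompaction = true // debt too small: slow the consumer (compactions)
	}
	return throttleFlush, throttleCompaction
}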

@petermattis
Collaborator Author

This is essentially the same as how we'll handle user writes with memtable flushing. This seems like the ideal approach. Is this similar to what you've had in mind before?

Yes, though I hadn't seen it fully spelled out until now.

PS Be sure to keep around the adjustments you've made to the simulator to model RocksDB's heuristics. We'll want to use that as evidence that this new pacing mechanism is an improvement.
