rfc: disaggregated shared storage #70419

sumeerbhola · 2021-09-20T14:22:44Z

This RFC motivates and outlines a design to share the files of a
log-structured merge tree across nodes, for potential cost
reduction and higher storage density. More importantly it allows
for cheap rebalancing, full backup/restore, and branching of data.

Release note: None

cockroach-teamcity · 2021-09-20T14:22:51Z

This change is

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 24 at r1 (raw file):

significant amount of cold state (not often being touched by reads or
writes), since it reduces their cost. We believe that there are OLTP
contexts where older state is cold e.g. old orders in order

I skimmed the RFC and I don't see an acknowledgement of the fact that a piece of data being part of the higher levels of the LSM does not mean that it's not being read frequently. It seems to me that, in order to implement tiered storage, our caching of frequently-read data needs to be damn good.

I always assumed that the way we'd start doing tiered storage would be the more traditional way which other databases offer - allow different storage classes to be assigned to different table partitions. That's one of the main things zone configs were originally supposed to support (but of course we never actually used them for that). The recent introduction of table-level partitioning (as opposed to the traditional index-level partitioning) I think makes it much more realistic to use zone configs for specifying a partition-level storage option. Of course, working at the level of partitions wouldn't achieve your objective of making all ranges more nimble and easier to move around. But it would increase a cluster's data density.

docs/RFCS/20210914_disaggregated_storage.md, line 722 at r1 (raw file):

this cache:

- Which node is responsible for caching it: We embed the node number

My instinct is to separate the cache into a new service, separate from CRDB nodes. The simplicity of having a single binary and a flat deployment story with nodes that do everything is not important in the managed service land.

petermattis

Interesting read. I'm still digesting this, though I've left some initial comments below.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 544 at r1 (raw file):

years.

Let us assume that this highest seqnum is N -- it may be the cyrrent

Nit: s/cyrrent/current/g

docs/RFCS/20210914_disaggregated_storage.md, line 585 at r1 (raw file):

for an sstable could be equal to the number of keys in that sstable.

#### Solution

I'm a bit confused by this section, possibly because I'm not understanding how these foreign sstables are imported. With sstable ingestion we assign a global seqnum to the records within the sstable at ingestion time. This is cleverly done by storing the global seqnum in the MANIFEST, not the ingested sstable. Don't we need to do something similar with these foreign sstables? Sstable ingestion gets away with a single global seqnum for the records in an sstable, but I think it is fine if we need a few seqnums as you note below. What I'm confused by is the reference to M? Is that the current seqnum of the LSM that is being imported into? I think it must be, otherwise importing a foreign sstable could rewrite history of the LSM.

docs/RFCS/20210914_disaggregated_storage.md, line 642 at r1 (raw file):

  sstables in N1's LSM that overlap with R. Note that these files will
  not be deleted while the time bounded ref exists. The ref also
  prevents deletion of memtables and local sstables for that version.

This sounds similar to the implicit snapshot created by iterators that prevent deletion of sstables that are being iterated over.

docs/RFCS/20210914_disaggregated_storage.md, line 670 at r1 (raw file):

    below the SINGLEDEL can together be converted to a SETWITHDEL.

- N2 imports the shared sstables for which it has acquired

Is this import operation similar to an ingest except that it is placing key span constraints on the imported table?

docs/RFCS/20210914_disaggregated_storage.md, line 709 at r1 (raw file):

use an empty reference file with the name of the reference holder in
the reference file name. A file with n references would result in n+1
files, the original containing the data, and n reference files.

I worked on a previous distributed system which used empty files as reference markers. One challenge we might experience here is what the semantics around files are in S3 or GCS. After creating an empty reference file, when are we guaranteed it exists?

docs/RFCS/20210914_disaggregated_storage.md, line 744 at r1 (raw file):

An LSM with shared sstables may want to cache some part of all these
sstables. For example, caching the footer can avoid repeated reading

The RocksDB secondary cache does the type of local caching. We could look to it for inspiration. I don't recall the caching being particularly complex.

jbowens

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 330 at r1 (raw file):

approach. There is a manifest maintained in the local-store that is
incrementally updated as new *versions* of the LSM are installed. The
manifest refers to two kinds of sstbles (as expected):

nit: s/sstbles/sstables/

docs/RFCS/20210914_disaggregated_storage.md, line 421 at r1 (raw file):
One fortunate implication extending from this unfortunate implication, is that we will no longer suffer the write amplification from compacting range-removal range tombstones that cannot drop data due to LSM snapshots (eg, cockroachdb/pebble#872).

Do you have any ideas on how to prevent misuse of LSM snapshots given this change in their semantics?

A foreign sstable can have multiple versions of the same key because
of an LSM snapshot in the LSM where it was a native sstable.

Can we disallow LSM snapshots from preserving keys in shared sstables? Down below you talk about preventing deletion of memtables and local sstables on the source node during rebalancing. It seems like that implicit iterator snapshot could be used to eliminate use of LSM snapshots within CockroachDB when configured to share sstables.

docs/RFCS/20210914_disaggregated_storage.md, line 516 at r1 (raw file):

The full story for reads is somewhat more complicated. The
`InternalIterator` tree that contains the heart of the logic for
iteration does exposes the seqnums for each point. The iterator that

nit: s/does exposes/does expose/

docs/RFCS/20210914_disaggregated_storage.md, line 632 at r1 (raw file):

virtual file.

## Initiliazing range state at different node

nit: s/Initiliazing/Initializing/

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 23 at r1 (raw file):

Quoted 6 lines of code…

The maximum recommended storage capacity per node is
[2.5TB](https://www.cockroachlabs.com/docs/stable/recommended-production-settings.html#storage). Users
would like the storage capacity to be much higher, especially with
significant amount of cold state (not often being touched by reads or
writes), since it reduces their cost. We believe that there are OLTP
contexts where older state is cold e.g. old orders in order

My sense is that we need to unpack the underlying reasons for this maximum density. My sense is that it is much more about the in-memory and processing overhead of the numbers of replicas rather than the bytes on disk per node.

I empathize with the rebalancing latency goals but I don't see the direct tie-in to increasing density.

Assuming that number of replicas per core is dominant factor for per-core density, then it feels like a project to increase density needs to be focused on a project to increase range sizes. I appreciate that this RFC attacks a key problem with increasing range sizes which is that rebalancing of large ranges gets slower as the ranges get larger.

I worry that there are other problems with actually achieving larger ranges. My hunch is that many tenants will be small. In multi-region, it's even worse. I suspect that many tenants will be small and will require a number of ranges. If a large percentage of replicas correspond to small amounts of data, then this RFC is not going to help, really, at all. I put forward #65726 as a potential approach to mitigate the propagation of numerous small ranges within a tenant.

If our biggest threat model is the case that tenants are very cold, would a better solution be to fully pop their state out of the system into s3 and remove the raft group?

docs/RFCS/20210914_disaggregated_storage.md, line 77 at r1 (raw file):

Our current approach of placing the state machine state on external
replicated block devices, like EBS or GCP PD, means that there is a
multiplier effect on number of copies of data: N range replicas * M

nit: add a \* here to escape the * which leads to everything being italicized after this point.

nvanbenschoten

Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @jordanlewis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 23 at r1 (raw file):

then it feels like a project to increase density needs to be focused on a project to increase range sizes

I don't know that I agree with this statement quite as it is phrased, but appreciate the larger point you're bringing up here. If increasing data density is a goal then there may be more direct ways to achieve this, at least for the first 2-4x increase in density, and maybe even beyond.

My understanding matches yours that the maximum density is more a function of the per-range CPU overhead than it is the number of bytes on disk. This was the case back when ranges maxed out at 64MB and are likely still the case now that ranges have a max size of 512MB. What I haven't followed is why that 8x increase to range size did not translate to an 8x increase in maximum recommended data size per node. If I recall correctly, we were recommending 1-2TB/node even before the 8x increase. It has also always been suspect that CPU overhead is cited as the limiting factor, but then we document an unscaled maximum capacity per node, instead of per vCPU.

Maybe the right question to ask is whether we have a comprehensive understanding of the density limits that we're hitting today. I remember hearing that we were going to investigate this a few months ago, but don't recall hearing about any findings since. Including some of those findings here would either point to low-hanging fruit or help better motivate the architectural changes proposed in this RFC.

docs/RFCS/20210914_disaggregated_storage.md, line 24 at r1 (raw file):

I skimmed the RFC and I don't see an acknowledgement of the fact that a piece of data being part of the higher levels of the LSM does not mean that it's not being read frequently.

This also tripped me up. If a sequential scan in an LSM needs to merge across all levels, then the assumption of temporal locality that a hierarchical caching strategy depends on falls apart. But perhaps this is rescued by the relative frequency of churn in each LSM level. Is the block cache more effective for lower-level SSTs because they get replaced less frequently? Or maybe the temporary locality is still there, but only as it relates to the frequency of access during background compactions.

Also, it's worth pointing out that thanks to @jordanlewis's work in #61583, we do issue point reads these days, so bloom filters are now usable in more cases.

docs/RFCS/20210914_disaggregated_storage.md, line 29 at r1 (raw file):

translate to wider adoption. A combination of lower cost storage for
range replica state, and increase in storage capacity per node, would
achieve lower costs.

Even if you don't go into detail, consider referencing cloud pricing somewhere in this doc to give a rough idea of the relative costs of different cloud storage alternatives. It wasn't clear to me until after a bit of research that "lower cost" translates to roughly a 75% savings when comparing SSD-backed EBS with S3. Though I imagine there are other costs that just cost per byte at rest. Still, that's useful context for readers.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @jordanlewis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 24 at r1 (raw file):
My reading of the below [O1] is not about the "block cache" but about caching SSTs at lower layers which are accessed frequently in the local EBS filesystem.

[O1] Transparently improve availability for non-cold data by caching
it in block storage. Users would need to be aware that this
transparent caching cannot make perfect decisions on what to cache,
so there could be occassional drops in availability.

This has the words "block" and "cache" but I don't think it's saying that the only caching in this strategy would be the thing we call the "block cache". My sense is that it's saying we could store whole warm SSTs locally and thus have exactly the perf we have today without worrying that an individual "block cache" cache miss would mean we have to go to the blob store.

bdarnell

Very interesting idea. I remember there were some discussions along these lines a few years ago but I can't find the email thread (or issue or doc) except for a brief mention in the "September 2019 Core Product Council Notes" (which you should be able to find in the internal google groups).

At the time the discussion focused on whether we could reduce the operating costs of CockroachCloud by reducing the replication factor and using more "cloud-native" storage systems instead of writing to a filesystem on a network-attached block device. Instead of building something into rocksdb/pebble we were talking about replacing the entire storage layer (such as having an LSM per range with the WAL stored in something like AWS kinesis or SQS and SSTables in S3).

Other related work is that rocksdb has an HDFS-backed mode and rockset has a rocksdb-cloud library that is backed by s3 and others. I don't think these do the shared-sst ingestion that you're proposing but they might be useful references for how to robustly use s3 and friends, how to do caching, etc.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @jordanlewis, @nvanbenschoten, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 23 at r1 (raw file):

It has also always been suspect that CPU overhead is cited as the limiting factor, but then we document an unscaled maximum capacity per node, instead of per vCPU.

The old recommendation didn't scale with vCPU count because I think we found some store-level mutex to be the limiting factor at the time. I'd hope we've fixed that since then so our next investigation of data density limits should not start with the assumption that the limit is per-node instead of per-vcpu.

Maybe the right question to ask is whether we have a comprehensive understanding of the density limits that we're hitting today.

I do not, but I tend to agree that without further evidence this isn't what I'd pursue if the primary goal is to increase data density per node.

docs/RFCS/20210914_disaggregated_storage.md, line 55 at r1 (raw file):

certain that this approach will actually be cheaper.

One concern with this approach is that while the durability SLO for

I don't see anything in this doc about the consistency properties of shared storage (e.g. if write A happens before write B is it possible to read B without seeing A?). These services don't have the same properties as a posix filesystem; do they work the way we need them?

S3 and GCS (etc) have very similar interfaces, but different implementations under the hood. Are there any differences in their SLAs, performance, or consistency guarantees that we need to think about here?

docs/RFCS/20210914_disaggregated_storage.md, line 129 at r1 (raw file):

## Design constraints and prerequisites

Given the lower availability SLO for shared storage, we must not place

This feels like a significant limitation to me - this is what prevents us from using this work to reduce the replication factor. I'd look at rockset-cloud to see if they have any good ideas for this. One possibility is to use something other than s3 for the WAL - would e.g. SQS or kinesis have the right properties for this?

andreimatei

Very interesting idea. I remember there were some discussions along these lines a few years ago but I can't find the email thread (or issue or doc) except for a brief mention in the "September 2019 Core Product Council Notes" (which you should be able to find in the internal google groups).
At the time the discussion focused on whether we could reduce the operating costs of CockroachCloud by reducing the replication factor and using more "cloud-native" storage systems instead of writing to a filesystem on a network-attached block device.

Perhaps you're thinking of this thread?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @jordanlewis, @nvanbenschoten, and @sumeerbhola)

nvanbenschoten

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, @jordanlewis, @nvanbenschoten, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 679 at r1 (raw file):

We have already guaranteed that N2 has no data for R when it dropped the range (or never had any of it in its history).

Is this considering the case where a follower replica falls behind in the Raft log and needs a snapshot to catch back up? In such cases, it will have data in its LSM (in both local and shared levels). So in those cases, do we need to clear out the keyspace first? If so then we'll eventually need to talk about how to make that atomic with the ingestion of the rest of the snapshot. We had to jump through some hoops (e.g. allowing range del tombstones in ingested ssts) to achieve that with today's design. But we could also explore other schemes that include more blunt coordination between foreground traffic (e.g. follower reads) and raft snapshots above the level of the LSM if we can't get the LSM updates to be entirely atomic. Atomicity also helps with untimely crashes midway through a snapshot, which is another consideration.

This raises another question that I have that comes out of the "Dropping range state from a node" section. I think you might have answered it elsewhere, but I'm not sure. We talk about writing "point deletes or range deletes to delete data" in L0-L4 when dropping a range from a node, and then unlinking the virtual sstables from L5 and L6. If we're writing these deletions to the top of the LSM, then they won't immediately clear out the keyspace in L0-L4 unless we force compactions all the way through these levels. So how can we make assumptions about an empty keyspace later on when re-adding an overlapping range? And how can we ensure that these deletions from the previous incarnation of the range don't shadow the shared sstables in L5 and L6 when the range is re-added? Is the solution to this somewhere in the "Manufacturing sequence numbers for reads" section?

docs/RFCS/20210914_disaggregated_storage.md, line 726 at r1 (raw file):

  name. By default, that is the node responsible for it. If the node
  is dead, or the node requesting the file from the cache is in a
  different region, consistent hashing using the nodes in the region

I'd be interested in hearing more about the interaction between shared sstables and region boundaries. I imagine that any shared sstable would be scoped to a single region and that the underlying block storage would be configured to use regional buckets. How does this interact with rebalancing between replicas across region boundaries?

tbg · 2021-09-23T09:38:15Z

There's also this thread: Socrates/Rockset storage architecture brain dump <https://groups.google.com/a/cockroachlabs.com/g/kv/c/yP8civi0qmU/m/bJp72MrxAQAJ> which also has some Slack links.

…

On Thu, Sep 23, 2021 at 7:48 AM Nathan VanBenschoten < ***@***.***> wrote: ***@***.**** commented on this pull request. *Reviewable <https://reviewable.io/reviews/cockroachdb/cockroach/70419>* status: [image:

] complete! 0 of 0 LGTMs obtained (waiting on @ajwerner <https://github.com/ajwerner>, @itsbilal <https://github.com/itsbilal>, @jordanlewis <https://github.com/jordanlewis>, @nvanbenschoten <https://github.com/nvanbenschoten>, and @sumeerbhola <https://github.com/sumeerbhola>) ------------------------------ *docs/RFCS/20210914_disaggregated_storage.md, line 679 at r1 <https://reviewable.io/reviews/cockroachdb/cockroach/70419#-MkG4woD9IlGV5UV7bGG:-MkG4woD9IlGV5UV7bGH:b-wdviv6> (raw file <https://github.com/cockroachdb/cockroach/blob/c7743cfe99a820a2736aaa114c174ee6f73b5a85/docs/RFCS/20210914_disaggregated_storage.md#L679>):* We have already guaranteed that N2 has no data for R when it dropped the range (or never had any of it in its history). Is this considering the case where a follower replica falls behind in the Raft log and needs a snapshot to catch back up? In such cases, it will have data in its LSM (in both local and shared levels). So in those cases, do we need to clear out the keyspace first? If so then we'll eventually need to talk about how to make that atomic with the ingestion of the rest of the snapshot. We had to jump through some hoops (e.g. allowing range del tombstones in ingested ssts) to achieve that with today's design. But we could also explore other schemes that include more blunt coordination between foreground traffic (e.g. follower reads) and raft snapshots above the level of the LSM if we can't get the LSM updates to be entirely atomic. Atomicity also helps with untimely crashes midway through a snapshot, which is another consideration. This raises another question that I have that comes out of the "Dropping range state from a node" section. I think you might have answered it elsewhere, but I'm not sure. We talk about writing "point deletes or range deletes to delete data" in L0-L4 when dropping a range from a node, and then unlinking the virtual sstables from L5 and L6. If we're writing these deletions to the top of the LSM, then they won't immediately clear out the keyspace in L0-L4 unless we force compactions all the way through these levels. So how can we make assumptions about an empty keyspace later on when re-adding an overlapping range? And how can we ensure that these deletions from the previous incarnation of the range don't shadow the shared sstables in L5 and L6 when the range is re-added? Is the solution to this somewhere in the "Manufacturing sequence numbers for reads" section? ------------------------------ *docs/RFCS/20210914_disaggregated_storage.md, line 726 at r1 <https://reviewable.io/reviews/cockroachdb/cockroach/70419#-MkG97yH-Z4RX_dnhjyj:-MkG97yH-Z4RX_dnhjyk:b-o47i67> (raw file <https://github.com/cockroachdb/cockroach/blob/c7743cfe99a820a2736aaa114c174ee6f73b5a85/docs/RFCS/20210914_disaggregated_storage.md#L726>):* name. By default, that is the node responsible for it. If the node is dead, or the node requesting the file from the cache is in a different region, consistent hashing using the nodes in the region I'd be interested in hearing more about the interaction between shared sstables and region boundaries. I imagine that any shared sstable would be scoped to a single region and that the underlying block storage would be configured to use regional buckets. How does this interact with rebalancing between replicas across region boundaries? — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#70419 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABGXPZDK47GO3NIQMRST3PTUDK5TPANCNFSM5EMCOFWQ> .

bdarnell · 2021-09-23T18:08:39Z

There's also this thread: Socrates/Rockset storage architecture brain dump
https://groups.google.com/a/cockroachlabs.com/g/kv/c/yP8civi0qmU/m/bJp72MrxAQAJ
which also has some Slack links.

Ah, that's the one I was thinking of (I found the one Andrei linked to in my searches, and it's also relevant, but I didn't participate in that one).

sumeerbhola

@bdarnell I added a TODO for Rockset and some links in the alternatives section.
I couldn't find much on RocksDB and HDFS other than facebook/rocksdb#5988

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 23 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It has also always been suspect that CPU overhead is cited as the limiting factor, but then we document an unscaled maximum capacity per node, instead of per vCPU.

The old recommendation didn't scale with vCPU count because I think we found some store-level mutex to be the limiting factor at the time. I'd hope we've fixed that since then so our next investigation of data density limits should not start with the assumption that the limit is per-node instead of per-vcpu.

Maybe the right question to ask is whether we have a comprehensive understanding of the density limits that we're hitting today.

I do not, but I tend to agree that without further evidence this isn't what I'd pursue if the primary goal is to increase data density per node.

I've added a few sentences stating that this is meant to be complimentary to proposals that remove constraints that currently force range boundaries at particular points (tenant transitions, regional by row tables when the region changes).

docs/RFCS/20210914_disaggregated_storage.md, line 24 at r1 (raw file):

I don't see an acknowledgement of the fact that a piece of data being part of the higher levels of the LSM does not mean that it's not being read frequently.

I've added some text to the caching section explicitly mentioning this. There was never meant to be an assumption.

This also tripped me up. If a sequential scan in an LSM needs to merge across all levels, then the assumption of temporal locality that a hierarchical caching strategy depends on falls apart.

The caching strategy is not intended to rely on temporal locality for 2 reasons: (a) as you mention everything is a scan from the LSM's perspective, since different MVCC versions of the same key are different keys for the LSM, and the LSM invariants cannot prevent a@10 from sinking down to L6 while a@2 stays at a higher level, (b) the latest version of some MVCC key could have been written a while ago. Related to (a), point reads from CockroachDB's perspective are not point reads from the LSM's perspective -- they are still a scan, but with the benefit of bloom filters.

The assumption being made is that parts of the key space are hot, so we have spatial locality. I've added a mention of this in the caching section.

I am not sure I understand the "hierarchical caching strategy" comment. The hierarchy in terms of placing L5, L6 on shared store is not because we expect them to be only cold data, but because we pay off a lot of write amplification in the local storage, and we can get 99+% of the data to be shareable.

My reading of the below [O1] is not about the "block cache" but about caching SSTs at lower layers which are accessed frequently in the local EBS filesystem.

Yes, that was the intention. It could also be local SSDs (I've tweaked the sentence).

docs/RFCS/20210914_disaggregated_storage.md, line 29 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Even if you don't go into detail, consider referencing cloud pricing somewhere in this doc to give a rough idea of the relative costs of different cloud storage alternatives. It wasn't clear to me until after a bit of research that "lower cost" translates to roughly a 75% savings when comparing SSD-backed EBS with S3. Though I imagine there are other costs that just cost per byte at rest. Still, that's useful context for readers.

I've added some references. I happened to look at GCS versus SSD PD which was $0.02 and $0.170, so the former is 12% of the latter.

docs/RFCS/20210914_disaggregated_storage.md, line 55 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't see anything in this doc about the consistency properties of shared storage (e.g. if write A happens before write B is it possible to read B without seeing A?). These services don't have the same properties as a posix filesystem; do they work the way we need them?

S3 and GCS (etc) have very similar interfaces, but different implementations under the hood. Are there any differences in their SLAs, performance, or consistency guarantees that we need to think about here?

Regarding consistency they provide strong consistency for what we need https://aws.amazon.com/s3/consistency/ and https://cloud.google.com/storage/docs/consistency. I've added some text in the next section.
The performance aspects will need to be understood with some experimentation.

docs/RFCS/20210914_disaggregated_storage.md, line 77 at r1 (raw file):

Previously, ajwerner wrote…

nit: add a \* here to escape the * which leads to everything being italicized after this point.

Done

docs/RFCS/20210914_disaggregated_storage.md, line 129 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This feels like a significant limitation to me - this is what prevents us from using this work to reduce the replication factor. I'd look at rockset-cloud to see if they have any good ideas for this. One possibility is to use something other than s3 for the WAL - would e.g. SQS or kinesis have the right properties for this?

Regarding the WAL, we're assuming there is no WAL for the LSM storing the state machine, by moving the raft log to a separate Pebble db that is using only local storage. If we had to have a WAL, we would put in on local storage like we are doing for sstables in L0-L4.

docs/RFCS/20210914_disaggregated_storage.md, line 330 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/sstbles/sstables/

Done

docs/RFCS/20210914_disaggregated_storage.md, line 421 at r1 (raw file):

Can we disallow LSM snapshots from preserving keys in shared sstables?

One possibility I had considered was to make LSM snapshots point to explicit shared sstables. The problem is that we can't mix shared sstables from an old version with local sstables from the most recent version.

Down below you talk about preventing deletion of memtables and local sstables on the source node during rebalancing. It seems like that implicit iterator snapshot could be used to eliminate use of LSM snapshots within CockroachDB when configured to share sstables.

Yes, we could stop using snapshots in favor of iterators, like we do in most of CockroachDB. I'm guessing that the main reason we sometimes use snapshots is that we are worried about OOD situations caused by holding iterators open for longer. If most of the data is in shared storage maybe this becomes less of a concern. I've added some text here.

docs/RFCS/20210914_disaggregated_storage.md, line 516 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/does exposes/does expose/

Done

docs/RFCS/20210914_disaggregated_storage.md, line 544 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: s/cyrrent/current/g

Done

docs/RFCS/20210914_disaggregated_storage.md, line 632 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/Initiliazing/Initializing/

Done

docs/RFCS/20210914_disaggregated_storage.md, line 642 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This sounds similar to the implicit snapshot created by iterators that prevent deletion of sstables that are being iterated over.

Yes, it is exactly that because it will be turned into an iterator below.

docs/RFCS/20210914_disaggregated_storage.md, line 670 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is this import operation similar to an ingest except that it is placing key span constraints on the imported table?

Not quite. They are placed in the same shared levels as in N1. The data that is copies is subject to ingestion.

docs/RFCS/20210914_disaggregated_storage.md, line 709 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I worked on a previous distributed system which used empty files as reference markers. One challenge we might experience here is what the semantics around files are in S3 or GCS. After creating an empty reference file, when are we guaranteed it exists?

Good point. They both claim strong consistency for reads after writes, including metadata operations https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel and https://cloud.google.com/storage/docs/consistency

sumeerbhola

TFTRs! I still have some comments left to address.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

This RFC motivates and outlines a design to share the files of a log-structured merge tree across nodes, for potential cost reduction and higher storage density. More importantly it allows for cheap rebalancing, full backup/restore, and branching of data. Release note: None

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, and @petermattis)

docs/RFCS/20210914_disaggregated_storage.md, line 585 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm a bit confused by this section, possibly because I'm not understanding how these foreign sstables are imported. With sstable ingestion we assign a global seqnum to the records within the sstable at ingestion time. This is cleverly done by storing the global seqnum in the MANIFEST, not the ingested sstable. Don't we need to do something similar with these foreign sstables? Sstable ingestion gets away with a single global seqnum for the records in an sstable, but I think it is fine if we need a few seqnums as you note below. What I'm confused by is the reference to M? Is that the current seqnum of the LSM that is being imported into? I think it must be, otherwise importing a foreign sstable could rewrite history of the LSM.

I realized this was confusing and I've added some text where M was introduced. This is not the seqnum of the LSM. It is a fixed number (M=3 below). So, yes we are rewriting history, which this design considers acceptable since LSMs (modulo snapshots) are not really meant for tracking history.

I did consider two other options, by giving the foreign sstable a seqnum similar to what we do for ingest and

allow the foreign sstable to be at a higher level than L5 or L6. Now shared sstables could be scattered across any level and I stared having a hard time with this design.
violate the LSM seqnum invariant by pushing the foreign shared sstable down to L5 or L6 despite the fact that there may be lower seqnums above it. This unfortunately means we always need to compare seqnums across levels, which I think will complicate iteration code.

docs/RFCS/20210914_disaggregated_storage.md, line 679 at r1 (raw file):

Is this considering the case where a follower replica falls behind in the Raft log and needs a snapshot to catch back up? In such cases, it will have data in its LSM (in both local and shared levels). So in those cases, do we need to clear out the keyspace first?

Yes, it will need to clear out the keyspace. This is fine wrt concurrent evaluations, since those concurrent evaluations are operating with an iterator and seqnum that is not affected (files and memtable are fixed for an iterator).

So how can we make assumptions about an empty keyspace later on when re-adding an overlapping range? And how can we ensure that these deletions from the previous incarnation of the range don't shadow the shared sstables in L5 and L6 when the range is re-added?

Thanks for pointing this out! I completely overlooked this (now obvious) issue despite having thought through the fact that there would be data in L0-L4 when re-adding the range which is why we can't use vanilla import to slide these foreign sstables into L5 and L6 (related to the other question by @petermattis). I've been considering a few alternatives, including using the "virtual sstables" idea for L0-L4 too, where there would be an externally imposed key span constraint on the existing local sstables that overlap with the range being re-added. But I'm worried that this will get complicated when the same file over time collects many different key span constraints. And there could be also be old deleted data sitting in the memtable, so we'd need to flush the memtable.

An alternative, which doesn't require memtable flushing is to introduce a new LocalRangeIgnore[k1,k2)#seq range operation, which is additionally tracked in the manifest (tracked in manifest since it also needs to apply when compacting L4 and L5 files). Semantically, it tells the LSM that every iterator should ignore any points or ranges overlapping with [k1, k2) with seqnum < seq and that exist in the local levels.

LocalRangeIgnore would be written to the memtable, and will fall down to the bottom of the local levels and eventually be elided. It would also prevents seqnum zeroing of things later than it, like happens with range tombstones. It could also get fragmented like current range tombstones. A completing compaction would tell the manifest what spans and seqnums of LocalRangeIgnores it elided so that they can be gc'ed from the manifest.
More details for the write path:

add to memtable and WAL. Also add to DB level data structure of LocalRangeIgnore, such that seqnum publishing also makes it visible.
when memtable being flushed and discover this operation, add it to sstables and keep it in the compaction result-state in-memory. The compaction result-state tells the version edit to add it to the manifest file. So when the flush is applied and memtable disappears, it is present in persistent manifest state as global state and in sstables as LSM state. Same is done when WAL is replayed on open, since there could be state in WAL that has not been flushed.
compactions to L4 will compute LocalRangeIgnores to elide and will tell the version edit. It will edit the DB-level LocalRangeIngore data-structure at version installation time, and update the manifest with a subtraction of this span.

@jbowens

I've also been thinking about whether this is getting too complicated, and what can be done to simplify. My current impression is that this is complicated but manageable, and the only simplification avenue is an LSM per range. My concern there is that with an LSM per range we'd be entering very uncharted territory -- one can imagine many small tenants with say ~50MB of total data in a single range, and giving such a tenant a couple of dedicated sstables at L5 and L6 is one thing, and giving them a whole LSM is another. The chief issue with the latter being tiny sstables at L0-L4 and memtables (though we could possibly share the memtable), and higher write amplification in L0-L4.

docs/RFCS/20210914_disaggregated_storage.md, line 722 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

My instinct is to separate the cache into a new service, separate from CRDB nodes. The simplicity of having a single binary and a flat deployment story with nodes that do everything is not important in the managed service land.

I've added a paragraph at the end of this section listing this as an option.

docs/RFCS/20210914_disaggregated_storage.md, line 726 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'd be interested in hearing more about the interaction between shared sstables and region boundaries. I imagine that any shared sstable would be scoped to a single region and that the underlying block storage would be configured to use regional buckets. How does this interact with rebalancing between replicas across region boundaries?

Yes, with single region buckets, rebalancing across region boundaries would involve copying. I've added a section after this one -- GCS does offer multi-region capabilities with a modest byte cost difference and no network costs, so when possible we could use that.

docs/RFCS/20210914_disaggregated_storage.md, line 744 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The RocksDB secondary cache does the type of local caching. We could look to it for inspiration. I don't recall the caching being particularly complex.

I agree that it is very simple in that it makes eviction decisions for the whole cache file, so it doesn't need to track gaps that are freed that could be reused. I've added a paragraph.

itsbilal

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, and @petermattis)

docs/RFCS/20210914_disaggregated_storage.md, line 679 at r1 (raw file):

Previously, sumeerbhola wrote…

Is this considering the case where a follower replica falls behind in the Raft log and needs a snapshot to catch back up? In such cases, it will have data in its LSM (in both local and shared levels). So in those cases, do we need to clear out the keyspace first?

Yes, it will need to clear out the keyspace. This is fine wrt concurrent evaluations, since those concurrent evaluations are operating with an iterator and seqnum that is not affected (files and memtable are fixed for an iterator).

So how can we make assumptions about an empty keyspace later on when re-adding an overlapping range? And how can we ensure that these deletions from the previous incarnation of the range don't shadow the shared sstables in L5 and L6 when the range is re-added?

Thanks for pointing this out! I completely overlooked this (now obvious) issue despite having thought through the fact that there would be data in L0-L4 when re-adding the range which is why we can't use vanilla import to slide these foreign sstables into L5 and L6 (related to the other question by @petermattis). I've been considering a few alternatives, including using the "virtual sstables" idea for L0-L4 too, where there would be an externally imposed key span constraint on the existing local sstables that overlap with the range being re-added. But I'm worried that this will get complicated when the same file over time collects many different key span constraints. And there could be also be old deleted data sitting in the memtable, so we'd need to flush the memtable.

An alternative, which doesn't require memtable flushing is to introduce a new LocalRangeIgnore[k1,k2)#seq range operation, which is additionally tracked in the manifest (tracked in manifest since it also needs to apply when compacting L4 and L5 files). Semantically, it tells the LSM that every iterator should ignore any points or ranges overlapping with [k1, k2) with seqnum < seq and that exist in the local levels.

LocalRangeIgnore would be written to the memtable, and will fall down to the bottom of the local levels and eventually be elided. It would also prevents seqnum zeroing of things later than it, like happens with range tombstones. It could also get fragmented like current range tombstones. A completing compaction would tell the manifest what spans and seqnums of LocalRangeIgnores it elided so that they can be gc'ed from the manifest.
More details for the write path:

add to memtable and WAL. Also add to DB level data structure of LocalRangeIgnore, such that seqnum publishing also makes it visible.

when memtable being flushed and discover this operation, add it to sstables and keep it in the compaction result-state in-memory. The compaction result-state tells the version edit to add it to the manifest file. So when the flush is applied and memtable disappears, it is present in persistent manifest state as global state and in sstables as LSM state. Same is done when WAL is replayed on open, since there could be state in WAL that has not been flushed.

compactions to L4 will compute LocalRangeIgnores to elide and will tell the version edit. It will edit the DB-level LocalRangeIngore data-structure at version installation time, and update the manifest with a subtraction of this span.

@jbowens

I've also been thinking about whether this is getting too complicated, and what can be done to simplify. My current impression is that this is complicated but manageable, and the only simplification avenue is an LSM per range. My concern there is that with an LSM per range we'd be entering very uncharted territory -- one can imagine many small tenants with say ~50MB of total data in a single range, and giving such a tenant a couple of dedicated sstables at L5 and L6 is one thing, and giving them a whole LSM is another. The chief issue with the latter being tiny sstables at L0-L4 and memtables (though we could possibly share the memtable), and higher write amplification in L0-L4.

I like the LocalRangeIgnore suggestion and vastly prefer it as a manageable complexity to having one LSM per range. We've optimized ourselves around large LSMs from our onset and there's a lot that'll get thrown out the window and replaced with unknown unknowns if we make a switch that drastic in the avg LSM size.

But a slightly different question: is the raft-follower-catchup scenario the only scenario where we have to consider the "ingestion on a non-empty key range" case? Do we want to do something special for imports too so they could take advantage of shared sstables, or should we just have them continue ingesting through local levels and just pay the increased write-amp?

sumeerbhola · 2021-10-19T17:58:54Z

But a slightly different question: is the raft-follower-catchup scenario the only scenario where we have to consider the "ingestion on a non-empty key range" case?

Also the scenario where a range is removed and added back. It is possible that the local data preceding the removal has not yet been fully compacted away.

Do we want to do something special for imports too so they could take advantage of shared sstables, or should we just have them continue ingesting through local levels and just pay the increased write-amp?

Is this imports in the sense of bulk-io operations that build indexes, restore from backup etc.? Backup restore should be able to do something similar with shared files. I haven't thought about building indexes -- if they are large and we can build them fully with large shared files then probably worth doing it (we wouldn't want the small files that we currently sometimes build because the new secondary index has an ordering that is very different from the primary).

itsbilal · 2021-11-29T19:59:58Z

I partitioned the work in this RFC into three parts, and prototyped parts 1 and 2 in these branches (https://github.com/itsbilal/cockroach/tree/disagg-storage-prototype and https://github.com/itsbilal/pebble/tree/disagg-storage-wip ):

Adding support for a second vfs.FS, called a SharedFS internally, that gets used to store L5 and L6 files. SharedFS is expected to be slower for reads/writes than the main FS. My prototype has ingestions and compactions into L5/6 write to localFS first, then move the file to sharedFS before completing the compaction/ingestion.
Adding a configurable persistent cache, that copies files from sharedFS to local FS once a sufficient number of bytes have been read from them. The sstable.Reader checks if a file is cached in the local FS before reading from the appropriate file handle.
Everything else required to make sstables shareable and rebalances cheap: virtual sstables, sequence number changes, and tracking of shared sstables. I did not prototype any of these.

I was able to reuse most of cloud.ExternalStorage and give it a vfs.FS-like API so it could be used within Pebble as the sharedFS. I added some optimizations, such as backing writers with a bufio.Writer, writing a version of ExternalSSTReader that was concurrent-safe (as Pebble could issue concurrent ReadAt calls), that also did a best-effort conversion of short reads into longer reads with a small per-file cache. Some of the constants around here could be tweaked with more experimentation, but this got me good-enough performance.
My prototype of the persistent cache implemented an LRU cache for simplicity, and didn't persist any cache state, so files would have to be redownloaded after node restart. For my experiments that was sufficient.

With that prototype, I ran TPCC on a bunch of configurations. I deduced that roughly, each c5d.4xlarge could sustain these amounts of replicated data:

On S3 as the disaggregated store: ~40GB (replicated) of frequently-read/updated TPCC data with a 10gb persistent cache on instance store (actual disk utilization was closer to 12gb, of which 10gb was the persistent cache and ~1-2gb was higher levels of LSM)
On EBS GP3 volumes, no disaggregated store: ~50GB per node of replicated TPCC data (disk utilization was ~50GB)
On instance store, EC2 c5d.4xlarge: ~60GB per node of replicated TPCC data

Disaggregated storage paired with a fairly rudimentary persistent cache gets us pretty close to the same read/write TPCC performance as pure EBS/instance store with a fraction of the local disk utilization. As a result, theoretically a lot more data could be addressable under the disaggregated storage scenario than in the non-disaggregated storage one per node, at an additional cost of $0.02 per GB (S3 standard) vs. $0.08 per GB for EBS gp3 (slightly more for io2), both of which are well below the cost for additional nodes with more instance storage.

However, going in that direction might require us to evaluate all our existing bottlenecks around node densities and see if we can exceed our usual 2.5TB per node recommendation. @ajwerner I believe tested densities as high as 4.5TB per node last year, and since then we've only optimized Pebble more for large imports and metadata handling. Plus, if we fully implement disaggregated storage, we'll also get cheaper rebalances and maybe even imports, easing one of the bottlenecks he identified (I believe).

Here's a hypothetical example assuming we are able to achieve that: we have 10TB of replicated, mostly cold data, 100GB of which is frequently read/updated. A cluster with 3 c5.4xlarge nodes and 4TB EBS gp3 volumes per node would be approximately $500/mo more expensive than having the same nodes but with 40gb EBS gp3 volumes (for the cache and higher levels) and storing the rest in S3 using disaggregated storage. My estimates are all on this spreadsheet: https://docs.google.com/spreadsheets/d/1z2Uq9ReoQo1PBzjztMCTobvkfHavQAZ9r2S4QBhdehY/edit?usp=sharing

Obviously, if the alternative is to pay for more compute that sits around just so we get more local storage, that'll be significantly more expensive than either example. But that aside, the cost differential seems to be pretty small unless we're talking about 10TB+ db sizes that are also mostly cold data.

Going forward I'll be connecting with @mwang1026 and others to scope out the scale of workloads we're talking about, to figure out the cost/viability of going forward with this project.

sumeerbhola · 2021-11-29T22:13:48Z

It's great to see these results!

we have 10TB of replicated, mostly cold data, 100GB of which is frequently read/updated. A cluster with 3 c5.4xlarge nodes and 4TB EBS gp3 volumes per node would be approximately $500/mo more expensive than having the same nodes but with 40gb EBS gp3 volumes (for the cache and higher levels) and storing the rest in S3 using disaggregated storage.

Is this 10TB the aggregate bytes after replication, which is why there are 4TB gp3 volumes?

And how does this ratio of cold/total bytes relate to what you observe in TPCC? Based on the 40GB per node with 10GB persistent cache, is the cold/total bytes ratio for TPCC equal to 0.25?

itsbilal · 2021-11-30T16:07:01Z

Is this 10TB the aggregate bytes after replication, which is why there are 4TB gp3 volumes?

Yes, this is correct. 10TB after replication. I use that same convention throughout my reply, referring to storage used after replication, not before.

And how does this ratio of cold/total bytes relate to what you observe in TPCC? Based on the 40GB per node with 10GB persistent cache, is the cold/total bytes ratio for TPCC equal to 0.25?

The 40GB per node x 3 node thing was just to give a storage usage translation for 2000 TPC-C warehouses. In the scenario where I ran TPC-C 2k with a 10gb persistent cache, all of the data was actually hot as all 2000 of the imported warehouses were active. The cache saw quite a bit of evictions, but we were able to sustain a reasonable latency for all queries in the 2k workload with just a 10gb persistent cache (but not any smaller). I don't know of a workload that lets us try out different access patterns/frequencies, so my only choice with TPCC was to assume some data is very cold (inactive warehouses) and the rest is hot. Maybe I could layer two YCSB workloads one beside the other or something.

petermattis · 2021-11-30T16:18:16Z

Ditto on the wonderfulness of these results.

I don't know of a workload that lets us try out different access patterns/frequencies, so my only choice with TPCC was to assume some data is very cold (inactive warehouses) and the rest is hot. Maybe I could layer two YCSB workloads one beside the other or something.

TPCC active data scales with the number of active warehouses. And TPCC throughput is limited by the number of active warehouses. YCSB might be more representative of a storage-layer test as it is less sensitive to higher layers in the stack (KV, and SQL) and also gives us a skewed distribution by default. What did you mean by "layer two YCSB workloads one beside the other"? Wouldn't one YCSB workload be sufficient to demonstrate the impact of the caching?

itsbilal · 2021-12-02T16:20:20Z

YCSB might be more representative of a storage-layer test as it is less sensitive to higher layers in the stack (KV, and SQL) and also gives us a skewed distribution by default. What did you mean by "layer two YCSB workloads one beside the other"? Wouldn't one YCSB workload be sufficient to demonstrate the impact of the caching?

Ah right, I forgot most YCSB workloads have zipfian request key distributions instead of uniform, so one workload should suffice. Thanks for bringing that up! I'll test YCSB-B on my prototype and see how that goes.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 315 at r3 (raw file):

We adopt the following simplifications:

- If node N1 has an LSM containing range R and node N2 desires to use

Jumping off from the internal conversation here. Have you explored the portion of the design space where instead of having two nodes coordinate in a pairwise manner over shared SSTs we coordinate the shared sstables among all of the replicas and use replicated state to coordinate some or all of the sequence mapping?

It seems ideal to have a minimum number of replicas performing the writes to the blob store and, as a goal, to have to have just one copy of data per region in the blob store. An additional goal, which would be amazing, is that adding new follower replicas should not require a full snapshot of the entire range. This does not seem to be handled by the

Another modest source of complexity with this proposal is that right now our rebalancing protocol doesn't involve communication between N1 and N2 unless N1 is the leaseholder.

Most of the time, all replicas of a range are available. It seems okay to me that we'd stop off-loading data to the object store when one or more replicas is unavailable. Ideally we'd work to make re-replication happen in a timely manner. The core of the idea is that instead of two nodes coordinating only when they need to transfer data between each other, we'd have the leaseholder coordinate with all replicas when it wants to offload data into the blob store. It could use RPCs to set up the appropriate state and then use a replicated command to "activate" the cut over?

I realize I'm waving my hands rather dramatically. After reading this RFC again, it's clear that the coordination during transfer and ingestion is made simple by leveraging the fact "that N2 has no data for R when it dropped the range (or never had any of it in its history).". I do see that if we wanted to coordinate a cut-over we'd need to deal with the fact that all replicas are actively compacting data which overlaps with the data being extracted to the blob store.

docs/RFCS/20210914_disaggregated_storage.md, line 778 at r3 (raw file):

### Distributed caching of whole shared sstables

We do not want more than one copy per region. There are two aspects to

It feels to me like we might instead want no more than one copy per AZ. Cross AZ data transfer within a region is still expensive.

ajstorm

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 791 at r3 (raw file):

- When is the cache populated:
  - We need to have a policy that determines whether a cache miss (or

As part of this I would think we might also need to consider how full table scans will impact the cache. Full table scans are already really bad for performance, but they'll be far worse if they also have the side effect of trashing the cache by evicting a large number of otherwise hotter blocks (especially in cases where the total table size exceeds the cache size). I've seen approaches in the past where a hint is provided with a full table scan to not cache at all, as it's known that full caching will have negative consequences to concurrent queries. Or, perhaps this is what you meant by "set of cache misses"?

Code quote:

  - We need to have a policy that determines whether a cache miss (or
    set of cache misses) will cause the file to be added to the
    cache. We can possibly repurpose caching policies used in block

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md, line 778 at r3 (raw file):

Previously, ajwerner wrote…

It feels to me like we might instead want no more than one copy per AZ. Cross AZ data transfer within a region is still expensive.

Agreed. @itsbilal corrected my misreading of the language on https://aws.amazon.com/ec2/pricing/on-demand/. And I checked GCP and it has the same price of $0.01/GB https://cloud.google.com/vpc/network-pricing for intra-region.

docs/RFCS/20210914_disaggregated_storage.md, line 791 at r3 (raw file):

Previously, ajstorm (Adam Storm) wrote…

As part of this I would think we might also need to consider how full table scans will impact the cache. Full table scans are already really bad for performance, but they'll be far worse if they also have the side effect of trashing the cache by evicting a large number of otherwise hotter blocks (especially in cases where the total table size exceeds the cache size). I've seen approaches in the past where a hint is provided with a full table scan to not cache at all, as it's known that full caching will have negative consequences to concurrent queries. Or, perhaps this is what you meant by "set of cache misses"?

Agreed that we always need a "scan resistant" cache.
There are algorithms that explicitly or implicitly track reuse distance, instead of recency, and use that for deciding on replacement (including Pebble's implementation of clock-pro for the block and table caches), and are therefore categorized as "scan resistant". So ideally we should not need a hint.

nvanbenschoten · 2022-01-10T16:24:53Z

A medium-term alternative originally proposed by @lidorcarmel that warrants further discussion is the use of HDD block storage in place of either SSD block storage or object storage.

One of the two primary benefits of this RFC is to reduce data storage costs. As currently proposed, the RFC accomplishes this by storing the bulk of a cluster's data in object storage. Object storage is able to provide cheaper data storage in large part because it uses throughput-optimized HDDs behind the scenes. However, clouds also provide access to latency-optimized HDDs in the form of block storage. This form of storage has many of the benefits of object storage, but with fewer of the downsides.

Compared to SSD block storage, HDDs are about 4 times cheaper per byte. Counter-intuitively, Lidor mentioned that HDD block storage should also have comparable write performance to SSD block storage under most workloads because writes don't touch disk before an acknowledgment for either class of storage (citation needed). The larger impact on performance will be on random-access reads, where HDD block storage will have higher latency by a factor of 2 or 3.

Compared to object storage, HDDs are orders of magnitude faster to read from and write to (citation needed). They are also easier to work with — they look identically to the SSD block storage volumes we are used to using. Block storage misses out on some of the other benefits of object storage presented in this RFC (e.g. faster, cheaper rebalancing), but my reading is that these benefits are more speculative at this point.

So this all feels like an alternative worth exploring. In its simplest form, we may be able to achieve a cost savings of about 75% on data storage by switching clusters that are not performance sensitive (e.g. free-tier) over from SSD to HDD block storage. Once we understand the performance characteristics of the two storage classes more clearly, a hybrid architecture may emerge. For instance, if read latency takes the biggest hit, it may make sense to layer @itsbilal's persistent cache (backed by an SSD PD or even an "ephemeral" local SSD) from itsbilal/pebble@c467ffd on top of an HDD backed LSM. Or if write performance isn't quite on par for HDDs, it could make sense to keep the WAL and top few levels on the LSM on SSD PD but move the bottom levels of the LSM to HDD PD.

It feels to me like we might instead want no more than one copy per AZ. Cross AZ data transfer within a region is still expensive.

How much of the cost-benefit of this proposal is eroded by the need for fast cached copies in each region? What about if we need to store a copy per AZ? What fraction of a cluster's data do we expect to be cached at any point in time? What will the impact of cache misses be to tail latency?

shralex

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, and @sumeerbhola)

a discussion (no related file):
I wanted to suggest making disaggregated storage a table-level setting, and add other settings, like memory-only. I'm not sure what the right granularity for such a setting, perhaps we can look at access patterns to decide on that. This would be similar to bigtable locality groups where you could say whether it's stored in memory or on disk, we could similarly say “this is a memory-only table” or disk, flash, disagregated etc. For example, for time-series data / other observability data we need to power dashboards, an in-memory table could provide the performance we'd need, and losing durability would probably be ok.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, @shralex, and @sumeerbhola)

a discussion (no related file):

Previously, shralex wrote…

I wanted to suggest making disaggregated storage a table-level setting, and add other settings, like memory-only. I'm not sure what the right granularity for such a setting, perhaps we can look at access patterns to decide on that. This would be similar to bigtable locality groups where you could say whether it's stored in memory or on disk, we could similarly say “this is a memory-only table” or disk, flash, disagregated etc. For example, for time-series data / other observability data we need to power dashboards, an in-memory table could provide the performance we'd need, and losing durability would probably be ok.

Weaker durability is something cockroach struggles with because of our lack of tooling to deal with lost portions of the keyspace. There's some theoretical problems with loss of durability due to its implications for correctness for transactions which spanned the different durability tables. We could side-step that by not allowing such transactions.

Regarding different cost storage mediums for different tables, we sort of have a solution for that already. The solution we'd pitch to customers is to create different stores and label them with the medium. Then, use zone configurations to constrain tables to the relevant storage medium. https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#stricter-replication-for-a-table-and-its-secondary-indexes walks through this example.

itsbilal · 2022-01-28T02:55:24Z

I tested out my prototype of disaggregated storage with a persistent cache, on two different setups: the reference setup from earlier with a 10gb persistent cache on c5d.4xlarge local ssds and a slower tier on s3 standard. The second setup is the one @nvanbenschoten suggested: 10gb persistent cache on c5d.4xlarge local ssds again, but with the lower tier being st1 (throughput optimized HDD) EBS volumes.

Cost wise, it's important to remember that S3 standard is $0.023 per GB/month while EBS st1 is $0.045 per GB/month - about twice the price. Cold HDDs are $0.015 per GB/month, but it looks like it's meant for cold, infrequently-accessed data and would not yield us interesting results so I haven't tested it yet.. So if the goal is to reduce storage costs, s3 standard or cold HDDs are the way to go. If the goal is to also reduce replication costs, s3 standard also wins out there as it's got free data transfer from any AZ in that region.

The performance results are interesting. The HDD setup edged out the S3 standard setup for imports, because that's all sequential IO work. But that's really the only place where it was significantly faster than S3.

On TPC-C, our reference s3 standard + local ssd cluster continued to do just over 2000 active warehouses as before. The st1 HDD EBS volumes + local ssd cluster was slower at warming up its cache, but once it had gotten there it pushed 1000 active warehouses but had a lot of difficulty exceeding that without significantly increasing the persistent cache size (and therefore amount on local ssd). This makes sense, TPCC uniformly distributes load across the active warehouses so if a lot of data was being randomly read from the HDD layer, it ended up losing out to S3's faster random reads despite the greater network hops required for S3.

Performance in YCSB-land was similar, with a more obviously higher tail latency due to the skewed YCSB key distribution. While I saw latency spikes on the s3 cluster as it's more prone to network blips and congestion (AWS guarantees some network bandwidth for EBS), it still edged out the HDD-backed EBS cluster. Here's ycsbA (50% reads, 50% updates):

s3:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0       12193782        10161.5     26.5      3.5     54.5    100.7  64424.5  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0         641490          534.6     25.4      5.2     62.9    104.9  64424.5  update

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1200.0s        0       12835272        10696.1     26.5      3.5     54.5    100.7  64424.5
 
ebs with hdd:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0         173112          144.3   1722.4      0.8   4831.8  62277.0 103079.2  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0           8985            7.5   1614.3      1.4   4831.8  55834.6 103079.2  update

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1200.0s        0         182097          151.7   1717.1      0.8   4831.8  62277.0 103079.2

So lower median latencies, but significantly fewer ops/sec and worse tail latencies on EBS with hdd. It seems like the worse random read performance with HDDs is holding that setup back.

The story is similar for ycsb-B (95% reads, 5% updates):

s3:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0       12193782        10161.5     26.5      3.5     54.5    100.7  64424.5  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0         641490          534.6     25.4      5.2     62.9    104.9  64424.5  update

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1200.0s        0       12835272        10696.1     26.5      3.5     54.5    100.7  64424.5

ebs with HDD:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0         173112          144.3   1722.4      0.8   4831.8  62277.0 103079.2  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1200.0s        0           8985            7.5   1614.3      1.4   4831.8  55834.6 103079.2  update

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1200.0s        0         182097          151.7   1717.1      0.8   4831.8  62277.0 103079.2

In both cases, I had imported 100 million YCSB keys (~120GB replicated), and were running this command:

./workload run ycsb --workload {A,B} --concurrency=288 --record-count=30000000 --ramp=10m --duration=20m {pgurl:1-3}

Interestingly enough, if I had just imported 30m YCSB keys instead of 100m and run the exact same command, the performance was much faster for both clusters, and one where the HDD setup was much closer or in some runs surpassing the S3 ones (I'll need to re-run this setup to confirm this, so don't read too much into this). This probably shows a limitation in having a file-based persistent cache like in my prototype; a block-based persistent cache similar to RocksDB might be better at cache utilization due to only a small number of blocks in each file being hot.

All in all, seeing as we can still achieve a modest reduction in storage costs by implementing the most basic parts of disaggregated storage and using either an HDD EBS volume or an S3 bucket as the slower tier of storage, before we even do the hard work for faster/cheaper rebalances, it might be worthwhile to go ahead with it. The cheaper rebalances will be S3/GCS specific as they require the second tier to be shared, but that can come later.

joshimhoff

This is very cool!

If we go with HDD block storage it doesn't change the read side availability story, from what I can tell.

If we go with S3 / GCS, as mentioned in this doc, it does change change the read side availability story. I see discussion of this in the RFC, e.g. the parts about caching. I don't feel tho the discussion in the RFC so far makes a clear case for how exactly we will leverage a 99.9% available regional service (https://cloud.google.com/storage/sla) to create a 99.99% CC regional service (we likely will soon push the CC SLA for 99.95% to 99.99%). If I understand https://cloud.google.com/compute/sla correctly, the key difference from block storage is that the latter is a zonal service, and cloud providers make a 99.99% commitment for the availability of multiple instances running across different zones, with a down minute being (roughly) defined as one where instances in multiple zones are down. This formalizes the idea that zones should fail independently. S3 / GCS have no such concept in their SLAs; expected availability is 99.9% period.

The doc also notes this it is not a detailed design! But still.... anyone else share my anxiety? Maybe @bdarnell based on a previous comment?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, @shralex, and @sumeerbhola)

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, @shralex, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md line 166 at r3 (raw file):

excessive operation costs, so it is desirable to pay some of the write
amplification cost in local-store. In theory, an LSM which has all
levels L0-L6 populated with files has 60x write amplification. Levels

how do you get to 60x?

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @andreimatei, @bdarnell, @itsbilal, @jbowens, @jordanlewis, @nvanbenschoten, @petermattis, @shralex, and @sumeerbhola)

docs/RFCS/20210914_disaggregated_storage.md line 179 at r3 (raw file):

For the LSM that is split across local-store and shared-store, we
assume the absence of a WAL. That is, its full state consists only of

Can you spell out more why the absence of the WAL is related to this project?

sumeerbhola requested review from itsbilal, jbowens, petermattis and a team September 20, 2021 14:22

sumeerbhola requested a review from a team as a code owner September 20, 2021 14:22

andreimatei reviewed Sep 20, 2021

View reviewed changes

petermattis reviewed Sep 21, 2021

View reviewed changes

jbowens reviewed Sep 21, 2021

View reviewed changes

ajwerner reviewed Sep 21, 2021

View reviewed changes

nvanbenschoten reviewed Sep 21, 2021

View reviewed changes

ajwerner reviewed Sep 21, 2021

View reviewed changes

bdarnell reviewed Sep 22, 2021

View reviewed changes

andreimatei reviewed Sep 22, 2021

View reviewed changes

nvanbenschoten reviewed Sep 23, 2021

View reviewed changes

sumeerbhola force-pushed the disaggregate branch from c7743cf to d86cd6e Compare October 11, 2021 16:14

sumeerbhola commented Oct 11, 2021

View reviewed changes

sumeerbhola force-pushed the disaggregate branch from d86cd6e to efb2775 Compare October 11, 2021 20:18

sumeerbhola commented Oct 11, 2021

View reviewed changes

itsbilal reviewed Oct 18, 2021

View reviewed changes

ajwerner reviewed Dec 16, 2021

View reviewed changes

ajstorm requested review from ajwerner, andreimatei, bdarnell, itsbilal, jbowens, jordanlewis, nvanbenschoten and petermattis January 3, 2022 18:48

ajstorm reviewed Jan 3, 2022

View reviewed changes

ajstorm self-requested a review January 3, 2022 18:48

sumeerbhola commented Jan 4, 2022

View reviewed changes

knz added the X-noremind Bots won't notify about PRs with X-noremind label Jan 10, 2022

shralex reviewed Jan 24, 2022

View reviewed changes

ajwerner reviewed Jan 24, 2022

View reviewed changes

joshimhoff requested review from ajwerner and shralex March 9, 2022 21:14

joshimhoff reviewed Mar 9, 2022

View reviewed changes

sumeerbhola mentioned this pull request May 4, 2022

db: split physical sstables into virtual sstables at ingest time cockroachdb/pebble#1683

Closed

tbg mentioned this pull request May 17, 2022

kv: prioritize snapshots to leaseholders, then voters, then learners #80817

Open

andreimatei reviewed Jul 22, 2022

View reviewed changes

nvanbenschoten removed their request for review September 27, 2022 19:39

cockroachdb deleted a comment from amruss Oct 14, 2022

itsbilal mentioned this pull request Oct 21, 2024

storage: complete the disaggregated storage RFC #133085

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfc: disaggregated shared storage #70419

rfc: disaggregated shared storage #70419

sumeerbhola commented Sep 20, 2021

cockroach-teamcity commented Sep 20, 2021

andreimatei left a comment

petermattis left a comment

jbowens left a comment

ajwerner left a comment

nvanbenschoten left a comment

ajwerner left a comment

bdarnell left a comment

andreimatei left a comment

nvanbenschoten left a comment

tbg commented Sep 23, 2021 via email

bdarnell commented Sep 23, 2021

sumeerbhola left a comment

sumeerbhola left a comment

sumeerbhola left a comment

itsbilal left a comment

sumeerbhola commented Oct 19, 2021

itsbilal commented Nov 29, 2021 •

edited

Loading

sumeerbhola commented Nov 29, 2021

itsbilal commented Nov 30, 2021

petermattis commented Nov 30, 2021

itsbilal commented Dec 2, 2021

ajwerner left a comment

ajstorm left a comment

sumeerbhola left a comment

nvanbenschoten commented Jan 10, 2022 •

edited

Loading

shralex left a comment

ajwerner left a comment

itsbilal commented Jan 28, 2022 •

edited

Loading

joshimhoff left a comment

andreimatei left a comment

andreimatei left a comment

rfc: disaggregated shared storage #70419

Are you sure you want to change the base?

rfc: disaggregated shared storage #70419

Conversation

sumeerbhola commented Sep 20, 2021

cockroach-teamcity commented Sep 20, 2021

andreimatei left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

bdarnell left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

tbg commented Sep 23, 2021 via email

bdarnell commented Sep 23, 2021

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

sumeerbhola commented Oct 19, 2021

itsbilal commented Nov 29, 2021 • edited Loading

sumeerbhola commented Nov 29, 2021

itsbilal commented Nov 30, 2021

petermattis commented Nov 30, 2021

itsbilal commented Dec 2, 2021

ajwerner left a comment

Choose a reason for hiding this comment

ajstorm left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Jan 10, 2022 • edited Loading

shralex left a comment

Choose a reason for hiding this comment

ajwerner left a comment

Choose a reason for hiding this comment

itsbilal commented Jan 28, 2022 • edited Loading

joshimhoff left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

itsbilal commented Nov 29, 2021 •

edited

Loading

nvanbenschoten commented Jan 10, 2022 •

edited

Loading

itsbilal commented Jan 28, 2022 •

edited

Loading