mini-rfc: Non-primary column families #42038

Open
bdarnell opened this issue Oct 30, 2019 · 1 comment
Labels
C-investigation Further steps needed to qualify. C-label will change. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@bdarnell
Contributor

bdarnell commented Oct 30, 2019

This idea was inspired by brainstorming around partitioning and the effort to close the gap between primary and secondary indexes (#41989). I don't have a concrete use case for this yet, so I'm just writing it up briefly to see if there's any interest in pursuing it further.

Column families allow the primary index (and currently only the primary index) to be divided into multiple KV pairs. This has two main benefits: fine-grained latching for reduced contention (useful at least for YCSB), and reduction in write amplification (especially if there are infrequently-updated blob columns). However, it's kind of a complex and subtle special case for something so rarely used. I propose replacing this with a generalization of storing indexes.

In the new model, a table with two column families would have two indexes instead of a single primary key. Each of these indexes would have the same key columns, but "store" a different subset of the table's columns. This would change the constructed key from /$TABLE/1/$PK/$FAMILY to /$TABLE/$INDEX/$PK/0, and place columns from different families far apart from each other. This means that single-row operations are no longer guaranteed to be single-range, which is a downside if you often operate on the entire row, but could be a benefit if you usually operate on parts of the row at a time (which is exactly when column families make sense). The benefit would be especially useful in the "blob" use case, since the non-blob column family would be denser with real data. A "free" side effect is that column families would become targets for zone configs, so you could store your blobs on cheaper storage (and maybe this could be a step towards column-level security that goes all the way through the KV layer).
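To make the key-layout change concrete, here is a rough sketch that models the two key shapes as plain tuples and sorts them the way an ordered KV store would. This is only an illustration: the table ID `53`, the index IDs, and the helper names are made up, and real CockroachDB keys use a binary encoding rather than tuples.

```python
# Hypothetical model of the two key layouts; tuples stand in for encoded keys.

def family_key(table, pk, family):
    # Current layout: /$TABLE/1/$PK/$FAMILY (everything in the primary index, ID 1).
    return (table, 1, pk, family)

def index_key(table, index, pk):
    # Proposed layout: /$TABLE/$INDEX/$PK/0 (one index per former column family).
    return (table, index, pk, 0)

def positions(sorted_keys, pk):
    # Where the KV pairs for one row land in sorted key order (PK is slot 2).
    return [i for i, key in enumerate(sorted_keys) if key[2] == pk]

TABLE = 53          # made-up table ID
PKS = [1, 2, 3]     # three rows

current = sorted(family_key(TABLE, pk, fam) for pk in PKS for fam in (0, 1))
proposed = sorted(index_key(TABLE, idx, pk) for pk in PKS for idx in (1, 2))

# Row 2 in the current layout: its two families are adjacent.
current_row2 = positions(current, 2)
# Row 2 in the proposed layout: its two halves are separated by other rows.
proposed_row2 = positions(proposed, 2)
```

With three rows, the two KV pairs for row 2 sit next to each other in the current layout, while in the proposed layout they are separated by the other rows' entries in the first index. On a large table that gap grows with the row count, which is why a whole-row read may span ranges.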

This model gets more interesting if we generalize it from "two half-primary keys" to "every column must be stored in at least one index" (and more subtly, there must be paths to look up every column given a PK). This even allows columns in different families to be partitioned differently (for example, to make some columns available for follower reads in other regions while other columns are replica-partitioned to have faster writes in the home region).
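One way to read the "paths to look up every column given a PK" invariant is as a reachability check over indexes. The sketch below is my interpretation, not something specified in this proposal: it assumes each index is a `(key_columns, stored_columns)` pair, that an index can be used for a lookup once all of its key columns are known, and that such a lookup yields exactly one row. The function and column names are hypothetical.

```python
def unreachable_columns(pk_cols, all_cols, indexes):
    """Return the columns that cannot be looked up starting from the PK.

    indexes is a list of (key_columns, stored_columns) pairs. An index is
    usable once every one of its key columns is already known.
    """
    known = set(pk_cols)
    progress = True
    while progress:
        progress = False
        for key_cols, stored_cols in indexes:
            if set(key_cols) <= known:
                newly_known = set(stored_cols) - known
                if newly_known:
                    known |= newly_known
                    progress = True
    return set(all_cols) - known

COLS = ["id", "email", "blob"]

# Two "half-primary" indexes keyed on the PK: every column is reachable.
ok = unreachable_columns(
    ["id"], COLS, [(("id",), ("email",)), (("id",), ("blob",))]
)

# blob is only stored under an index keyed on email, reached via id -> email.
chained = unreachable_columns(
    ["id"], COLS, [(("id",), ("email",)), (("email",), ("blob",))]
)

# Every column appears in some index, but nothing maps id -> email,
# so neither email nor blob can be found from the PK.
broken = unreachable_columns(["id"], COLS, [(("email",), ("blob",))])
```

The last case is the subtle one: "stored in at least one index" holds for `blob`, yet there is no path from the PK to it, so a schema change would need to reject that configuration.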

This model appeals to me on a theoretical level because it removes the "special case" of column families in favor of a generalization of the relationship between tables and indexes. However, it also introduces a lot of new complexity in the form of complex relationships between indexes and invariants that need to be preserved. I think I've mostly talked myself out of this idea since I haven't been able to come up with use cases that it would help, but I wanted to write it down for posterity and see if anyone else was inspired by it.

Jira issue: CRDB-5398

@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@jordanlewis jordanlewis added C-investigation Further steps needed to qualify. C-label will change. and removed no-issue-activity labels Jun 8, 2021
@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023

3 participants