Docs for column families #422

Merged
merged 6 commits into from
Jun 28, 2016

Conversation

jseldess
Contributor

@jseldess jseldess commented Jun 27, 2016

  • Regenerated and adjusted grammar diagrams.
  • Added column families to create-table.md.
  • Added stand-alone page on column families: column-families.md.
  • Updated kv faq to mention using column families.

HTML version available here: http://cockroach-docs-review.s3-website-us-east-1.amazonaws.com/6545def8fd90b97db710c935cc7322955af9fa4c/

Fixes #423



@jseldess
Contributor Author

@paperstreet, I don't think the story is right yet. Once we define column families by default, I'm inclined to reveal more about that in column-families.md and create-table.md. But please let me know your thoughts. I know I'm light on details here, and there must be areas I'm missing.

@petermattis

@danhhz
Contributor

danhhz commented Jun 27, 2016

I think the story is pretty good as you have it. Here are my initial thoughts. I'll do a follow-up pass and think about what you're missing.


Review status: 0 of 9 files reviewed at latest revision, 4 unresolved discussions.


column-families.md, line 9 [r1] (raw file):

As of the `beta-20160629` release, CockroachDB supports **column families**. A column family is a group of columns in a table that are stored as a single key-value pair in the underlying key-value layer. Storing column values in this way significantly reduces storage overhead. 

{{site.data.alerts.callout_info}}Tables created as of <code>beta-20160629</code> are not compatible with earlier versions of CockroachDB.{{site.data.alerts.end}}

This is only true if you use the feature, which I feel is an important distinction. If you've upgraded and haven't manually assigned families (even if you've created new tables), you can downgrade


column-families.md, line 15 [r1] (raw file):

## Overview

When a row is inserted into a table with column families, CockroachDB stores the values for each column family as a single key-value pair. For example, consider a table with 2 columns, where the columns are grouped into a column family:

single key-value pair per family


column-families.md, line 25 [r1] (raw file):

Inserting 10 rows into this table would create 10 underlying key-value pairs, 1 per row. In contrast, if the table's columns were not grouped into a family, each column and value would be a distinct key-value pair; thus, inserting 10 rows would create 20 underlying key-value pairs, 2 per row.

I'm not really sure what we're contrasting to. If some competitor, which one? If previous versions of cockroach, this is not technically true, because we'd have an extra +1 per row for a total of 30
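The arithmetic in this thread can be checked with a small sketch. It assumes a deliberately simplified model: a table with families writes one key-value pair per family per row, and the earlier layout writes one pair per column plus the extra per-row pair mentioned in the comment above. The function name `kv_pairs` is hypothetical and is not part of CockroachDB.

```python
from typing import Optional


def kv_pairs(rows: int, columns: int, families: Optional[int] = None) -> int:
    """Count key-value pairs under a simplified storage model.

    families=None models the earlier layout described in the review comment:
    one pair per column value plus one extra pair per row.
    Otherwise each row writes one pair per column family.
    """
    if families is None:
        return rows * (columns + 1)
    return rows * families


# 10 rows, 2 columns, all columns in a single family
print(kv_pairs(10, 2, families=1))
# 10 rows, 2 columns, earlier layout without families
print(kv_pairs(10, 2))
```

Under these assumptions the single-family table yields 10 pairs and the earlier layout yields 30, matching the totals discussed in the review.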


column-families.md, line 77 [r1] (raw file):

## Upcoming Improvements

In an upcoming release, you won't need to define column families manually. Instead, CockroachDB will group columns into families for you when a table is created. CockroachDB's default groupings will ensure optimal storage and performance, but you will still have the option to define your own groups using the `FAMILY` keyword, as shown above.

optimal is far too strong a promise for what is coming : - )

maybe reasonable


Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 27, 2016

Okay, this is what I could think of that's missing:

dt mentions that the "Storage Parameters" section of https://www.postgresql.org/docs/current/static/sql-createtable.html is a good example of the level of emphasis we should be placing on the manual configuration of these. Note that it's fairly buried.


Review status: 0 of 9 files reviewed at latest revision, 6 unresolved discussions.


column-families.md, line 7 [r1] (raw file):

---

As of the `beta-20160629` release, CockroachDB supports **column families**. A column family is a group of columns in a table that are stored as a single key-value pair in the underlying key-value layer. Storing column values in this way significantly reduces storage overhead. 

The even bigger advantage is that it reduces the number of keys you're changing during inserts/updates/deletes, which has all kinds of performance benefits: less data over the wire, smaller raft and rocksdb logs, fewer write intents to conflict. Basically, this is an even bigger deal for a 3-node setup than for a single node.

Maybe "


column-families.md, line 52 [r1] (raw file):

To group columns into multi-column families, you must use the FAMILY keyword, for example:

I don't think it's clear anywhere that you can specify more than one family. You might want to do this if you have one small column that gets updated a lot and one big column that doesn't. Because of the way families work, if you put both in one family, the big one will have to be rewritten every time the little one is updated. This general issue is why we won't simply always use one family
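The rewrite cost described here can be illustrated with a rough sketch. It assumes that updating any column rewrites every value in that column's family, and it uses made-up column names and byte sizes; `bytes_rewritten` is a hypothetical helper, not a CockroachDB function.

```python
def bytes_rewritten(updated_column, families, sizes):
    """Return the bytes rewritten when one column changes.

    Simplified model: the whole family containing the column is rewritten.
    """
    for family in families:
        if updated_column in family:
            return sum(sizes[col] for col in family)
    raise KeyError(updated_column)


# Hypothetical value sizes in bytes: a small counter and a large blob.
sizes = {"id": 8, "counter": 8, "blob": 1_000_000}

one_family = [["id", "counter", "blob"]]
two_families = [["id", "counter"], ["blob"]]

print(bytes_rewritten("counter", one_family, sizes))
print(bytes_rewritten("counter", two_families, sizes))
```

In this model, updating `counter` rewrites 1,000,016 bytes with one family and 16 bytes with two, which is the trade-off behind not always using a single family.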

Important to emphasize: You should always use as few column families as reasonable. Our upcoming heuristic tries to group "bounded-size" columns (INT, TIMESTAMP, STRING(100), etc). Unfortunately, if the user specifies STRING, DECIMAL, BYTES without a max size, we have to be pessimistic and give it its own group. This is the major reason that users should expect to tell cockroach about family assignments. If you know that your STRING (or DECIMAL or BYTES) field will usually be small, but don't want the hard bound on it, then you can use this feature to force them together.
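A minimal sketch of the grouping idea described above, assuming only what the comment states: bounded-size columns share one family, and each STRING, DECIMAL, or BYTES column without a declared size gets its own. The names `is_bounded` and `assign_families` and the example table are hypothetical; the actual heuristic may differ.

```python
import re

# Types treated as unbounded when declared without a size (per the comment above).
UNBOUNDED_TYPES = {"STRING", "DECIMAL", "BYTES"}


def is_bounded(col_type: str) -> bool:
    """Return True if the type is fixed-size or declares a maximum size."""
    normalized = col_type.strip().upper()
    # A declared size such as STRING(100) or DECIMAL(10, 2) counts as bounded.
    if re.fullmatch(r"[A-Z]+\(\d+(,\s*\d+)?\)", normalized):
        return True
    return normalized not in UNBOUNDED_TYPES


def assign_families(columns):
    """Group (name, type) pairs: bounded columns together, unbounded ones alone."""
    bounded = [name for name, col_type in columns if is_bounded(col_type)]
    families = [bounded] if bounded else []
    families += [[name] for name, col_type in columns if not is_bounded(col_type)]
    return families


example = [
    ("id", "INT"),
    ("created", "TIMESTAMP"),
    ("code", "STRING(100)"),
    ("bio", "STRING"),
]
print(assign_families(example))
```

Here `bio` lands in its own family because its size is undeclared. A user who knows `bio` stays small could override that by assigning families manually.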


Comments from Reviewable

@jseldess
Contributor Author

Review status: 0 of 9 files reviewed at latest revision, 6 unresolved discussions.


column-families.md, line 9 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

This is only true if you use the feature, which I feel is an important distinction. If you've upgraded and haven't manually assigned families (even if you've created new tables), you can downgrade

Oh, right. I guess this statement will be true once we define families by default. I've revised the note for now. Look ok now?

column-families.md, line 15 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

single key-value pair per family

Done.

column-families.md, line 25 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I'm not really sure what we're contrasting to. If some competitor, which one? If previous versions of cockroach, this is not technically true, because we'd have an extra +1 per row for a total of 30

OK, refined this to contrast to earlier versions of cockroach. Let me know if you don't think this is useful.

column-families.md, line 77 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

optimal is far too strong a promise for what is coming : - )

maybe reasonable

Done.

Comments from Reviewable

@jseldess
Contributor Author

Review status: 0 of 9 files reviewed at latest revision, 6 unresolved discussions.


column-families.md, line 7 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

The even bigger advantage is that it reduces the number of keys you're changing during inserts/updates/deletes, which has all kinds of performance benefits: less data over the wire, smaller raft and rocksdb logs, fewer write intents to conflict. Basically, this is an even bigger deal for a 3-node setup than for a single node.

Maybe "

Revised a bit. Let me know if you want more detail still.

column-families.md, line 52 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I don't think it's clear anywhere that you can specify more than one family. You might want to do this if you have one small column that gets updated a lot and one big column that doesn't. Because of the way families work, if you put both in one family, the big one will have to be rewritten every time the little one is updated. This general issue is why we won't simply always use one family

Important to emphasize: You should always use as few column families as reasonable. Our upcoming heuristic tries to group "bounded-size" columns (INT, TIMESTAMP, STRING(100), etc). Unfortunately, if the user specifies STRING, DECIMAL, BYTES without a max size, we have to be pessimistic and give it its own group. This is the major reason that users should expect to tell cockroach about family assignments. If you know that your STRING (or DECIMAL or BYTES) field will usually be small, but don't want the hard bound on it, then you can use this feature to force them together.

I added a "Column Family Recommendations" section to cover some of this. For now, though, I'm leaving out the details connected to our heuristics. I can put those in later.

Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 28, 2016

just a last couple notes


Review status: 0 of 9 files reviewed at latest revision, 8 unresolved discussions.


column-families.md, line 52 [r1] (raw file):

Previously, jseldess wrote…

I added a "Column Family Recommendations" section to cover some of this. For now, though, I'm leaving out the details connected to our heuristics. I can put those in later.

Agreed that we don't need the heuristics yet. I do think it's important to specify in which circumstances you'd need to hint to cockroach about family assignments. The most common one will be a STRING (or DECIMAL or BYTES) field that has no declared limit, but you know it will be small. Cockroach has to be pessimistic (maybe it will be huge, how do we know) but if you know otherwise, the override will be beneficial.

For example
CREATE TABLE users (id SERIAL PRIMARY KEY, name STRING, address STRING)

name and address will both be small enough that this should all be one family


column-families.md, line 74 [r3] (raw file):

| Table |                CreateTable                 |
+-------+--------------------------------------------+
| t30   | CREATE TABLE t30 (␤                        |

t30?


column-families.md, line 87 [r3] (raw file):

Column Family Recommendations

I only just remembered one more restriction on family assignments. Primary index columns must be placed in the first listed family (family #0).

> create table foo (a int primary key, b int, family (b), family (a));
pq: primary key column 1 is not in column family 0
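The restriction can be sketched as a simple check, assuming families are given in the order they are listed and that every primary key column must appear in the first one. `check_primary_key_family` is a hypothetical helper, and the number in its error message is the 1-based position within the primary key, which is an assumption and may not match how CockroachDB numbers columns.

```python
def check_primary_key_family(primary_key, families):
    """Raise ValueError if a primary key column is outside family 0."""
    first_family = set(families[0]) if families else set()
    for position, column in enumerate(primary_key, start=1):
        if column not in first_family:
            raise ValueError(
                f"primary key column {position} is not in column family 0"
            )


# Mirrors the failing statement above: FAMILY (b) is listed before FAMILY (a).
try:
    check_primary_key_family(["a"], [["b"], ["a"]])
except ValueError as err:
    print(err)

# Listing the primary key column's family first passes the check.
check_primary_key_family(["a"], [["a"], ["b"]])
```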

Comments from Reviewable

@jseldess
Contributor Author

Review status: 0 of 9 files reviewed at latest revision, 8 unresolved discussions.


column-families.md, line 52 [r1] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Agreed that we don't need the heuristics yet. I do think it's important to specify in which circumstances you'd need to hint to cockroach about family assignments. The most common one will be a STRING (or DECIMAL or BYTES) field that has no declared limit, but you know it will be small. Cockroach has to be pessimistic (maybe it will be huge, how do we know) but if you know otherwise, the override will be beneficial.

For example
CREATE TABLE users (id SERIAL PRIMARY KEY, name STRING, address STRING)

name and address will both be small enough that this should all be one family

Done.

column-families.md, line 74 [r3] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

t30?

Done.

column-families.md, line 87 [r3] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I only just remembered one more restriction on family assignments. Primary index columns must be placed in the first listed family (family #0).

> create table foo (a int primary key, b int, family (b), family (a));
pq: primary key column 1 is not in column family 0

Done.

Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 28, 2016

:lgtm:


Review status: 0 of 9 files reviewed at latest revision, 8 unresolved discussions.


Comments from Reviewable

@jseldess jseldess merged commit 64bac9e into gh-pages Jun 28, 2016
@jseldess jseldess deleted the column-families branch June 28, 2016 18:23
2 participants