sql: support collations #2473
Comments
I think this issue should be labeled as a bug, not an enhancement. Example:
create table accounts(id int primary key, balance decimal, name string(64));
insert into accounts values(1,decimal '10.01', 'Slm1');
insert into accounts values(2,decimal '20.02', 'Şlm2');
insert into accounts values(3,decimal '30.03', 'Ümran');
select * from accounts order by name;
+----+---------+--------------+
| id | balance | name |
+----+---------+--------------+
| 1 | 10.01 | Slm1 |
| 3 | 30.03 | "\u00dcmran" |
| 2 | 20.01 | "\u015elm2" |
+----+---------+--------------+
It doesn't understand |
The byte-by-byte string comparison lives in Line 505 in be463fc
|
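To make the ordering in the example above concrete, here is a minimal Go sketch using only the standard library. It is not CockroachDB's comparison code; the helper name `sortByBytes` is made up for illustration. Go's built-in string ordering compares UTF-8 bytes, so it reproduces the same result: `S` is byte 0x53, `Ü` encodes as c3 9c, and `Ş` encodes as c5 9e, which puts "Ümran" ahead of "Şlm2".

```go
package main

import (
	"fmt"
	"sort"
)

// sortByBytes (hypothetical helper) returns a copy of names sorted with Go's
// built-in string ordering, which compares UTF-8 bytes and ignores locale.
func sortByBytes(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

func main() {
	names := []string{"Slm1", "Şlm2", "Ümran"}
	for _, n := range sortByBytes(names) {
		// "% x" prints the raw UTF-8 bytes that drive the ordering.
		fmt.Printf("%-6s % x\n", n, n)
	}
}
```

A Turkish reader would expect "Şlm2" before "Ümran"; a bytewise comparison cannot produce that order because it only sees the encoded bytes.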
Here's a proposal for adding collation to CockroachDB, based on PostgreSQL's documentation. (I misunderstood the feature request before.) Add a language tag to Add a language tag to Add a language tag to String operations take the language tag into account as follows.
The type checker catches all of the new errors statically. Computing tags is a straightforward bottom-up traversal.
Known difficulties
Sorting strategies need to consider language tags, since the KV store orders primary keys by bytes, not collation.
Collations where two strings with different bytes are equal seem like a headache. I'm particularly worried about primary keys. We should probably not support these collations at first. |
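As a rough illustration of what a bottom-up tag computation could look like, here is a small Go sketch. Everything in it is an assumption for illustration: the `Expr` type is not CockroachDB's AST, and the single rule used (both operands of a binary node must carry the same tag, otherwise it is a static error) is a simplification, not the rule set from the proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// Expr is a hypothetical expression node, not CockroachDB's AST.
// Leaves carry a language tag; binary nodes have both children set.
type Expr struct {
	Tag         string // language tag for leaves, e.g. "tr"
	Left, Right *Expr  // both nil for leaves
}

var errMismatch = errors.New("collation mismatch")

// computeTag resolves the tag of e from its children upward.
// Assumed rule: operands must agree on the tag, otherwise it is an error.
func computeTag(e *Expr) (string, error) {
	if e.Left == nil && e.Right == nil {
		return e.Tag, nil
	}
	l, err := computeTag(e.Left)
	if err != nil {
		return "", err
	}
	r, err := computeTag(e.Right)
	if err != nil {
		return "", err
	}
	if l != r {
		return "", fmt.Errorf("%w: %q vs %q", errMismatch, l, r)
	}
	return l, nil
}

func main() {
	tr := &Expr{Tag: "tr"}
	de := &Expr{Tag: "de"}
	fmt.Println(computeTag(&Expr{Left: tr, Right: tr}))
	fmt.Println(computeTag(&Expr{Left: tr, Right: de}))
}
```

Because the traversal needs no runtime data, a mismatch like the second call can be reported at type-checking time.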
In offline discussion, Vivek raised the question of whether we want to change the key encoding for specially collated columns so that it's possible to extract ranges efficiently. The obvious encoding technique is to emit the collation key followed by the string itself, but this can double the storage needed and doesn't handle collations where byte-unequal strings collate equal. Since we're choosing a data format, we should get this right. I tried to figure out what PostgreSQL does, but the only relevant piece of documentation that I could find is cryptic.
LIKE presumably can't use an index because of combining characters. |
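The "collation key followed by the string itself" encoding mentioned above can be sketched as follows. This is a toy under stated assumptions: `toyCollationKey` is a hypothetical stand-in that simply lower-cases the input (so strings differing only in case collate equal), and the 0x00 separator assumes the strings contain no zero bytes. A real encoding would need a proper collation key and an escaping scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// toyCollationKey is a hypothetical stand-in for a real collation key.
// It lower-cases s, so "abc" and "ABC" collate equal under this toy rule.
func toyCollationKey(s string) []byte {
	return []byte(strings.ToLower(s))
}

// encodeKeyThenString emits the collation key, a 0x00 separator, then the
// raw string. Assumes s contains no 0x00 bytes (no escaping is done here).
func encodeKeyThenString(s string) []byte {
	key := toyCollationKey(s)
	out := make([]byte, 0, len(key)+1+len(s))
	out = append(out, key...)
	out = append(out, 0x00)
	return append(out, s...)
}

func main() {
	for _, s := range []string{"abc", "ABC"} {
		enc := encodeKeyThenString(s)
		fmt.Printf("%q: raw %d bytes, encoded %d bytes, encoded=%q\n", s, len(s), len(enc), enc)
	}
}
```

The sketch shows both concerns raised above: the encoded form is roughly twice the size of the raw string, and two strings that collate equal under the toy rule still produce different encoded keys, so bytewise key equality no longer matches collation equality.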
I think it's important that queries using a non-C collation can be fast (i.e. use an index), so indexes should store the collation key. I think using the collation key solves the problem of languages where byte-unequal strings compare equal. Such strings would have byte-equal collation keys. |
We could avoid the double-storage by only storing the collation key in the index, and going back to the primary data row for the full string (a sort of anti-covering index). |
I think it's a bad idea to avoid the double storage if it makes these keys non-covering. The whole point of an index is to trade size for speed. I think the big problem with the double storage is that it'll break all the existing on-disk strings. There are a number of other on-disk formats we have discussed or would like to change but haven't, since there's no good way to do that except for a full SQL dump and import. So if we decide to change on-disk stuff for this, maybe we should figure out a more general way to do these migrations. |
@mjibson I don't think we have to break existing on-disk strings. The collation key would be stored in the index key while the raw key would be stored in the value. The |
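The key/value split described above (collation key in the index key, raw string in the value) might look roughly like this in Go. It is only a sketch: `indexEntry`, `buildIndex`, and `toyCollationKey` are hypothetical names, the toy key just lower-cases the string as a stand-in for a real collation key, and a sorted slice stands in for the KV store's bytewise key order.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strings"
)

// toyCollationKey is a hypothetical stand-in for a real collation key.
func toyCollationKey(s string) []byte {
	return []byte(strings.ToLower(s))
}

// indexEntry models one index row: the key is compared bytewise,
// the value keeps the original string so the index stays covering.
type indexEntry struct {
	Key   []byte
	Value string
}

// buildIndex derives an entry per row and orders entries by key bytes,
// imitating how a KV store would order them.
func buildIndex(rows []string) []indexEntry {
	entries := make([]indexEntry, 0, len(rows))
	for _, r := range rows {
		entries = append(entries, indexEntry{Key: toyCollationKey(r), Value: r})
	}
	sort.SliceStable(entries, func(i, j int) bool {
		return bytes.Compare(entries[i].Key, entries[j].Key) < 0
	})
	return entries
}

func main() {
	for _, e := range buildIndex([]string{"cherry", "Banana", "apple"}) {
		fmt.Printf("key=%q value=%q\n", e.Key, e.Value)
	}
}
```

Sorting the raw strings bytewise would put "Banana" first because `B` (0x42) sorts before `a` (0x61); ordering by the key gives the collated order while the value still returns the original spelling.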
I think it makes sense to implement this in three PRs.
I'm working on 1 currently. |
Sounds good to me. |
I think decimals may benefit from a similar split across the key and value: |
Yay! congrats on fixing this issue! |
@eisenstatdavid do you mind taking a quick look at Ben's comment to see if we should break this into some new issues? Thanks!! |
It looks like Go has fairly good support for collations: https://godoc.org/golang.org/x/text/collate. The challenge is to plumb through the use of the collation everywhere we're performing string comparisons.