[JTS] Primary Key / ID attribute support in JSON table schema #21
Comments
@mk270 thoughts here? |
If we do it at all, let's do it in a later version |
AFAICT from my research with json schema there is no provision for this in json schema (there's this discussion but nothing in the spec ...) Key question:
|
Propose: { fields: [ { id: ..., type: ..., primarykey: true } ] } Decision: we allow multiple fields to make up the primary key but RECOMMEND having only one. It is also allowed to have no fields set as primarykey. |
@davidmiller - any thoughts? |
I note that although json schema has no primary key, it does support "required", and it does not put that on each element but rather outside in a list, see e.g. http://json-schema.org/example1.html
|
@maxogden any thoughts here? (This is definitely needed for |
it's not required but I support it (e.g. if you don't specify one then uuids will be generated for your rows for you) also, how would it work exactly to specify multiple primary keys? would they get concatenated together to form one meta-primary-key or that they can be used individually/optionally? e.g. if I had a csv with columns i'm not sure if tools should allow to interpret multiple columns as one composite key or if that is outside the scope |
@maxogden I'd planned that you could have multiple primary keys (set primarykey=true on multiple fields) but that you should not (i.e. heavily discouraged). Interpretation of multiple primary keys would be up to the client but natural one would be to have a composite key (aside: I'd thought about this in terms of an API for data.okfn.org where you have data.okfn.org/{dataset}/{file}/id/{primary-key} and think you'd do for composite stuff {col-1}/{col-2} etc) |
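The URL layout mentioned in the aside above could be sketched roughly as follows. This is only an illustration of the {col-1}/{col-2} idea: the helper name row_url, the base URL scheme, and the sample dataset and file names are assumptions, not an existing data.okfn.org API.

```python
from urllib.parse import quote


def row_url(dataset, file, key_values, base="http://data.okfn.org"):
    """Build a hypothetical row URL: {base}/{dataset}/{file}/id/{col-1}/{col-2}/...

    Each key value becomes its own path segment and is percent-encoded,
    so a "/" inside a value cannot be mistaken for a segment boundary.
    """
    parts = [dataset, file, "id"] + [str(v) for v in key_values]
    return base + "/" + "/".join(quote(p, safe="") for p in parts)


# Hypothetical dataset with a composite key of (country, year).
print(row_url("gdp", "annual", ["UK", 2012]))
```

Percent-encoding each segment separately is one way to keep composite values distinguishable in a path; other encodings would work equally well.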
@paulfitz interested in your thoughts here. Hope to close this and push in next day or so. |
@rgrp I'd vote for allowing (at most) one primary key, and allowing it to be a composite of multiple columns. I've found this to be common and useful practice. With Coopy I started off not covering this case, but then needed to when dealing with real-life data. For reference, Sqlite's treatment of specifying a primary key is as follows:
If multi-column primary keys end up allowed by the spec, then I'd suggest doing it wholeheartedly, and not put in language discouraging them or leaving interpretation to the client. Just so everyone knows what they can rely on for real. |
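As an illustration of "one primary key, possibly spanning several columns", here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are made up for the example; it only shows that a composite key treats the combination of columns as the unique identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One primary key made of two columns (hypothetical table layout).
conn.execute(
    "CREATE TABLE obs ("
    " country TEXT, year INTEGER, value REAL,"
    " PRIMARY KEY (country, year))"
)

conn.execute("INSERT INTO obs VALUES ('UK', 2012, 1.0)")
# Same country, different year: a different key, so this is accepted.
conn.execute("INSERT INTO obs VALUES ('UK', 2013, 2.0)")

try:
    # Same (country, year) pair as the first row: rejected.
    conn.execute("INSERT INTO obs VALUES ('UK', 2012, 3.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```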
@paulfitz so agreed on multi-column primary keys (but only one primary key). What do you think of doing it inline, i.e. primarykey=true (this is sqlalchemy style) on one or more fields, or outside of the fields list (more sql constraint style ...?) |
@rgrp inline is ok. Outside-of-fields style could eventually be simpler when dealing also with foreign keys (including composites), indexes, uniqueness, etc. I went with modeling just single-column foreign keys in Coopy and ended up regretting it. But for the primary key, inline is ok, since there's just one primary key and the order of columns in it doesn't matter. |
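For readers following along, the inline style under discussion could be consumed roughly like this. It is a sketch only: the function name primary_key_fields and the sample schema are hypothetical, and the primarykey flag is the proposal from this thread, not a published spec.

```python
def primary_key_fields(schema):
    """Return the ids of all fields flagged with primarykey: true.

    Several flagged fields are read as ONE composite primary key,
    in the order the fields appear in the schema.
    """
    return [
        field["id"]
        for field in schema.get("fields", [])
        if field.get("primarykey") is True
    ]


# Hypothetical schema using the inline flag on two fields.
schema = {
    "fields": [
        {"id": "country", "type": "string", "primarykey": True},
        {"id": "year", "type": "integer", "primarykey": True},
        {"id": "value", "type": "number"},
    ]
}

print(primary_key_fields(schema))  # ['country', 'year']
```

With the inline style the key's column order is implied by field order, which is one reason an outside-of-fields list may scale better to foreign keys and indexes.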
A primary key is an index. Most database systems I know of specify indexes separately from schema. So, I would prefer to have another top-level field (alongside Also, you can only ever have one primary key (though it may be a composite key). I know of no database system that allows for multiple "primary" keys (what would that even mean?). |
@jpmckinney what do you think of Sqlite's compromise, where a simple single-column primary key can be expressed directly in the column definition, while still having more flexible methods of describing constraints and indexes in the general case (like your |
The goal of SQLite's table definition format is to optimize table creation (the most common interaction people have with the table syntax), so it offers developers many ways to write short-hand. An important goal of Also, when introducing a new feature, I'm generally in favor of supporting one way of using that feature, and waiting for real demand to grow for alternatives, short-hands, etc. |
@jpmckinney i'm not sure that reading is prioritized over writing exactly - though I get your point. I'm still at least +0 on the shorthand primarykey: true or similar but appreciate your point. I'm going to comment in #23 now and I hope with that one closed we can decide whether primarykey option makes sense ... |
Howdy guys I just wanted to add my recommendation here based on a few things.
I don't think adding any kind of key description at the field level is correct. It's not a matter of shorthand either, as structurally all keys are properties of a dataset and not of a field.
Indices are a database-specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either, although implementations generally constrain primary keys to be unique. IMO, if the specification's goal is to be producer and consumer agnostic, then it wouldn't include references to the concept of an index and would leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag, as it might be seen as a description of the key's relationship to the dataset and not a constraint that we're defining. As far as how to implement a specification for keys, I think the most generic form would try to only include things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:
|
@besquared +1 to everything. Happy to call them |
just a note, in dat i've been experimenting with this API:
meaning for this CSV:
the key for that row would be:
but since
upside is that I get unique ID's, downside is that I have to know all 4 columns in order to retrieve the data for that row |
Hmm, simple concatenation is naive. Imagine another row where description is |
In the stuff I'm writing I generate an internal row id using a purely random UUID (version 4) or the SHA1 hash of the concatenated primary key values for that row. I can't think of a case where someone wouldn't have either the internal record id (via some kind of CLI or UI/URL) or all 4 of the values at hand (via a foreign key association for instance). It also makes it much more convenient for things like URLs. If everyone agrees on the row id generation scheme I could theoretically pass that hashed row id to another service and it could look up the matching records. Of course I could also pass huge lists of strings around too but that seems worse. |
@besquared You can hash the concatenated string - that's fine - all I'm saying in my comment is that simply concatenating column values is a naive approach, as it doesn't delimit the values in any way, and I give an example where the lack of a delimiter causes a key collision where there should be no collision. |
Sorry, I meant that I was agreeing with you. I'm glad you pointed out that concatenation alone isn't enough. :) |
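The collision described above can be reproduced with a small sketch. The two rows are made-up stand-ins, since the original example did not survive in this thread. JSON-encoding the list of values before hashing is just one possible way to delimit them, chosen here for illustration and not something agreed in the discussion.

```python
import hashlib
import json


def naive_key(values):
    """Plain concatenation: value boundaries are lost."""
    return "".join(str(v) for v in values)


def delimited_key(values):
    """JSON-encode the list so each value keeps its own boundaries."""
    return json.dumps([str(v) for v in values], separators=(",", ":"))


def row_id(values):
    """SHA1 hex digest of the delimited key (one possible row id scheme)."""
    return hashlib.sha1(delimited_key(values).encode("utf-8")).hexdigest()


# Two different hypothetical rows whose values concatenate identically.
row_a = ["ab", "c"]
row_b = ["a", "bc"]

print(naive_key(row_a) == naive_key(row_b))          # True: collision
print(delimited_key(row_a) == delimited_key(row_b))  # False: distinct keys
print(row_id(row_a) == row_id(row_b))                # False: distinct ids
```

Any unambiguous encoding (length prefixes, an escaped separator) would serve the same purpose; the point is only that the hash input must preserve where one value ends and the next begins.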
OK, so how about this for a final proposal:
Here's what it looks like:
rfc @maxogden @jpmckinney @besquared @paulfitz @sballesteros |
|
+1 for camel case since it is the default on JavaScript (and JSON comes from JS...). |
@turicas good to hear the +1 on camel case |
+1 camelCase (not my favorite but consistency is important) |
w00t! @besquared @jpmckinney @maxogden @paulfitz believe I've captured the agreed consensus here but welcome review of the change: 128726b |
Looks good! |
Looks great to me! Already implemented in the Mode Ruby data packages library. I'll get around to the string version soon. |
Ability to specify a field(s) as primary key / id field.
Proposal
Questions
Relationship to a possible distinct attribute "unique"
TODO: research / compare with other specs e.g. SQL, bigquery etc