Primary keys / index columns in tabular metadata files #1304

Closed
effigies opened this issue Sep 26, 2022 · 2 comments · Fixed by #1306
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.

Comments

@effigies
Collaborator

#1290 highlighted a rule, added with microscopy, that we have not considered in the schema. In the Samples file:

The combination of sample_id and participant_id MUST be unique.

And of course in participants.tsv "There MUST be exactly one row for each participant."

In both of these cases, each row is metadata associated with an object identified by one or more of the columns, making those columns the primary key in relational database terms. The values (or tuple of values, if multiple columns) must be unique because they can be used to look up the row. While we could write rules and come up with expressions that would enforce these constraints, it seems less ad hoc to make this a declarative property of a table.

Here's a proposal:

Participants:
  selectors:
    - path == "participants.tsv"
  initial_columns:
    - participant_id
  columns:
    participant_id: required
    species: recommended
    age: recommended
    sex: recommended
    handedness: recommended
    strain: recommended
    strain_rrid: recommended
  index_columns: [participant_id]
  additional_columns: allowed

samples:
  selectors:
    - path == "samples.tsv"
  columns:
    sample_id: required
    participant_id: required
    sample_type: required
    pathology: recommended
    derived_from: recommended
  index_columns: [sample_id, participant_id]
  additional_columns: allowed

Tooling would then enforce (and perhaps take advantage of) this constraint.
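
As a rough sketch of what such a tooling check might look like, assuming the table is read as plain TSV text and the index_columns list comes from the rule above. The function name and the sample rows are made up for illustration and are not part of any existing validator:

```python
import csv
import io
from collections import Counter


def find_duplicate_index_values(tsv_text, index_columns):
    """Return the index tuples that appear on more than one row.

    Hypothetical helper: `index_columns` is the list declared in the schema rule.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # An index column that is absent from the header cannot be checked.
    missing = [col for col in index_columns if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"index columns missing from table: {missing}")
    # Count each tuple of index values; any count above 1 violates uniqueness.
    counts = Counter(tuple(row[col] for col in index_columns) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative rows only: the first and third rows share the same index tuple.
samples_tsv = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-01\tsub-01\ttissue\n"
    "sample-01\tsub-02\ttissue\n"
    "sample-01\tsub-01\tcell line\n"
)
duplicates = find_duplicate_index_values(samples_tsv, ["sample_id", "participant_id"])
```

With a single index column the same function covers the participants.tsv case, since a one-element tuple is compared per row.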

I use index_columns because it feels a little less jargony than primary key, but we could also just use primary_key and document its meaning.

Note that there are some TSV files (such as events.tsv) where there is no primary key, so we can't get away with an implicit "first column is primary unless otherwise stated". Or we can, but then we would need to declare some file types as not having them.

@effigies effigies added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Sep 26, 2022
@sappelhoff
Member

I like index_columns, I think that'd be more clear to a larger group of people, and I also like explicitly declaring index columns.

@rwblair
Member

rwblair commented Sep 29, 2022

I like index_columns as well. The only other constraint I'd like to see is that any column used as an index must be a required column. I feel that keeps the meaning of the rule simpler, since something else then guarantees the column's existence.
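
A minimal sketch of that constraint as a check on the rule itself, assuming the rule has already been loaded into a plain dict shaped like the proposal above. The helper name and the deliberately broken rule are hypothetical:

```python
def index_columns_not_required(rule):
    """Return the index columns that the rule does not declare as 'required'."""
    columns = rule.get("columns", {})
    return [
        col
        for col in rule.get("index_columns", [])
        if columns.get(col) != "required"
    ]


# Mirrors the proposed samples rule.
samples_rule = {
    "columns": {
        "sample_id": "required",
        "participant_id": "required",
        "sample_type": "required",
        "pathology": "recommended",
        "derived_from": "recommended",
    },
    "index_columns": ["sample_id", "participant_id"],
}

# Illustrative broken rule: participant_id is indexed but only recommended.
broken_rule = {
    "columns": {"sample_id": "required", "participant_id": "recommended"},
    "index_columns": ["sample_id", "participant_id"],
}

ok = index_columns_not_required(samples_rule)
bad = index_columns_not_required(broken_rule)
```

An empty result means every index column is guaranteed to exist, so the uniqueness check never has to handle a missing column.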
