Primary keys / index columns in tabular metadata files #1304

effigies · 2022-09-26T13:46:11Z

#1290 highlighted a rule, added with microscopy, that we have not considered in the schema. In the Samples file:

The combination of sample_id and participant_id MUST be unique.

And of course in participants.tsv "There MUST be exactly one row for each participant."

In both of these cases, each row is metadata associated with an object identified by one or more of the columns, making those columns the primary key in relational database terms. The values (or tuple of values, if multiple columns) must be unique because they can be used to look up the row. While we could write rules and come up with expressions that would enforce these constraints, it seems less ad hoc to make this a declarative property of a table.

Here's a proposal:

participants:
  selectors:
    - path == "participants.tsv"
  initial_columns:
    - participant_id
  columns:
    participant_id: required
    species: recommended
    age: recommended
    sex: recommended
    handedness: recommended
    strain: recommended
    strain_rrid: recommended
  index_columns: [participant_id]
  additional_columns: allowed

samples:
  selectors:
    - path == "samples.tsv"
  columns:
    sample_id: required
    participant_id: required
    sample_type: required
    pathology: recommended
    derived_from: recommended
  index_columns: [sample_id, participant_id]
  additional_columns: allowed

Tooling would then enforce (and perhaps take advantage of) this constraint.
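As a rough illustration of what enforcing this could look like, here is a minimal Python sketch that reports index tuples appearing in more than one row. The function name `find_duplicate_keys` and the sample rows are hypothetical and not part of any existing validator; the only assumption is that the tool has the TSV text and the table's `index_columns` list at hand.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(tsv_text, index_columns):
    """Return index-column tuples that appear in more than one row.

    Hypothetical helper: not taken from any existing validator.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Build one tuple per row from the declared index columns and count repeats.
    counts = Counter(tuple(row[col] for col in index_columns) for row in reader)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up rows for illustration only.
samples_tsv = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-01\tsub-01\ttissue\n"
    "sample-01\tsub-02\ttissue\n"  # same sample_id, other participant: allowed
    "sample-01\tsub-01\tcell line\n"  # repeats (sample-01, sub-01): violation
)

duplicates = find_duplicate_keys(samples_tsv, ["sample_id", "participant_id"])
for key in duplicates:
    print("duplicate index value:", key)
```

Because the check works on the tuple of all index columns, the second row is accepted even though it reuses `sample-01`; only the repeated pair is flagged.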

I use index_columns because it feels a little less jargony than primary key, but we could also just use primary_key and document its meaning.

Note that there are some TSV files (such as events.tsv) where there is no primary key, so we can't get away with an implicit "first column is primary unless otherwise stated". Or we can, but then we would need to declare some file types as not having them.


sappelhoff · 2022-09-27T09:16:54Z

I like index_columns, I think that'd be more clear to a larger group of people, and I also like explicitly declaring index columns.

rwblair · 2022-09-29T17:35:00Z

I like index_columns as well. The only other constraint I'd like to see is that any column used as an index must be a required column. I feel like that keeps the meaning of the rule simpler, since there's something else guaranteeing the column's existence.
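A check along these lines could run against the schema itself. The sketch below assumes a table definition has been loaded into a plain dict shaped like the proposed YAML; `optional_index_columns` is a hypothetical name, not an existing schema tool function.

```python
def optional_index_columns(table):
    """Return index columns that the table does not mark as required.

    Hypothetical helper; `table` is a dict mirroring the proposed YAML.
    """
    columns = table.get("columns", {})
    return [
        name
        for name in table.get("index_columns", [])
        if columns.get(name) != "required"
    ]


# Table definition copied from the proposal above.
samples = {
    "columns": {
        "sample_id": "required",
        "participant_id": "required",
        "sample_type": "required",
        "pathology": "recommended",
        "derived_from": "recommended",
    },
    "index_columns": ["sample_id", "participant_id"],
}

# Deliberately broken variant: indexes on a column that is only recommended.
broken = dict(samples, index_columns=["sample_id", "pathology"])

print(optional_index_columns(samples))
print(optional_index_columns(broken))
```

A column missing from `columns` entirely is also reported, since it has no requirement level to rely on.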

effigies added the schema label (Issues related to the YAML schema representation of the specification. Patch version release.) Sep 26, 2022

This was referenced Sep 30, 2022

[ENH] BEP030: Functional Near-Infrared Spectroscopy #802

Merged

SCHEMA: Add index_columns metadata, render in tables #1306

Merged

effigies closed this as completed in #1306 Oct 10, 2022
