Easy selection of user defined catalogs #245

Open
charles-turner-1 opened this issue Nov 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.

This builds on the solution to #191 in #243.

With the changes introduced by #243, users are able to build & query their own catalogs by placing a catalog file in $HOME/.access_nri_intake_catalog/catalog.yaml, and this will be preferentially loaded over the default catalog at /g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml.
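As a rough illustration of the "preferentially loaded" behaviour described above, the lookup could be sketched as below. This is only a sketch: the two paths are the ones quoted in this issue, but `resolve_catalog_path` is a hypothetical helper name, not the package's actual function.

```python
from pathlib import Path

# Paths quoted in this issue.
USER_CATALOG = Path.home() / ".access_nri_intake_catalog" / "catalog.yaml"
DEFAULT_CATALOG = Path(
    "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml"
)


def resolve_catalog_path(
    user_path: Path = USER_CATALOG,
    default_path: Path = DEFAULT_CATALOG,
) -> Path:
    """Return the user catalog if the file exists, else the default.

    Hypothetical helper: it only illustrates the lookup order.
    """
    return user_path if user_path.is_file() else default_path
```

Because the user file wins whenever it exists, a forgotten file in the home directory silently changes which catalog is opened.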

This default catalog looks something like (NB. using the old version numbering)

sources:
  access_nri:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path: /g/data/xp65/public/apps/access-nri-intake-catalog/{{version}}/metacatalog.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/fs38+gdata/oi10+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v0.1.3
        description: Catalog version
        type: str

and a user defined catalog will look something like

sources:
  access_nri:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path: $DIR/{{version}}/metacatalog.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/al33+gdata/rr3+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v2024-11-11
        description: Catalog version
        max: v2024-11-11
        min: v2024-11-08
        type: str

where $DIR is a directory the user has placed their metacatalog in.

These changes represent a big step forward in terms of the user's ability to use bespoke catalogs. However, the architecture of intake is such that presently, if a user wished to compare catalogs/data obtained from catalogs, it would be necessary to:

  1. Move the user defined catalog to a new filename, eg
    $ mv ~/.access_nri_intake_catalog/catalog.yaml ~/._access_nri_intake_catalog/catalog.yaml
  2. Restart the python interpreter / jupyter kernel.
  3. Import the catalog.

This might create issues for users who wish to compare their custom catalog with the default catalog, and it can be made easier.

Describe the feature you'd like

Intake allows a single catalog to describe multiple sources: ie, the two catalogs above could be combined as

sources:
  access_nri:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path: /g/data/xp65/public/apps/access-nri-intake-catalog/{{version}}/metacatalog.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/fs38+gdata/oi10+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v0.1.3
        description: Catalog version
        type: str
  user_def:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path: $DIR/{{version}}/metacatalog.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/al33+gdata/rr3+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v2024-11-11
        description: Catalog version
        max: v2024-11-11
        min: v2024-11-08
        type: str
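For what it's worth, a combined file like the one above could be assembled programmatically rather than by hand. The sketch below works on already-parsed dictionaries (for example, what a YAML loader would return); `combine_sources` is a hypothetical helper and the two input dicts are heavily trimmed stand-ins for the real catalogs.

```python
import copy


def combine_sources(
    default_cat: dict,
    user_cat: dict,
    user_key: str = "user_def",
    source_key: str = "access_nri",
) -> dict:
    """Return a new catalog dict with the user's source added under user_key.

    Hypothetical helper: neither input dict is modified.
    """
    combined = copy.deepcopy(default_cat)
    if user_key in combined["sources"]:
        raise ValueError(f"source {user_key!r} already exists")
    combined["sources"][user_key] = copy.deepcopy(
        user_cat["sources"][source_key]
    )
    return combined


# Trimmed stand-ins for the two catalogs shown above.
default_cat = {
    "sources": {"access_nri": {"parameters": {"version": {"default": "v0.1.3"}}}}
}
user_cat = {
    "sources": {"access_nri": {"parameters": {"version": {"default": "v2024-11-11"}}}}
}

combined = combine_sources(default_cat, user_cat)
print(sorted(combined["sources"]))  # ['access_nri', 'user_def']
```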

This would then allow the user to perform the following operations:

>>> import intake
>>> intake.cat.access_nri
<access_nri catalog with 94 source(s) across 2272 rows>
>>> intake.cat.user_def
<user_def catalog with 2 source(s) across 849 rows>

Doing so requires an additional entry point for the user_def catalog, & so we would additionally require the following changes in pyproject.toml:

[project.entry-points."intake.catalogs"]
access_nri = "access_nri_intake.data:data"
+ user_def = "access_nri_intake.data:user_data"

and in src/access_nri_intake/data/__init__.py

try:
    data = intake.open_catalog(get_catalog_fp()).access_nri
+   user_data = intake.open_catalog(get_catalog_fp()).user_def
except FileNotFoundError:

Entry points are created at package build time & fixed, so realistically we would probably have to limit users to a single user defined catalog, unless we figure out a way to do some black magic to circumvent that limitation.

Describe alternatives you've considered

Leave as is - this might be an unnecessary addition.

Additional context

See #244 for sample implementation.

@charles-turner-1
Collaborator Author

@marc-white @rbeucher what are your opinions on this?

@rbeucher
Member

Sure, we could add a placeholder for user-built catalogs.

What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.

What do you think?

@marc-white
Collaborator

The main issue I see is that we have to 'pre-can' all of the information about the user's potential catalog - what guarantee do we have that this catalog information will match whatever the user comes up with?

Secondly, how would a user build their catalog? Would we need to provide updates to the existing access_nri_intake_catalog build scripts to add a "custom version" option?

Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?

What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.

I'm not sure this is technically feasible - the catalog content is read from the metacatalog.csv file, so we'd have to either find some way to 'modify' that in memory, or find some other way to append additional rows to an existing catalog.

@charles-turner-1
Collaborator Author

charles-turner-1 commented Nov 12, 2024

The main issue I see is that we have to 'pre-can' all of the information about the user's potential catalog - what guarantee do we have that this catalog information will match whatever the user comes up with?

I was envisaging a situation where the user would populate the user_def section - we'd just be providing them an entry point to access this catalog that is distinct from the default access_nri one. I'm pretty sure that in this use case, this shouldn't be an issue. Effectively, we would just implement the changes in #244 and then show the user how to populate this entry point with data.

Secondly, how would a user build their catalog? Would we need to provide updates to the existing access_nri_intake_catalog build scripts to add a "custom version" option?

I think we would just leave it up to the user to build their catalog however they see fit - eg. modifying build_all.sh etc. in order to generate a catalog. The aim of this would just be to allow them to swap back and forth between catalogs easily, and ideally to help avoid accidental use of the wrong catalog. I was fiddling with intake_dataframe_catalog this morning & didn't realise I still had ~/.access_nri_intake_catalog/catalog.yaml set. I can see this becoming a footgun.

Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?

Yeah, this is a really good point. Perhaps it would be better to direct users to load custom catalogs with intake.open_dataframe_catalog(...) somewhere in the docs.

What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.

I'm not sure this is technically feasible - the catalog content is read from the metacatalog.csv file, so we'd have to either find some way to 'modify' that in memory, or find some other way to append additional rows to an existing catalog.

I think it might actually be plausible to do this - I think we would just have to update the intake dataframe catalog driver to support multi-file catalogs. This would look something like

sources:
  access_nri:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path:
      - /g/data/xp65/public/apps/access-nri-intake-catalog/{{version}}/metacatalog.csv
      - $MY_EPHEMERAL_CATALOG.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/fs38+gdata/oi10+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v0.1.3
        description: Catalog version
        type: str

Probably it would be quite a bit more involved than that to actually implement, but I think it should be doable.
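To make the multi-file idea a bit more concrete, the core of it is just concatenating rows from several metacatalog CSVs that share a header. The sketch below uses only the standard library and in-memory files; it is not how `DfFileCatalog` works today, `concat_metacatalogs` is a hypothetical helper, and the experiment names and rows are made up for illustration.

```python
import csv
import io
from typing import Iterable, TextIO


def concat_metacatalogs(files: Iterable[TextIO]) -> list[dict[str, str]]:
    """Concatenate rows from several CSV files that share the same header."""
    rows: list[dict[str, str]] = []
    header = None
    for f in files:
        reader = csv.DictReader(f)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            # Refuse to merge catalogs whose columns differ.
            raise ValueError("metacatalog columns do not match")
        rows.extend(reader)
    return rows


# Made-up stand-ins for the main and ephemeral metacatalogs.
main = io.StringIO("name,model,yaml\nexp_a,ACCESS-OM2,a.yaml\n")
extra = io.StringIO("name,model,yaml\nexp_b,ACCESS-ESM1-5,b.yaml\n")

rows = concat_metacatalogs([main, extra])
print([r["name"] for r in rows])  # ['exp_a', 'exp_b']
```

A real implementation would also need to decide what happens when the same name appears in both files.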

@marc-white
Collaborator

marc-white commented Nov 12, 2024

I was envisaging a situation where the user would populate the user_def section - we'd just be providing them an entry point to access this catalog that is distinct from the default access_nri one. I'm pretty sure that in this use case, this shouldn't be an issue. Effectively, we would just implement the changes in #244 and then show the user how to populate this entry point with data.

I suppose we could tell the user to grab the 'real' catalog.yaml, put it in their home area, then populate it with their catalog info under the user_def heading? We can't let them populate the live catalog.yaml on xp65, otherwise that will affect everyone.

Yup, this is what I had in mind.

I think we would just leave it up to the user to build their catalog however they see fit - eg. modifying build_all.sh etc. in order to generate a catalog. The aim of this would just be to allow them to swap back and forth between catalogs easily, and ideally to help avoid accidental use of the wrong catalog. I was fiddling with intake_dataframe_catalog this morning & didn't realise I still had ~/.access_nri_intake_catalog/catalog.yaml set. I can see this becoming a footgun.

Yes, the ghost local catalog concern did make me think. Do you think it's worth throwing a warning of some kind if we load the local catalog, rather than the real one? That would at least slap most users in the face and remind them.

Yeah, absolutely.
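The warning being discussed could be as small as the sketch below. It assumes the caller already knows which path was loaded and which path is the default; `warn_if_local_catalog` is a hypothetical helper, not something that exists in the package.

```python
import warnings
from pathlib import Path


def warn_if_local_catalog(loaded: Path, default: Path) -> bool:
    """Emit a UserWarning when the loaded catalog is not the default one.

    Returns True if a warning was issued, False otherwise.
    """
    if Path(loaded) != Path(default):
        warnings.warn(
            f"Using local catalog {loaded} instead of the default {default}",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Using `UserWarning` keeps the message visible by default in both scripts and notebooks, without stopping the import.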

Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?

Yeah, this is a really good point. Perhaps it would be better to direct users to load custom catalogs with intake.open_dataframe_catalog(...) somewhere in the docs.

I think that approach would minimize confusion between the 'real' catalog and the user's own Frankenstein's monster version, especially once users start sharing with each other (see below).

What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.
...
I'm pretty convinced this isn't a great idea. Consider the following situation:

  1. Researcher creates custom add-on catalog that gets patched into intake.cat.access_nri
  2. Researcher generates a Jupyter notebook to do some analysis on their conjoined catalog
  3. Researcher hands Jupyter notebook down to PhD student to work on, but (because they're a dotty researcher-type like me) forgets that they have a custom catalog squashed into intake.cat.access_nri
  4. PhD student gets an 'experiment not found' error from the notebook, student pings us asking why an experiment is missing from the canonical catalog
  5. We spend an age trying to work out why something that was in the catalog isn't any more, until we figure out it was never in the catalog to begin with

Much better, I think, to keep a clear delineation between what is canonically in (and, by exclusion, what isn't in) intake.cat.access_nri.

Yeah, that's an excellent point.

Projects
Status: Backlog
Development

No branches or pull requests

3 participants