
Arch finalization proposal #25

Merged: 36 commits into arch-changes on Nov 8, 2023

Conversation


@dchandan dchandan commented Oct 6, 2023

Here is my proposal for finalizing the architecture of STAC populator. I have chosen to branch from arch-changes rather than overlay my changes to that branch, so once we finalize this PR we can merge it back into arch-changes and close PR #16.

There is currently only one implementation, for CMIP6 data residing on the UofT node. We expect more implementations to follow soon. Ouranos will probably want an implementation for CMIP6 data on the Ouranos node. I am also working on two other implementations: one for the NEX-GDDP data and one that populates our catalog with data from other catalogs (this implementation can probably be used verbatim by more than one Marble node).

Key changes:

  1. Extracting pydantic models into a separate file
    I moved the Pydantic models from stac_utils.py to a new models.py file.

  2. Removing cmip6 extension
    After examining the code, I came to the opinion that the cmip6 extension was not adding much value and could be removed. Here's why:

    • By the time the call to the extension was made from create_stac_item, all appropriate CMIP6 properties had already been added to the pystac.Item instance from the Pydantic data model that describes the CMIP6 Item Properties (CMIP6ItemProperties in add_CMIP6.py). Since the extension was simply adding properties that the Pydantic model was already adding, in its current form it provided no additional value.
    • Creating an extension requires one to define a schema and point to the URI for that schema. This is an additional complication that may be a barrier to those writing future ingestion scripts, so it is best avoided in an example implementation.
    • Nothing else in the code required the presence of this extension. If the extension itself was not adding anything and it was not strictly necessary, then it made sense to remove it.
    • This allowed the removal of a lot of duplicate code.
  3. Breaking up CFJsonItem
    I felt that CFJsonItem was doing two things that didn't quite fit together within one class. On the one hand, it contained logic for creating a new pystac.Item object from CF metadata; on the other, it contained logic specifically for use by the pystac.extensions.datacube extension. The former is applicable to all items created by crawling the THREDDS catalog, but the datacube extension code is not applicable to all STAC items, because the extension itself is not applicable to all types of data (or one may not want to use the extension even for data to which it is applicable). Therefore, having the extension logic in a class that gets instantiated for every catalog item didn't make sense to me.

    So, I dissolved CFJsonItem and extracted its logic as follows: code pertaining to the creation of a STAC item (more specifically, a pystac.Item instance representing a STAC item) is in a new function in stac_utils.py called STAC_item_from_metadata, while code meant to support the datacube extension is now in a new file, extensions.py, in the CMIP6-UofT implementations folder. I moved that code to the implementations folder rather than the core section because I am unsure how generally applicable it would be to other types of model output data, but this could be changed.

@dchandan dchandan changed the base branch from master to arch-changes October 6, 2023 18:52

dchandan commented Oct 6, 2023

To do: Datamodel extension wasn't working properly from the very start. We still have to get it to work.

@dchandan dchandan requested review from huard and Nazim-crim October 10, 2023 03:20
@fmigneault

@dchandan
Not sure if you were already aware of this, but there is a STAC extension for CF attributes: https://github.com/stac-extensions/cf.
I believe this is the notation that CFJsonItem should use; it is not specific to the CMIP6 or Datacube extensions, but applies to any CF attribute used in NetCDF.

All code in implementations should be moved under STACpopulator so that it can be imported externally.
Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb).
For this notebook, I am particularly interested in avoiding duplicating the logic in CMIP6populator.create_stac_item.

@fmigneault fmigneault left a comment

All dict or list type hints should be more specific, using the dict[<key>, <value>] and list[<item>] notation.
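As an illustration of the reviewer's point (the helper name below is hypothetical, not from the PR), the difference between the vague and the specific notation looks like:

```python
# Vague: callers learn nothing about the expected structure.
def media_type_names(media_types: dict) -> list:
    ...

# Specific: keys are THREDDS service names, values are media-type strings,
# and the result is a sorted list of service names.
def media_type_names_typed(media_types: dict[str, str]) -> list[str]:
    return sorted(media_types)
```

(In the actual diff, some values are pystac.MediaType enum members rather than plain strings, so the value type annotation would be adjusted accordingly.)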

implementations/CMIP6-UofT/extensions.py (resolved)
implementations/CMIP6-UofT/add_CMIP6.py (resolved)
implementations/CMIP6-UofT/extensions.py (resolved)
implementations/CMIP6-UofT/extensions.py (resolved)
STACpopulator/populator_base.py (resolved)
STACpopulator/models.py (resolved)
STACpopulator/populator_base.py (resolved)
Comment on lines 129 to 157
media_types = {
"httpserver_service": "application/x-netcdf",
"opendap_service": pystac.MediaType.HTML,
"wcs_service": pystac.MediaType.XML,
"wms_service": pystac.MediaType.XML,
"nccs_service": "application/x-netcdf",
"HTTPServer": "application/x-netcdf",
"OPENDAP": pystac.MediaType.HTML,
"NCML": pystac.MediaType.XML,
"WCS": pystac.MediaType.XML,
"ISO": pystac.MediaType.XML,
"WMS": pystac.MediaType.XML,
"NetcdfSubset": "application/x-netcdf",
}

asset_roles = {
"httpserver_service": ["data"],
"opendap_service": ["data"],
"wcs_service": ["data"],
"wms_service": ["visual"],
"nccs_service": ["data"],
"HTTPServer": ["data"],
"OPENDAP": ["data"],
"NCML": ["metadata"],
"WCS": ["data"],
"ISO": ["metadata"],
"WMS": ["visual"],
"NetcdfSubset": ["data"],
}
Collaborator

These look like THREDDS-specific metadata. They should most probably be attributes within its implementation.

Comment on lines +45 to +74
# class Link(BaseModel):
# """
# https://github.com/radiantearth/stac-spec/blob/v1.0.0/collection-spec/collection-spec.md#link-object
# """

# href: str = Field(..., alias="href", min_length=1)
# rel: str = Field(..., alias="rel", min_length=1)
# type: Optional[str] = None
# title: Optional[str] = None
# # Label extension
# label: Optional[str] = Field(None, alias="label:assets")
# model_config = ConfigDict(use_enum_values=True)

# def resolve(self, base_url: str) -> None:
# """resolve a link to the given base URL"""
# self.href = urljoin(base_url, self.href)


# class PaginationLink(Link):
# """
# https://github.com/radiantearth/stac-api-spec/blob/master/api-spec.md#paging-extension
# """

# rel: Literal["next", "previous"]
# method: Literal["GET", "POST"]
# body: Optional[Dict[Any, Any]] = None
# merge: bool = False


# Links = RootModel[List[Union[PaginationLink, Link]]]
Collaborator

Not sure if this gets lost after conversion?

At the very least, the converted STAC Item should contain the rel: source link with href pointing at the original NetCDF file and the rel: describes link with href pointing at the NCML. In the case of rel: source, it is also very important that title contains the "path" of the file without the THREDDS URL prefix and the "service" path element (dodsC, fileServer, etc.).

Collaborator Author

Sorry, I didn't quite understand your point. The access URLs are contained as assets to the item. This is code that David had but it didn't look like it was being used, so I left it there but commented it out.

@huard would you like to pitch in here with a description of what you were doing and if that was complete? Also, how we can include Francis's suggestion?

@fmigneault fmigneault Oct 13, 2023

When the STAC Item is created (POST'd to the API), some resources in Magpie are generated under the stac service to manage its permissions.
Since that item refers to a NetCDF under THREDDS, a corresponding resource exists under thredds service to manage its permissions as well.
Using bird-house/birdhouse-deploy#386, the intent is that when a permission is updated on one, it will synchronize with the other, such that viewing the NetCDF metadata from either service is allowed/denied in the same way for a given user or group.

The problem in doing the above is that the URLs of these 2 services are not "compatible".
The generated resources would be nested like so (for simplicity, I omit the nested-directory handling below) :

stac
  collections
    <directory-uuid>
      items
        <another-uuid>
thredds
  <directory-name>
    <netcdf-name>

Since one works with UUIDs and the other with path names, there is no direct way to guess the references only from the URLs. Therefore, the STAC hook will use one of the references in links, namely the rel: source one, to obtain the STAC Asset URL of the original THREDDS NetCDF location. To establish the "link" between the stac and thredds Magpie services, it will use the title value from the rel: source link to establish a resource_display_name in Magpie. That parameter will be readable from Cowbird later on to resolve the stac UUID <-> thredds name resource relationship.

For example, the following catalog reference:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/xclim/cmip6/catalog.html?dataset=birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc
Should have (at least) the following link in the STAC Item

{
  "rel": "source",
  "title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
  "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
  "type": "application/x-netcdf"
}
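The title derivation described above (dataset path with the THREDDS URL prefix and service element stripped) can be sketched like this. This is an illustration of the behaviour, not the populator's actual helper; the function name and the set of service elements are assumptions:

```python
from urllib.parse import urlparse

# Assumed set of THREDDS service path elements; extend as needed.
THREDDS_SERVICES = {"fileServer", "dodsC", "ncml", "catalog", "wcs", "wms", "ncss"}

def source_link(href: str, media_type: str = "application/x-netcdf") -> dict:
    """Build a `rel: source` link whose title is the dataset path with the
    THREDDS URL prefix and the service path element stripped."""
    parts = urlparse(href).path.split("/")
    # Drop everything up to and including the service element (e.g. fileServer).
    for i, part in enumerate(parts):
        if part in THREDDS_SERVICES:
            title = "/".join(parts[i + 1:])
            break
    else:
        title = "/".join(p for p in parts if p)
    return {"rel": "source", "title": title, "href": href, "type": media_type}
```

Applied to the fileServer URL above, this yields exactly the link object shown in the example.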

Collaborator Author

Thanks for that explanation @fmigneault. This makes sense. Let me implement this.

Collaborator Author

Okay done. If all looks good then I will complete the merge.

@huard

huard commented Oct 10, 2023

@dchandan I realize based on your comments that I may not have communicated properly that the CMIP6 extension was meant to replace the Datamodel logic, not duplicate it. I just hadn't yet removed the logic in add_CMIP6.py while I was still experimenting with extensions.

So extensions might be a bad idea, but not because they replicate existing code (points 2.1 and 2.4).

Also, I did not have to define a schema to get it to work, so not sure about this being an additional complication. If needed, pydantic can spit out the schema for us.

Basically, embedding the CMIP properties in an extension is just a way to formalize how we're adding CMIP properties into the STAC item, so that we're using the same mechanism to "extend" STAC items (datacube, cmip6, and eventually maybe other extensions).
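huard's point that "pydantic can spit out the schema for us" can be sketched as follows. The model below is a hypothetical, trimmed-down stand-in; the PR's real model (CMIP6ItemProperties) has many more fields:

```python
from pydantic import BaseModel, Field

# Hypothetical two-field model; the real CMIP6 properties model is larger.
class CMIP6Properties(BaseModel):
    activity_id: str = Field(serialization_alias="cmip6:activity_id")
    source_id: str = Field(serialization_alias="cmip6:source_id")

# Pydantic can generate a JSON schema that an extension could publish
# at its SCHEMA_URI instead of hand-writing one.
schema = CMIP6Properties.model_json_schema()
```

Whether the generated schema is then hosted and referenced by a pystac extension, or the model is simply used directly as the PR does, is exactly the trade-off being debated here.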

@huard huard left a comment

I'm getting failures regarding the datacube extension.

STACpopulator/input.py (resolved)
@dchandan

@huard Thanks for the clarification!

I think that the schema URI is required. In your code it was specified in SCHEMA_URI and used later by the extension via the class method get_schema_uri. Since that specific URI does not work (I get a 404), I am inclined to think that no validation is performed using the schema, nor even a check that the schema exists, so perhaps in that sense a working schema is not strictly required. But I think that if one is using extensions, one should have a properly constructed schema.

You're right that Pydantic's ability to spit out a schema comes in handy for this task and thereby reduces the barrier. But if one has to create a pydantic model anyway to generate a schema, I don't see why not just use that model directly rather than create another extension and then call on the pydantic model to generate the schema for it. At least in this particular case I don't see the point, since the extension would not be doing anything logically more sophisticated than what a pydantic model does. Maybe in some other case, for a different type of data, one might want an extension to incorporate additional logic. But that decision can, and in my opinion should, be made case by case.

(As a side note, it is my opinion that the way PySTAC has implemented STAC extensions is very non-pythonic which makes working with extensions a very unsatisfactory experience.)

@dchandan

I'm getting failures regarding the datacube extension.

Yeah, that extension still has to be debugged. It was not working for me from the start, and it is likely that a couple of things broke during the rearrangement.

@huard

huard commented Oct 12, 2023

I tend to agree with your assessment, but being unfamiliar with STAC, I thought it was more prudent to architect stuff using existing pathways. In practice, maybe it doesn't change anything, I'm just not confident I know the answer to this question.

In any case, there are a couple of points that could be streamlined in a future PR.

  • Serializing numpy types. I suspect this could probably be done somewhere in xncml.
  • Setting the serialization_alias for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item.

@dchandan

  • Serializing numpy types. I suspect this could probably be done somewhere in xncml.

I agree.

  • Setting the serialization_alias for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item.

I have pushed a change with a new way in which the prefix is applied. Check it out and let me know your thoughts.
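huard's suggestion (export the dict, then rewrite the keys) can be sketched as below. This illustrates the idea under discussion; it is not necessarily the change that was pushed, and the helper name is hypothetical:

```python
def apply_prefix(properties: dict[str, object], prefix: str = "cmip6") -> dict[str, object]:
    """Add a `<prefix>:` namespace to every exported property key, instead of
    declaring a serialization_alias on each pydantic field individually."""
    return {f"{prefix}:{key}": value for key, value in properties.items()}
```

One would then call something like `apply_prefix(model.model_dump())` on the pydantic model's export before ingesting the properties into the STAC item.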

@huard

huard commented Oct 13, 2023

Looks good!

@dchandan

All dict or list type hints should be more specific, using the dict[<key>, <value>] and list[<item>] notation.

All changes done, except in the implementations/CMIP6-UofT/extensions.py file which can be done in another PR related to making the Datacube extension work.

@dchandan

All code in implementations should be moved under STACpopulator so that it can be imported externally.
Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb).
For this notebook, I am particularly interested in avoiding duplicating the logic in CMIP6populator.create_stac_item.

Done

@dchandan

Also, I would rather have all the parsing logic in the stac-populator; this way we can be sure the result is identical on both ends.

I don't know what "both ends" means; I don't follow your use case. The NCML to STAC conversion is a workflow that happens on the server side, initiated by the admin to populate the STAC database. The code here does that job properly (or are you saying that there is a deficiency in the STAC representation it creates? That would be useful to know).

We can chat about what exactly you are trying to do tomorrow over a zoom call.

@dchandan

I think I've figured out what's going on. The issue is with the idiosyncrasies of how the THREDDS server returns metadata based on how the metadata is queried. We can discuss this on the call today.

@dchandan

@fmigneault I've made changes to allow the kinds of workflow we discussed last Friday. Please check it out. The creation of a single STAC item can be tested using tests/test_standalone_stac_item.py.

tests/ref.txt
359.99493408203125,
89.74176788330078
],
"stac_extensions": []
Collaborator

Still missing the cmip6 extension.

Collaborator Author

We are not using cmip6 extension. At least, in the UofT implementation, I am not going to use cmip6 extension.

Collaborator

I don't understand?
This is where all the cmip6: prefixed attributes are coming from.
This is how STAC Items represent extensions and their corresponding schemas for validation.

Using cmip6: without the extension essentially violates how STAC Items are represented.

Collaborator

What exactly is stopping us from using https://github.com/TomAugspurger/cmip6?
I see a very strong overlap with all proposed properties.

Also, the https://github.com/stac-extensions/cf extension should be considered as well for any other parameter that would not be supported by the cmip6 extension.

Collaborator Author

I've already written about this above; I would recommend taking a look at those arguments. Adding extensions, when not strictly necessary, adds complexity that makes the implementation a bad example of how to "easily" add items to STAC. There is nothing preventing another implementation from using an extension; I simply want an example that does not need extensions, for a future user who might not want to use them.

Using cmip6: without the extension essentially violates how STAC Items are represented.

Not sure what you mean here. As far as I can tell, there is nothing that requires properties with a prefix to have an associated extension; I've seen APIs that do not. Are you saying that, as far as you know, doing this breaks something?

@fmigneault fmigneault Oct 26, 2023

It is part of the recommendations of the STAC specification to use existing extensions for reusability and documentation. The main purpose of STAC is to make content accessible in a standardized manner. Since both the cmip6 and cf extensions exist, they should be used instead of custom definitions. Using custom properties without any referenced extension, undefined by any JSON schema that can validate them, does not make it easier for users/web-clients to find the attributes; it complicates things on the STAC API side. STAC is even working with OGC to align redundant standards and reduce custom implementations that essentially document the same data. Using un-prefixed properties will not "break" requirements, but it strongly goes against the STAC philosophy. It is also better to guide users in employing recommended "best practices" instead of workarounds.

I would like to stress that this implementation is not only for UofT or a demo showcase. This exact implementation would be used by CRIM and Ouranos to integrate into actual instances. Therefore, I would prefer that we all streamline the STAC generation approach with a common strategy across all organizations.
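What the reviewer is describing can be shown on a bare STAC Item fragment: prefixed properties are backed by an entry in stac_extensions pointing at the extension's JSON schema. The schema URI below is illustrative, not a confirmed published location:

```python
# Minimal STAC Item fragment (plain dict): prefixed properties are paired
# with a stac_extensions entry referencing the extension schema.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020",
    "properties": {
        "cmip6:activity_id": "ScenarioMIP",
        "cmip6:source_id": "CanESM5",
    },
    "stac_extensions": [
        # Illustrative URI; a real cmip6 extension would publish its own schema.
        "https://stac-extensions.github.io/cmip6/v1.0.0/schema.json",
    ],
}
```

With an empty stac_extensions list (as in the ref.txt snippet above), validators and clients have no schema to resolve the cmip6: properties against, which is the gap being flagged.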

tests/ref.txt
"rel": "source",
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
"type": "application/x-netcdf",
"title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
Collaborator

Still missing the thredds: prefix.

Collaborator

Just to be clear, here thredds: refers to the specific service name from https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/. If another THREDDS service was hosted as https://pavics.ouranos.ca/twitcher/ows/proxy/custom-thredds/, the title would have custom-thredds:path/to/file.nc instead.

Comment on lines +14 to +21
def test_standalone_stac_item():
url = (
"https://pavics.ouranos.ca/twitcher/ows/proxy/"
"thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
"?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2F"
"thredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html"
"&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
)
Collaborator

Should use @pytest.mark.parametrize and test multiple variations of this URL (combinations of thredds/ncml/, thredds/catalog/, .xml, .html, with/without the catalog query parameter).

In each case, it should always result in the same STAC Item.
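A sketch of that suggestion, assuming a hypothetical dataset_path helper that stands in for the populator's real URL-handling logic (building the full STAC Item would require network access, so the invariant is checked on the resolved dataset path instead):

```python
import pytest
from urllib.parse import urlparse, parse_qs

BASE = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds"
DATASET = "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"

def dataset_path(url: str) -> str:
    """Reduce any THREDDS URL variant to the dataset path it refers to
    (hypothetical helper; the real populator logic may differ)."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "dataset" in query:
        return query["dataset"][0]
    for service in ("/ncml/", "/catalog/", "/dodsC/", "/fileServer/"):
        if service in parsed.path:
            return parsed.path.split(service, 1)[1]
    return parsed.path

@pytest.mark.parametrize(
    "url",
    [
        f"{BASE}/ncml/{DATASET}",
        f"{BASE}/catalog/{DATASET}",
        f"{BASE}/ncml/{DATASET}?dataset={DATASET}",
    ],
)
def test_same_dataset(url):
    # Every URL variant must resolve to the same dataset, and hence
    # would produce the same STAC Item.
    assert dataset_path(url) == DATASET
```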

@@ -87,15 +71,26 @@ def item_properties_model(self):
models.STACItemProperties."""
pass

@property
@abstractmethod
def item_geometry_model(self):
Collaborator

missing typing

Comment on lines 79 to 83
pass

@abstractmethod
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
pass
Collaborator

For cases where these methods/properties are always called by the class, they should raise NotImplementedError.
Using pass will make those calls silently succeed and yield another error much later that can be harder to analyse.

Collaborator Author

Not true. If an abstract method is not defined, a TypeError is always raised; the content of the abstract method in the abstract class does not matter at all.

@fmigneault fmigneault Oct 24, 2023

You're right for the typical case of calling STACpopulatorBase() directly.
However, it doesn't behave the same depending on how the ABC gets called.

import abc

class BaseABC(abc.ABC):
    @abc.abstractmethod
    def method(self):
        pass

class BaseRaise(object):  # doesn't "require" ABC, but can be added as extra validation
    @abc.abstractmethod
    def method(self):
        raise NotImplementedError

class DeriveABC(BaseABC):
    def method(self):
        print("DeriveABC.method")

class DeriveRaise(BaseRaise):
    def method(self):
        print("DeriveRaise.method")

BaseABC.method(DeriveABC())  # nothing printed, no error, the abstract method was called

BaseRaise.method(DeriveRaise())  # error below
Traceback (most recent call last):
  File "/home/francis/dev/conda/envs/weaver/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-f234a81a6388>", line 1, in <module>
    BaseRaise.method(DeriveRaise())
  File "<ipython-input-5-d3d4d159a3cb>", line 4, in method
    raise NotImplementedError
NotImplementedError

Sometimes, the BaseABC.method(DeriveABC()) form is needed to handle specific MRO conditions (such as when I was overriding the classes in my notebook). Using only pass, it would not flag that the abstract method was called directly, which is probably not something that was done intentionally.

Anyway.
I believe raise NotImplementedError is much more indicative of the intention "you must implement this", and if somehow ABC is omitted, it will still report the error.

Collaborator Author

Okay, interesting use case... alright, I will change them to raise the error.

Comment on lines +23 to +25
attrs = xncml.Dataset.from_text(requests.get(url).content).to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
stac_item = STAC_item_from_metadata(stac_item_id, attrs, CMIP6ItemProperties, GeoJSONPolygon)
Collaborator

There is still a strong relationship between those components.
STAC_item_from_metadata can only parse attributes if they respect exactly the format returned by xncml.Dataset.to_cf_dict (i.e.: cfmeta = attrs["groups"]["CFMetadata"]["attributes"]).

The same applies to make_cmip6_item_id, which expects a very specific set of attributes that come from xncml.Dataset.to_cf_dict.

All of this should be a single CMIP6populator.get_stac_item method that takes the URL and does everything necessary to obtain the STAC Item without any further intervention.

Both make_cmip6_item_id and STAC_item_from_metadata should be converted to methods under CMIP6populator. There is no reason for them to exist elsewhere, given how heavily they depend on the specific formats that CMIP6populator.item_properties_model and CMIP6populator.item_geometry_model define.

Collaborator Author

Both make_cmip6_item_id and STAC_item_from_metadata should be converted to methods under CMIP6populator. There is no reason for them to exist elsewhere given how they heavily depend on the specific formats that CMIP6populator.item_properties_model and CMIP6populator.item_geometry_model defines.

Interesting suggestion... I can see what you mean. make_cmip6_item_id is easy, but STAC_item_from_metadata is tricky. I don't necessarily want to move most of that logic into CMIP6populator, as a lot of it is common to making any STAC item. But maybe one will have to... let me chew on this. It might be that we keep this for later.

Collaborator

I agree. Some part of the logic should be under STACPopulatorBase.stac_item_from_metadata, since it applies to any STAC Item generation case (for example, handling the extents, date-time, and other generic metadata).
The CMIP6populator.stac_item_from_metadata, specifically converting CF attributes to the STAC CMIP6 extension, would then do something like:

class CMIP6populator(STACPopulatorBase):
    def stac_item_from_metadata(self, item_data):
        stac_base = super().stac_item_from_metadata(item_data)
        stac_cmip6 = # specific populator logic for CF...
        stac_item = # merge stac_base + stac_cmip6
        return stac_item

@dchandan

dchandan commented Oct 24, 2023

You had commented earlier that:

The steps used to generate the STAC Item were simply:

ds = xncml.Dataset(input_ncml)
attrs = ds.to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
attrs["id"] = stac_item_id
stac_item = CFJsonItemNetCDF(stac_item_id, attrs, cmip6.Properties)

This does exactly that. I am not following the objection now....

@fmigneault

fmigneault commented Oct 24, 2023

This does exactly that. I am not following the objection now....

Indeed.
But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and for the STAC Item. What the notebook did before was not "better", just easier, with fewer overrides; but it was still a temporary implementation until the logic is entirely self-contained in stac-populator.

@dchandan

This does exactly that. I am not following the objection now....

Indeed. But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and for the STAC Item. What the notebook did before was not "better", just easier, with fewer overrides; but it was still a temporary implementation until the logic is entirely self-contained in stac-populator.

Okay. I will leave it for a later refactoring of code. Right now, I need to get this thing working and deploy it for our use.

@fmigneault

Maybe it is the versions?
https://psl.noaa.gov/thredds/catalog/catalog.xml reports version="1.2", while https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml reports version="1.0.1".

Otherwise, maybe the contents of the NetCDF files themselves differ.
I am not familiar enough with the metadata those files contain.
@huard ?

@dchandan

Maybe it is a version thing.

That complicates things. I can't use the query-parameter-based link anymore, because different servers may behave differently. I will revert to using access links by opening the catalog itself using siphon. My tests show that this approach works with both of these servers.

@fmigneault

Check if you can get the same results by overriding siphon.catalog.TDSCatalog._process_dataset with a no-op method, or something that limits how much crawling it does on __init__.

If the THREDDS server has a lot of datasets, it starts parsing everything right away, which is quickly inefficient if the purpose is only to describe a single NCML file or a subset of a specific dataset.
I tried with siphon.catalog.TDSCatalog("psl.noaa.gov/thredds/catalog/catalog.xml") and it takes some time to load 3759 datasets...

Even using siphon.catalog.TDSCatalog, I think you will need the query parameter, because the URL of the NCML itself (pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc) does not resolve the reference of its parent dataset.

@huard

huard commented Oct 27, 2023

So I don't understand why there are differences with and without query parameters. I found this out empirically by inspecting the URL of the NcML service exposed in the catalog web page, and have been using it since, because the response contains more information.

@dchandan dchandan merged commit 26fe4ad into arch-changes Nov 8, 2023
1 check failed
@dchandan

dchandan commented Nov 8, 2023

Closed as this was getting too long. Will revisit outstanding issues in future PRs.
