
Arch finalization proposal #25

Merged: 36 commits into arch-changes on Nov 8, 2023

Conversation


@dchandan dchandan commented Oct 6, 2023

Here is my proposal for finalizing the architecture of STAC populator. I have chosen to branch from arch-changes rather than overlay my changes to that branch, so once we finalize this PR we can merge it back into arch-changes and close PR #16.

There is currently only one implementation, for CMIP6 data residing on the UofT node. We expect more implementations to follow soon. Ouranos will probably want an implementation for CMIP6 data on the Ouranos node. I am also working on two other implementations: one for the NEX-GDDP data and one that populates our catalog with data from other catalogs (this implementation can probably be used verbatim by more than one Marble node).

Key changes:

  1. Extracting pydantic models into a separate file
    I moved the Pydantic models from stac_utils.py to a new models.py file.

  2. Removing cmip6 extension
    After examining the code, I came to the opinion that the cmip6 extension was not adding much value and could be removed. Here's why:

    • By the time the call to the extension was made from create_stac_item, all appropriate CMIP6 properties had already been added to the pystac.Item instance from the Pydantic data model that describes the CMIP6 Item Properties (CMIP6ItemProperties in add_CMIP6.py). Since the extension was simply adding properties that the Pydantic model was already adding, in its current form it provided no additional value.
    • Creating an extension requires one to define a schema and point to the URI for that schema. This is an additional complication that may be a barrier to those writing future ingestion scripts, so it is best avoided in an example implementation.
    • Nothing else in the code required the presence of this extension. If the extension itself was not adding anything and it was not strictly necessary, then it made sense to remove it.
    • This allowed the removal of a lot of duplicate code.
  3. Breaking up CFJsonItem
    I felt that CFJsonItem was doing two things that didn't quite fit together within one class. On the one hand, it contained logic for creating a new pystac.Item object from CF metadata; on the other, it contained logic specifically for use by the pystac.extensions.datacube extension. The former is applicable to all items created by crawling the THREDDS catalog, but the datacube extension code is not applicable to all STAC items, because the extension itself is not applicable to all types of data (or one may not want to use the extension even for data to which it is applicable). Therefore, having the extension logic in a class that gets instantiated for every catalog item didn't make sense to me.

    So, I dissolved CFJsonItem and extracted its logic as follows: code pertaining to the creation of a STAC item (more specifically, a pystac.Item instance representing a STAC item) is in a new function in stac_utils.py called STAC_item_from_metadata, while code meant to support the datacube extension is now in a new file, extensions.py, in the CMIP6-UofT implementations folder. I moved that code to the implementations folder rather than the core section because I am unsure how generally applicable it would be to other types of model output data, but this could be changed.

@dchandan dchandan changed the base branch from master to arch-changes October 6, 2023 18:52

dchandan commented Oct 6, 2023

To do: Datamodel extension wasn't working properly from the very start. We still have to get it to work.

@dchandan dchandan requested review from huard and Nazim-crim October 10, 2023 03:20
@fmigneault

@dchandan
Not sure if you were already aware of this, but there is a STAC extension for CF attributes: https://github.com/stac-extensions/cf.
I believe this is the notation that CFJsonItem should use; it is not specific to the CMIP6 or Datacube extensions, but applies to any CF attribute used in NetCDF.

All code in implementations should be moved under STACpopulator so that it can be imported externally.
Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb).
For this notebook, I am particularly interested in avoiding duplicating the logic in CMIP6populator.create_stac_item.

@fmigneault fmigneault left a comment

All dict or list type hints should be more specific, using the dict[<key>, <value>] and list[<item>] notation.
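As an illustration of the reviewer's point (the helper name below is hypothetical, not from the PR), the difference between the vague and the specific notation looks like:

```python
# Vague: callers learn nothing about the expected structure.
def media_type_names(media_types: dict) -> list:
    ...

# Specific: keys are THREDDS service names, values are media-type strings,
# and the result is a sorted list of service names.
def media_type_names_typed(media_types: dict[str, str]) -> list[str]:
    return sorted(media_types)
```

(In the actual diff, some values are pystac.MediaType enum members rather than plain strings, so the value type annotation would be adjusted accordingly.)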

implementations/CMIP6-UofT/extensions.py (resolved)
implementations/CMIP6-UofT/add_CMIP6.py (resolved)
implementations/CMIP6-UofT/extensions.py (resolved)
implementations/CMIP6-UofT/extensions.py (resolved)
STACpopulator/populator_base.py (resolved)
STACpopulator/models.py (resolved)
STACpopulator/populator_base.py (resolved)
Comment on lines 129 to 157
media_types = {
"httpserver_service": "application/x-netcdf",
"opendap_service": pystac.MediaType.HTML,
"wcs_service": pystac.MediaType.XML,
"wms_service": pystac.MediaType.XML,
"nccs_service": "application/x-netcdf",
"HTTPServer": "application/x-netcdf",
"OPENDAP": pystac.MediaType.HTML,
"NCML": pystac.MediaType.XML,
"WCS": pystac.MediaType.XML,
"ISO": pystac.MediaType.XML,
"WMS": pystac.MediaType.XML,
"NetcdfSubset": "application/x-netcdf",
}

asset_roles = {
"httpserver_service": ["data"],
"opendap_service": ["data"],
"wcs_service": ["data"],
"wms_service": ["visual"],
"nccs_service": ["data"],
"HTTPServer": ["data"],
"OPENDAP": ["data"],
"NCML": ["metadata"],
"WCS": ["data"],
"ISO": ["metadata"],
"WMS": ["visual"],
"NetcdfSubset": ["data"],
}
Collaborator

These look like THREDDS-specific metadata. They should most probably be attributes within its implementation.

Comment on lines +45 to +74
# class Link(BaseModel):
# """
# https://github.com/radiantearth/stac-spec/blob/v1.0.0/collection-spec/collection-spec.md#link-object
# """

# href: str = Field(..., alias="href", min_length=1)
# rel: str = Field(..., alias="rel", min_length=1)
# type: Optional[str] = None
# title: Optional[str] = None
# # Label extension
# label: Optional[str] = Field(None, alias="label:assets")
# model_config = ConfigDict(use_enum_values=True)

# def resolve(self, base_url: str) -> None:
# """resolve a link to the given base URL"""
# self.href = urljoin(base_url, self.href)


# class PaginationLink(Link):
# """
# https://github.com/radiantearth/stac-api-spec/blob/master/api-spec.md#paging-extension
# """

# rel: Literal["next", "previous"]
# method: Literal["GET", "POST"]
# body: Optional[Dict[Any, Any]] = None
# merge: bool = False


# Links = RootModel[List[Union[PaginationLink, Link]]]
Collaborator

Not sure if this gets lost after conversion?

At the very least, the converted STAC Item should contain the rel: source link with href pointing at the original NetCDF file and the rel: describes link with href pointing at the NCML. In the case of rel: source, it is also very important that title contains the "path" of the file without the THREDDS URL prefix and the "service" path element (dodsC, fileServer, etc.).

Collaborator Author

Sorry, I didn't quite understand your point. The access URLs are contained as assets to the item. This is code that David had but it didn't look like it was being used, so I left it there but commented it out.

@huard would you like to pitch in here with a description of what you were doing and if that was complete? Also, how we can include Francis's suggestion?

@fmigneault fmigneault Oct 13, 2023

When the STAC Item is created (POST'd to the API), some resources in Magpie are generated under the stac service to manage its permissions.
Since that item refers to a NetCDF under THREDDS, a corresponding resource exists under thredds service to manage its permissions as well.
Using bird-house/birdhouse-deploy#386, the intent is that when a permission is updated on one, it will synchronize with the other, such that viewing the NetCDF metadata from either service is allowed/denied in the same way for a given user or group.

The problem in doing the above is that the URLs of these 2 services are not "compatible".
The generated resources would be nested like so (for simplicity, I omit the nested-directory handling below) :

stac
  collections
    <directory-uuid>
      items
        <another-uuid>
thredds
  <directory-name>
    <netcdf-name>

Since one works with UUIDs and the other with path names, there is no direct way to guess the references only from the URLs. Therefore, the STAC hook will use one of the references in links, namely the rel: source one, to obtain the STAC Asset URL of the original THREDDS NetCDF location. To establish the "link" between the stac and thredds Magpie services, it will use the title value from the rel: source link to establish a resource_display_name in Magpie. That parameter will be readable from Cowbird later on to resolve the stac UUID <-> thredds name resource relationship.

For example, the following catalog reference:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/xclim/cmip6/catalog.html?dataset=birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc
Should have (at least) the following link in the STAC Item

{
  "rel": "source",
  "title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
  "href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
  "type": "application/x-netcdf"
}
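The title derivation described above (dataset path with the THREDDS URL prefix and service element stripped) can be sketched like this. This is an illustration of the behaviour, not the populator's actual helper; the function name and the set of service elements are assumptions:

```python
from urllib.parse import urlparse

# Assumed set of THREDDS service path elements; extend as needed.
THREDDS_SERVICES = {"fileServer", "dodsC", "ncml", "catalog", "wcs", "wms", "ncss"}

def source_link(href: str, media_type: str = "application/x-netcdf") -> dict:
    """Build a `rel: source` link whose title is the dataset path with the
    THREDDS URL prefix and the service path element stripped."""
    parts = urlparse(href).path.split("/")
    # Drop everything up to and including the service element (e.g. fileServer).
    for i, part in enumerate(parts):
        if part in THREDDS_SERVICES:
            title = "/".join(parts[i + 1:])
            break
    else:
        title = "/".join(p for p in parts if p)
    return {"rel": "source", "title": title, "href": href, "type": media_type}
```

Applied to the fileServer URL above, this yields exactly the link object shown in the example.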

Collaborator Author

Thanks for that explanation @fmigneault. This makes sense. Let me implement this.

Collaborator Author

Okay done. If all looks good then I will complete the merge.

@huard

huard commented Oct 10, 2023

@dchandan I realize based on your comments that I may not have communicated properly that the CMIP6 extension was meant to replace the Datamodel logic, not duplicate it. I just hadn't yet removed the logic in add_CMIP6.py while I was still experimenting with extensions.

So extensions might be a bad idea, but not because they replicate existing code (points 2.1 and 2.4).

Also, I did not have to define a schema to get it to work, so not sure about this being an additional complication. If needed, pydantic can spit out the schema for us.

Basically, embedding the CMIP properties in an extension is just a way to formalize how we're adding CMIP properties into the STAC item, so that we're using the same mechanism to "extend" STAC items (datacube, cmip6, and eventually maybe other extensions).
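huard's point that "pydantic can spit out the schema for us" can be sketched as follows. The model below is a hypothetical, trimmed-down stand-in; the PR's real model (CMIP6ItemProperties) has many more fields:

```python
from pydantic import BaseModel, Field

# Hypothetical two-field model; the real CMIP6 properties model is larger.
class CMIP6Properties(BaseModel):
    activity_id: str = Field(serialization_alias="cmip6:activity_id")
    source_id: str = Field(serialization_alias="cmip6:source_id")

# Pydantic can generate a JSON schema that an extension could publish
# at its SCHEMA_URI instead of hand-writing one.
schema = CMIP6Properties.model_json_schema()
```

Whether the generated schema is then hosted and referenced by a pystac extension, or the model is simply used directly as the PR does, is exactly the trade-off being debated here.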

@huard huard left a comment

I'm getting failures regarding the datacube extension.

STACpopulator/input.py (resolved)
@dchandan

@huard Thanks for the clarification!

I think that the schema URI is required. In your code it was specified in SCHEMA_URI and used later by the extension via the class method get_schema_uri. Since that specific URI does not work (I get a 404), I am inclined to think that no validation is performed using the schema, nor even a check that the schema exists, so perhaps in that sense a working schema is not strictly required. But I think that if one is using extensions, one should have a properly constructed schema.

You're right that Pydantic's ability to spit out a schema comes in handy for this task and thereby reduces the barrier. But if one has to create a pydantic model anyway to generate a schema, I don't see why not just use that model directly rather than create another extension and then call on the pydantic model to generate the schema for it. At least in this particular case I don't see the point, since the extension would not be doing anything logically more sophisticated than what a pydantic model does. Maybe in some other case, for a different type of data, one might want an extension to incorporate additional logic. But that decision can, and in my opinion should, be made case by case.

(As a side note, it is my opinion that the way PySTAC has implemented STAC extensions is very non-pythonic which makes working with extensions a very unsatisfactory experience.)

@dchandan

I'm getting failures regarding the datacube extension.

Yeah, that extension still has to be debugged. It was not working for me from the start, and it is likely that a couple of things broke during the rearrangement.

@huard

huard commented Oct 12, 2023

I tend to agree with your assessment, but being unfamiliar with STAC, I thought it was more prudent to architect stuff using existing pathways. In practice, maybe it doesn't change anything, I'm just not confident I know the answer to this question.

In any case, there are a couple of points that could be streamlined in a future PR.

  • Serializing numpy types. I suspect this could probably be done somewhere in xncml.
  • Setting the serialization_alias for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item.

@dchandan

  • Serializing numpy types. I suspect this could probably be done somewhere in xncml.

I agree.

  • Setting the serialization_alias for every property seems overly verbose. I think we could just export the dict and then modify the keys to add the prefix before ingestion into the STAC item.

I have pushed a change with a new way in which the prefix is applied. Check it out and let me know your thoughts.
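huard's suggestion (export the dict, then rewrite the keys) can be sketched as below. This illustrates the idea under discussion; it is not necessarily the change that was pushed, and the helper name is hypothetical:

```python
def apply_prefix(properties: dict[str, object], prefix: str = "cmip6") -> dict[str, object]:
    """Add a `<prefix>:` namespace to every exported property key, instead of
    declaring a serialization_alias on each pydantic field individually."""
    return {f"{prefix}:{key}": value for key, value in properties.items()}
```

One would then call something like `apply_prefix(model.model_dump())` on the pydantic model's export before ingesting the properties into the STAC item.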

@huard

huard commented Oct 13, 2023

Looks good!

@dchandan

All dict or list type hints should be more specific, using the dict[<key>, <value>] and list[<item>] notation.

All changes done, except in the implementations/CMIP6-UofT/extensions.py file which can be done in another PR related to making the Datacube extension work.

@dchandan

All code in implementations should be moved under STACpopulator so that it can be imported externally.
Right now, it is not installed in the same module, which makes it unusable by other code bases that could reuse the implementations (e.g.: https://github.com/crim-ca/ncml2stac/blob/main/notebooks/ncml2stac.ipynb).
For this notebook, I am particularly interested in avoiding duplicating the logic in CMIP6populator.create_stac_item.

Done

@dchandan

Also, I would rather have all the parsing logic in the stac-populator; this way we can be sure the result is identical on both ends.

I don't know what "both ends" means; I don't follow your use case. The NCML to STAC conversion is a workflow that happens on the server side, initiated by the admin to populate the STAC database. The code here does that job properly (or are you saying that there is a deficiency in the STAC representation it creates? That would be useful to know).

We can chat about what exactly you are trying to do tomorrow over a zoom call.

@dchandan

I think I've figured out what's going on. The issue is with the idiosyncrasies of how the THREDDS server returns metadata based on how the metadata is queried. We can discuss this on the call today.

@dchandan

@fmigneault I've made changes to allow the kinds of workflow we discussed last Friday. Please check it out. The creation of a single STAC item can be tested using tests/test_standalone_stac_item.py.

tests/ref.txt
359.99493408203125,
89.74176788330078
],
"stac_extensions": []
Collaborator

Still missing the cmip6 extension.

Collaborator Author

We are not using cmip6 extension. At least, in the UofT implementation, I am not going to use cmip6 extension.

Collaborator

I don't understand?
This is where all the cmip6: prefixed attributes are coming from.
This is how STAC Items represent extensions and their corresponding schemas for validation.

Using cmip6: without the extension essentially violates how STAC Items are represented.

Collaborator

What exactly is stopping us from using https://github.com/TomAugspurger/cmip6?
I see a very strong overlap with all proposed properties.

Also, the https://github.com/stac-extensions/cf extension should be considered as well for any other parameter that would not be supported by the cmip6 extension.

Collaborator Author

I've already written about this above; I would recommend taking a look at those arguments. Adding extensions, when not strictly necessary, adds complexity that makes the implementation a bad example of how to "easily" add items to STAC. There is nothing preventing another implementation from using an extension; I simply want an example that does not need extensions, for a future user who might not want to use them.

Using cmip6: without the extension essentially violates how STAC Items are represented.

Not sure what you mean here. As far as I can tell, there is nothing that requires properties with a prefix to have an associated extension; I've seen APIs that do not. Are you saying that, as far as you know, doing this breaks something?

@fmigneault fmigneault Oct 26, 2023

It is part of the recommendations of the STAC specification to use existing extensions for reusability and documentation. The main purpose of STAC is to make content accessible in a standardized manner. Since both the cmip6 and cf extensions exist, they should be used instead of custom definitions. Using custom properties without any referenced extension, undefined by any JSON schema that can validate them, does not make it easier for users/web-clients to find the attributes; it complicates things on the STAC API side. STAC is even working with OGC to align redundant standards and reduce custom implementations that essentially document the same data. Using un-prefixed properties will not "break" requirements, but it strongly goes against the STAC philosophy. It is also better to guide users in employing recommended "best practices" instead of workarounds.

I would like to stress that this implementation is not only for UofT or a demo showcase. This exact implementation would be used by CRIM and Ouranos to integrate into actual instances. Therefore, I would prefer that we all streamline the STAC generation approach with a common strategy across all organizations.
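What the reviewer is describing can be shown on a bare STAC Item fragment: prefixed properties are backed by an entry in stac_extensions pointing at the extension's JSON schema. The schema URI below is illustrative, not a confirmed published location:

```python
# Minimal STAC Item fragment (plain dict): prefixed properties are paired
# with a stac_extensions entry referencing the extension schema.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020",
    "properties": {
        "cmip6:activity_id": "ScenarioMIP",
        "cmip6:source_id": "CanESM5",
    },
    "stac_extensions": [
        # Illustrative URI; a real cmip6 extension would publish its own schema.
        "https://stac-extensions.github.io/cmip6/v1.0.0/schema.json",
    ],
}
```

With an empty stac_extensions list (as in the ref.txt snippet above), validators and clients have no schema to resolve the cmip6: properties against, which is the gap being flagged.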

tests/ref.txt
"rel": "source",
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
"type": "application/x-netcdf",
"title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
Collaborator

Still missing the thredds: prefix.

Collaborator

Just to be clear, here thredds: refers to the specific service name from https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/. If another THREDDS service was hosted as https://pavics.ouranos.ca/twitcher/ows/proxy/custom-thredds/, the title would have custom-thredds:path/to/file.nc instead.

Comment on lines +14 to +21
def test_standalone_stac_item():
url = (
"https://pavics.ouranos.ca/twitcher/ows/proxy/"
"thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
"?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2F"
"thredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html"
"&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
)
Collaborator

Should use @pytest.mark.parametrize and test multiple variations of this URL (combinations of thredds/ncml/, thredds/catalog/, .xml, .html, with/without the catalog query parameter).

In each case, it should always result in the same STAC Item.
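A sketch of that suggestion, assuming a hypothetical dataset_path helper that stands in for the populator's real URL-handling logic (building the full STAC Item would require network access, so the invariant is checked on the resolved dataset path instead):

```python
import pytest
from urllib.parse import urlparse, parse_qs

BASE = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds"
DATASET = "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"

def dataset_path(url: str) -> str:
    """Reduce any THREDDS URL variant to the dataset path it refers to
    (hypothetical helper; the real populator logic may differ)."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "dataset" in query:
        return query["dataset"][0]
    for service in ("/ncml/", "/catalog/", "/dodsC/", "/fileServer/"):
        if service in parsed.path:
            return parsed.path.split(service, 1)[1]
    return parsed.path

@pytest.mark.parametrize(
    "url",
    [
        f"{BASE}/ncml/{DATASET}",
        f"{BASE}/catalog/{DATASET}",
        f"{BASE}/ncml/{DATASET}?dataset={DATASET}",
    ],
)
def test_same_dataset(url):
    # Every URL variant must resolve to the same dataset, and hence
    # would produce the same STAC Item.
    assert dataset_path(url) == DATASET
```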

@@ -87,15 +71,26 @@ def item_properties_model(self):
models.STACItemProperties."""
pass

@property
@abstractmethod
def item_geometry_model(self):
Collaborator

missing typing

Comment on lines 79 to 83
pass

@abstractmethod
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
pass
Collaborator

For cases where these methods/properties are always called by the class, they should raise NotImplementedError.
Using pass will make those calls silently succeed and yield another error much later that can be harder to analyse.

Collaborator Author

Not true. If an abstract method is not defined, a TypeError is always raised; the content of the abstract method in the abstract class does not matter at all.

@fmigneault fmigneault Oct 24, 2023

You're right for the typical case of calling STACpopulatorBase() directly.
However, it doesn't behave the same depending on how the ABC gets called.

import abc

class BaseABC(abc.ABC):
    @abc.abstractmethod
    def method(self):
        pass

class BaseRaise(object):  # doesn't "require" ABC, but can be added as extra validation
    @abc.abstractmethod
    def method(self):
        raise NotImplementedError

class DeriveABC(BaseABC):
    def method(self):
        print("DeriveABC.method")

class DeriveRaise(BaseRaise):
    def method(self):
        print("DeriveRaise.method")

BaseABC.method(DeriveABC())  # nothing printed, no error, the abstract method was called

BaseRaise.method(DeriveRaise())  # error below
Traceback (most recent call last):
  File "/home/francis/dev/conda/envs/weaver/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-f234a81a6388>", line 1, in <module>
    BaseRaise.method(DeriveRaise())
  File "<ipython-input-5-d3d4d159a3cb>", line 4, in method
    raise NotImplementedError
NotImplementedError

Sometimes, the BaseABC.method(DeriveABC()) form is needed to handle specific MRO conditions (such as when I was overriding the classes in my notebook). Using only pass, it would not flag that the abstract method was called directly, which is probably not something that was done intentionally.

Anyway.
I believe raise NotImplementedError is much more indicative of the intention "you must implement this", and if somehow ABC is omitted, it will still report the error.

Collaborator Author

Okay, interesting use case... alright, I will change them to raise the error.

Comment on lines +23 to +25
attrs = xncml.Dataset.from_text(requests.get(url).content).to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
stac_item = STAC_item_from_metadata(stac_item_id, attrs, CMIP6ItemProperties, GeoJSONPolygon)
Collaborator

There is still a strong relationship between those components.
STAC_item_from_metadata can only parse attributes if they respect exactly the format returned by xncml.Dataset.to_cf_dict (i.e.: cfmeta = attrs["groups"]["CFMetadata"]["attributes"]).

The same applies to make_cmip6_item_id, which expects a very specific set of attributes that come from xncml.Dataset.to_cf_dict.

All of this should be a single CMIP6populator.get_stac_item method that takes the URL and does everything necessary to obtain the STAC Item without any further intervention.

Both make_cmip6_item_id and STAC_item_from_metadata should be converted to methods under CMIP6populator. There is no reason for them to exist elsewhere, given how heavily they depend on the specific formats that CMIP6populator.item_properties_model and CMIP6populator.item_geometry_model define.

Collaborator Author

Both make_cmip6_item_id and STAC_item_from_metadata should be converted to methods under CMIP6populator. There is no reason for them to exist elsewhere given how they heavily depend on the specific formats that CMIP6populator.item_properties_model and CMIP6populator.item_geometry_model defines.

Interesting suggestion... I can see what you mean. make_cmip6_item_id is easy, but STAC_item_from_metadata is tricky. I don't necessarily want to move most of that logic into CMIP6populator, as a lot of it is common to making any STAC item. But maybe one will have to... let me chew on this. It might be that we keep this for later.

Collaborator

I agree. Some part of the logic should be under STACPopulatorBase.stac_item_from_metadata, since it applies to any STAC Item generation case (for example, handling the extents, date-time, and other generic metadata).
The CMIP6populator.stac_item_from_metadata, specifically converting CF attributes to the STAC CMIP6 extension, would then do something like:

class CMIP6populator(STACPopulatorBase):
    def stac_item_from_metadata(self, item_data):
        stac_base = super().stac_item_from_metadata(item_data)
        stac_cmip6 = # specific populator logic for CF...
        stac_item = # merge stac_base + stac_cmip6
        return stac_item

@dchandan

dchandan commented Oct 24, 2023

You had commented earlier that:

The steps used to generate the STAC Item were simply:

ds = xncml.Dataset(input_ncml)
attrs = ds.to_cf_dict()
stac_item_id = make_cmip6_item_id(attrs["attributes"])
attrs["id"] = stac_item_id
stac_item = CFJsonItemNetCDF(stac_item_id, attrs, cmip6.Properties)

This does exactly that. I am not following the objection now....

@fmigneault

fmigneault commented Oct 24, 2023

This does exactly that. I am not following the objection now....

Indeed.
But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and for the STAC Item. What the notebook did before was not "better", just easier, with fewer overrides; but it was still a temporary implementation until the logic is entirely self-contained in stac-populator.

@dchandan

This does exactly that. I am not following the objection now....

Indeed. But as we discussed in our 1-on-1 call, I would rather have a single operation without any specific logic related to generating the attributes from the NCML and for the STAC Item. What the notebook did before was not "better", just easier, with fewer overrides; but it was still a temporary implementation until the logic is entirely self-contained in stac-populator.

Okay. I will leave it for a later refactoring of code. Right now, I need to get this thing working and deploy it for our use.

@fmigneault

Maybe it is the versions?
https://psl.noaa.gov/thredds/catalog/catalog.xml reports version="1.2", while https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.xml reports version="1.0.1".

Otherwise, maybe the contents of the NetCDF files themselves differ.
I am not familiar enough with the metadata those files contain.
@huard ?

@dchandan

Maybe it is a version thing.

That complicates things. I can't use the query-parameter-based link anymore, because different servers may behave differently. I will revert to using access links by opening the catalog itself using siphon. My tests show that this approach works with both of these servers.

@fmigneault

Check if you can get the same results by overriding siphon.catalog.TDSCatalog._process_dataset with a no-op method, or something that limits how much crawling it does on __init__.

If the THREDDS server has a lot of datasets, it starts parsing everything right away, which is quickly inefficient if the purpose is only to describe a single NCML file or a subset of a specific dataset.
I tried with siphon.catalog.TDSCatalog("psl.noaa.gov/thredds/catalog/catalog.xml") and it takes some time to load 3759 datasets...

Even using siphon.catalog.TDSCatalog, I think you will need the query parameter, because the URL of the NCML itself (pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc) does not resolve the reference of its parent dataset.

@huard

huard commented Oct 27, 2023

So I don't understand why there are differences with and without query parameters. I found this out empirically by inspecting the URL of the NcML service exposed in the catalog web page, and have been using it since, because the response contains more information.

@dchandan dchandan merged commit 26fe4ad into arch-changes Nov 8, 2023
1 check failed
@dchandan

dchandan commented Nov 8, 2023

Closed as this was getting too long. Will revisit outstanding issues in future PRs.
