-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arch finalization proposal #25
Conversation
To do: Datamodel extension wasn't working properly from the very start. We still have to get it to work. |
@dchandan All code in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All typing dict
or list
should be more specific types with dict[<key>, <items>]
and list[<item>]
notation.
STACpopulator/stac_utils.py
Outdated
media_types = { | ||
"httpserver_service": "application/x-netcdf", | ||
"opendap_service": pystac.MediaType.HTML, | ||
"wcs_service": pystac.MediaType.XML, | ||
"wms_service": pystac.MediaType.XML, | ||
"nccs_service": "application/x-netcdf", | ||
"HTTPServer": "application/x-netcdf", | ||
"OPENDAP": pystac.MediaType.HTML, | ||
"NCML": pystac.MediaType.XML, | ||
"WCS": pystac.MediaType.XML, | ||
"ISO": pystac.MediaType.XML, | ||
"WMS": pystac.MediaType.XML, | ||
"NetcdfSubset": "application/x-netcdf", | ||
} | ||
|
||
asset_roles = { | ||
"httpserver_service": ["data"], | ||
"opendap_service": ["data"], | ||
"wcs_service": ["data"], | ||
"wms_service": ["visual"], | ||
"nccs_service": ["data"], | ||
"HTTPServer": ["data"], | ||
"OPENDAP": ["data"], | ||
"NCML": ["metadata"], | ||
"WCS": ["data"], | ||
"ISO": ["metadata"], | ||
"WMS": ["visual"], | ||
"NetcdfSubset": ["data"], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look like THREDDS-specific metadata. They should most probably be attributes within its implementation.
# class Link(BaseModel): | ||
# """ | ||
# https://github.com/radiantearth/stac-spec/blob/v1.0.0/collection-spec/collection-spec.md#link-object | ||
# """ | ||
|
||
# href: str = Field(..., alias="href", min_length=1) | ||
# rel: str = Field(..., alias="rel", min_length=1) | ||
# type: Optional[str] = None | ||
# title: Optional[str] = None | ||
# # Label extension | ||
# label: Optional[str] = Field(None, alias="label:assets") | ||
# model_config = ConfigDict(use_enum_values=True) | ||
|
||
# def resolve(self, base_url: str) -> None: | ||
# """resolve a link to the given base URL""" | ||
# self.href = urljoin(base_url, self.href) | ||
|
||
|
||
# class PaginationLink(Link): | ||
# """ | ||
# https://github.com/radiantearth/stac-api-spec/blob/master/api-spec.md#paging-extension | ||
# """ | ||
|
||
# rel: Literal["next", "previous"] | ||
# method: Literal["GET", "POST"] | ||
# body: Optional[Dict[Any, Any]] = None | ||
# merge: bool = False | ||
|
||
|
||
# Links = RootModel[List[Union[PaginationLink, Link]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this gets lost after conversion?
At the very least, the converted STAC Item should contain the rel: source
with href
point at the original NetCDF file and the rel: describes
with href
pointing at the NCML. In the case of rel: source
, it is also very important that title
contains the "path" of the file without the THREDDS URL prefix and the "service" path element (dodsC, fileServer, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't quite understand your point. The access URLs are contained as assets to the item. This is code that David had but it didn't look like it was being used, so I left it there but commented it out.
@huard would you like to pitch in here with a description of what you were doing and if that was complete? Also, how we can include Francis's suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the STAC Item is created (POST'd to the API), some resources in Magpie are generated under the stac
service to manage its permissions.
Since that item refers to a NetCDF under THREDDS, a corresponding resource exists under thredds
service to manage its permissions as well.
Using bird-house/birdhouse-deploy#386, the intent is that when a permission is updated on one, it will synchronize with the other, such that viewing the NetCDF metadata from either service is allowed/denied in the same way for a given user or group.
The problem in doing the above is that the URLs of these 2 services are not "compatible".
The generated resources would be nested like so (for simplicity, I omit the nested-directory handling below) :
stac
collections
<directory-uuid>
items
<another-uuid>
thredds
<directory-name>
<netcdf-name>
Since one works with UUIDs and the other with path names, there is no direct way to guess the references only from the URLs. Therefore, the STAC hook will use one of the references in links
, namely the rel: source
one, to obtain the STAC Asset URL of the original THREDDS NetCDF location. To establish the "link" between the stac
and thredds
Magpie services, it will use the title
value from the rel: source
link to establish a resource_display_name
in Magpie. That parameter will be readable from Cowbird later on to resolve the stac
UUID <-> thredds
name resource relationship.
For example, the following catalog reference:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/xclim/cmip6/catalog.html?dataset=birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc
Should have (at least) the following link in the STAC Item
{
"rel": "source",
"title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
"type": "application/x-netcdf"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that explanation @fmigneault. This makes sense. Let me implement this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay done. If all looks good then I will complete the merge.
@dchandan I realize based on your comments that I may have not communicated properly the fact that the CMIP6 extension was meant to replace the Datamodel logic, not duplicate it. I just didn't remove yet the logic in add_CMIP6.py while I was still experimenting with extensions. So extensions might be a bad idea, but not because it replicates existing code (points 2.1 and 2.4). Also, I did not have to define a schema to get it to work, so not sure about this being an additional complication. If needed, pydantic can spit out the schema for us. Basically, embedding the CMIP properties in an extension is just a way to formalize how we're adding CMIP properties into the STAC item, so that we're using the same mechanism to "extend" STAC items (datacube, cmip6, and eventually maybe other extensions). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting failures regarding the datacube extension.
@huard Thanks for the clarification! I think that the scheme URI is required. In your code it was specified in You're right about Pydantic's ability to spit out a schema coming in handy for this task and thereby reducing the barrier. But if one has to create a pydantic model anyways to generate a schema then I don't see why not just use that model rather than create another extension and then call on the pydantic model to generate the schema for use in the extension. At least in this particular case I don't see the point since the extension would not be doing anything more logically sophisticated than what a pydantic model does. Maybe in some other case, for a different type of data, one might want an extension to incorporate additional logic. But that decision can, and in my opinion, should be case by case. (As a side note, it is my opinion that the way PySTAC has implemented STAC extensions is very non-pythonic which makes working with extensions a very unsatisfactory experience.) |
Yeah, that extension still has to be debuged. It was not working for me from the start, and it is likely that a couple things might have broken during the rearrangement. |
I tend to agree with your assessment, but being unfamiliar with STAC, I thought it was more prudent to architect stuff using existing pathways. In practice, maybe it doesn't change anything, I'm just not confident I know the answer to this question. In any case, there are a couple of points that could be streamlined in a future PR.
|
I agree.
I have pushed a change with a new way in which the prefix is applied. Check it out and let me know your thoughts. |
Looks good ! |
All changes done, except in the |
Done |
I don't know what We can chat about what exactly you are trying to do tomorrow over a zoom call. |
I think I've figured out what's going on. The issue is with the idiosyncrasies of how the THREDDS server returns metadata based on how the metadata is queried. We can discuss this on the call today. |
@fmigneault I've made changes to allow the kinds of workflow we discussed last Friday. Please check it out. The creation of a single STAC item can be tested using |
tests/ref.txt
Outdated
359.99493408203125, | ||
89.74176788330078 | ||
], | ||
"stac_extensions": [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing the cmip6
extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not using cmip6 extension. At least, in the UofT implementation, I am not going to use cmip6 extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand?
This is where all the cmip6:
prefixed attributes are coming from.
This is how STAC Items represent extensions and their corresponding schemas for validation.
Using cmip6:
without the extension essentially violates how STAC Items are represented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is stoping us from using https://github.com/TomAugspurger/cmip6?
I see a very strong overlap with all proposed properties.
Also, the https://github.com/stac-extensions/cf extension should be considered as well for any other parameter that would not be supported by the cmip6
extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've already written about this above. I would recommend taking a look at those arguments above. Adding extensions, when not strictly necessary, adds complexity that makes the implementation a bad example to others for how to "easily" add items to STAC. There is nothing preventing another implementation for using an extension, I simply want to have an example that does not need to use extensions for a user in future who might not want to use extensions.
Using cmip6: without the extension essentially violates how STAC Items are represented.
Not sure what you mean here. As far as I can tell, there is nothing that requires properties with a prefix style to have associated extension. I've seen APIs that do not. Are you saying that as far as you know, doing this breaks something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is part of the recommendations of STAC specification to use existing extensions for reusability and documentation. It is the main purpose of STAC to make content accessible in a standardized manner. Since both cmip6
and cf
extensions exist, they should be used instead of custom definitions. Using custom properties without any referenced extension and undefined by any JSON-spec that can validate them does not make it easier for users/web-clients to find the attributes, it complicates it on the STAC API side. STAC are even working with OGC to align redundant standards and reduce custom implementations essentially documentating the same data. Using un-prefixed properties will not "break" requirements, but it will strongly go against the STAC phylosophy. It is also better to guide users in employing recommended "best practices" instead of using workarounds.
I would like to stress that this implementation is not only for UoT or a demo showcase. This exact implementation would be used by CRIM and Ouranos to integrate it in actual instances. Therefore, I would prefer that we all streamline the STAC generation approach with a common strategy across all organizations.
tests/ref.txt
Outdated
"rel": "source", | ||
"href": "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc", | ||
"type": "application/x-netcdf", | ||
"title": "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing the thredds:
prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, here thredds:
refers to the specific service name from https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/
. If another THREDDS service was hosted as https://pavics.ouranos.ca/twitcher/ows/proxy/custom-thredds/
, the title would have custom-thredds:path/to/file.nc
instead.
def test_standalone_stac_item(): | ||
url = ( | ||
"https://pavics.ouranos.ca/twitcher/ows/proxy/" | ||
"thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc" | ||
"?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2F" | ||
"thredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html" | ||
"&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use @pytest.mark.parametrize
and test for multiple variations of this URL (combinations with thredds/ncml/
, thredds/catalog/
, .xml
, .html
, with/without the catalog
query parameter).
In each case, it should always result in the same STAC Item.
@@ -87,15 +71,26 @@ def item_properties_model(self): | |||
models.STACItemProperties.""" | |||
pass | |||
|
|||
@property | |||
@abstractmethod | |||
def item_geometry_model(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing typing
STACpopulator/populator_base.py
Outdated
pass | ||
|
||
@abstractmethod | ||
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cases were these method/properties are always called by the class, they should do raise NotImplementedError
.
Using pass
will make those calls silently pass and yield another error much later that can be harder to analyse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not true. If an abstract method is not defined always a TypeError
is raised. Does not matter at all what is the content of the abstract method in the abstract class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right for the typical case of calling STACpopulatorBase()
directly.
However, it doesn't behave the same depending on how the ABC
gets called.
import abc
class BaseABC(abc.ABC):
@abc.abstractmethod
def method(self):
pass
class BaseRaise(object): # doesn't "require" ABC, but can be added as extra validation
@abc.abstractmethod
def method(self):
raise NotImplementedError
class DeriveABC(BaseABC):
def method(self):
print("DeriveABC.method")
class DeriveRaise(BaseRaise):
def method(self):
print("DeriveABC.method")
BaseABC.method(DeriveABC()) # nothing printed, no error, the abstract method was called
BaseRaise.method(DeriveRaise()) # error below
Traceback (most recent call last):
File "/home/francis/dev/conda/envs/weaver/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-f234a81a6388>", line 1, in <module>
BaseRaise.method(DeriveRaise())
File "<ipython-input-5-d3d4d159a3cb>", line 4, in method
raise NotImplementedError
NotImplementedError
Sometimes, the BaseABC.method(DeriveABC())
format is needed to handle specific MRO conditions (such as when I was overriding the classes in my notebook). Using only pass
, it would not flag that the abstract class was called directly, probably not something that was done intentionnaly.
Anyway.
I believe raise NotImplementedError
is much more indicative of the intension "you must implement this", and if somehow ABC
is omitted, it will still report the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay interesting use case..... alright, will change to raising error.
attrs = xncml.Dataset.from_text(requests.get(url).content).to_cf_dict() | ||
stac_item_id = make_cmip6_item_id(attrs["attributes"]) | ||
stac_item = STAC_item_from_metadata(stac_item_id, attrs, CMIP6ItemProperties, GeoJSONPolygon) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a strong relationship between those components.
STAC_item_from_metadata
can only parse attributes if they respect exactly the format returned by xncml.Datase.to_cf_dict
(i.e.: cfmeta = attrs["groups"]["CFMetadata"]["attributes"]
).
The same applies for make_cmip6_item_id
that expect a very specific set of attributes that come from xncml.Datase.to_cf_dict
.
All this block should be a single CMIP6populator.get_stac_item
method that takes the URL, and does everything necessary to obtain the STAC Item without any futher intervention.
Both make_cmip6_item_id
and STAC_item_from_metadata
should be converted to methods under CMIP6populator
. There is no reason for them to exist elsewhere given how they heavily depend on the specific formats that CMIP6populator.item_properties_model
and CMIP6populator.item_geometry_model
defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
make_cmip6_item_id
andSTAC_item_from_metadata
should be converted to methods underCMIP6populator
. There is no reason for them to exist elsewhere given how they heavily depend on the specific formats thatCMIP6populator.item_properties_model
andCMIP6populator.item_geometry_model
defines.
Interesting suggestion... I can see what you mean. make_cmip6_item_id
is easy, but STAC_item_from_metadata
is tricky. I don't necessarily want to move most of that logic into CMIP6populator
as there is a lot there that is common to making any STAC item. But maybe one will have to.... Let me chew on this. It might be that we keep this for a later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Some part of the logic should be under STACPopulatorBase.stac_item_from_metadata
, since it applies to any STAC Item generation case (for example, handling the extents, date-time, and other generic metadata).
The CMIP6populator.stac_item_from_metadata
specificatlly for converting CF attributes to STAC CMIP6 extension would then do something like:
class CMIP6populator(STACPopulatorBase):
def stac_item_from_metadata(self, item_data):
stac_base = super().stac_item_from_metadata(item_data)
stac_cmip6 = # specific populator logic for CF...
stac_item = # merge stac_base + stac_cmip6
return stac_item
You had commented earlier that:
This does exactly that. I am not following the objection now.... |
Indeed. |
Okay. I will leave it for a later refactoring of code. Right now, I need to get this thing working and deploy it for our use. |
@fmigneault Any idea why this "https://psl.noaa.gov/thredds/ncml/Datasets/NARR/Dailies/pressure/air.197901.nc?catalog=http://psl.noaa.gov/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.html&dataset=Datasets/NARR/Dailies/pressure/air.197901.nc" catalog does not contain the same information as "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"? Both have query parameters, but they behave differently. While the Ouranos one gives service URLs, the other one does not. |
Maybe it is the versions? Otherwise, maybe the contents of the NetCDF themselves differ. |
Maybe it is a version thing. That complicates things. I can't use the query parameter based link anymore because different servers may behave differently. I will revert back to using access links by opening the catalog itself using siphon. My tests show that way works with both of these servers. |
Check if you can get the same results by overriding If the THREDDS server has a lot of datasets, it starts parsing everything right away, which is quickly inefficient if the purpose is only to describe a single NCML file or a subset of a specific dataset. Even using |
So I don't understand why there are differences with and without query parameters. I found out empirically by inspecting the URL of the NcML service exposed in the catalog web page, and have been using it since because the response contains more information. |
Closed as this was getting too long. Will revisit outstanding issues in future PRs. |
Here is my proposal for finalizing the architecture of STAC populator. I have chosen to branch from
arch-changes
rather than overlay my changes to that branch, so once we finalize this PR we can merge it back intoarch-changes
and close PR #16.There is currently only one implementation, for CMIP6 data residing on the UofT node. We expect more implementations to follow soon. Ouranos will probably want to have an implementation for CMIP6 data on the Ouranos node. I am also working on two other implemetations, one for the NEX-GDDP data and one that populates our catalog with data from other catalogs (this implemetation can probably be used verbatim by more than one Marble node.)
Key changes:
Extracting pydantic models into a separate file
I moved the Pydantic models from
stac_utils.py
to a newmodels.py
file.Removing cmip6 extension
After examination of the code I came to be of the opinion that the cmip6 extension was not adding much value and could be removed. Here's why:
create_stac_item
all appropriate CMIP6 properties were already added to thepystac.Item
instance from the Pydantic data model that describes the CMIP6 Item Properties (CMIP6ItemProperties
inadd_CMIP6.py
). Since the extension was simply adding properties to the Item that were already being added by the Pydantic model, in its form it was not adding any additional value.Breaking up
CFJsonItem
I felt that the
CFJsonItem
was doing two things that didn't quite fit together within one class. On the one hand it contained logic for the creation of a newpystac.Item
object from CF metadata and on the other hand there was logic that was specifically for use by thepystac.extensions.datacube
extension. The former is applicable to all items that are created based on crawling the THREDDS catalog, but thedatacube
extension code is not applicable to all STAC items because the extension itself is not applicable to all types of data (or maybe one doesn't want to use the extension even for data to which it is applicable). Therefore, having the extension logic in that class that gets instantiated for every catalog item didn't make sense to me.So, I dissolved
CFJsonItem
and extracted its logic as follows: code that pertains to the creation of STAC item (more specifically apystac.Item
class representing a STAC item) is in a new function instac_utils.py
calledSTAC_item_from_metadata
while code meant to support thedatacube
extension is now in a new fileextensions.py
in theCMIP6-UofT
implementations folder. I moved that code to the implementations folder rather than in the core section because I am unsure how generally applicable this code would be to other types of model output data, but this could be changed.