Tracking Item Provenance #179
Comments
The dataset group was talking about that, too. We have not really thought that out, but were thinking of something like that: Process graph extension (pg) - Items and Datasets
@omad This is great. I recently have been creating derived items and was doing something similar: adding in links from the items that were used. I'm not sure "type" is really all that necessary in the links though; the record that is linked to should provide that information, or at least "type" should not be used in this way. I think "name" or "id" could be used to distinguish between the links, but "type" implies a specific type category, and these categories have not been defined to include things like "surface-reflectance". |
I'm a strong advocate for standardizing provenance and expressing it as metadata, but I'm not convinced it belongs in STAC. By design, STAC should be limited to the subset of metadata that services data discovery. It seems unlikely that most clients will need to make provenance (at least detailed provenance) part of the discovery process. I suggest that this is within the purview of the ARD group. Perhaps they will define a detailed provenance model, and summary labels/levels of provenance claims. Such a provenance label would belong in STAC, but not the entire provenance description. |
I would like to use the processing chain/provenance standards, once they exist, within the Earth Engine catalog - not just for obvious provenance-displaying purposes (at both dataset and item levels), but also for error representation. Specifically, when a final EE asset is missing due to an error at one of the previous steps that I can track, I'd like to display it to the users (and eventually build some tools for summarizing/fixing the issues). Not sure whether this fits into ARD, STAC, or both - the upshot is that I would like to have a detailed provenance standard for operational purposes. |
I agree that in an ideal world someone else standardizes provenance, and we just incorporate it. Especially a detailed provenance model. But I do feel it's so important to get right as we move to the cloud, that I'd like to see us at least have a couple of tags that point the right way. And I think the ARD stuff is moving slower than we need it. I do agree with the measure of what's in STAC as 'is it part of the discovery process'. But I think being able to do a search of images derived_from a particular landsat scene is incredibly valuable, and something that has been missing in the geospatial world. I also think the 'links' section of STAC should be very welcoming, as good links will enable future crawlers to be a lot more effective. So I do think we should put the data production information in its own repo - draw that line to not be STAC (though happy if people from here collaborate on it, and it's a set of work I very much want to see). But I think the 'derived_from' STAC extension that describes the use of that link makes good sense. And I think I'd include the 'software' link in there too. @simonff and @omad - I can start a repo for you under RadiantEarth github if you want a neutral place to collaborate on processing chain stuff. |
I like the idea of a processing chain extension, but in the meantime I think a simple accepted way of defining parents (i.e. derived_from) in the links is a good way to start. Just having the ability to create derived products and easily follow links to the items that were used to generate them is extremely useful. |
I also would like to have derived_from added to the spec. Depending on the level of detail this could be an extension/profile or not. Just linking to derived datasets could probably be in the core, but if we add things like software etc. it sounds more suitable to make it an extension/profile. The process chain extension may be inspired by openEO, as we already have some specification on how to describe process chains in JSON in general. Other than that, I think it's a good idea to get the provenance out of the STAC repo. It's just such an enormous effort that it should be spec'ed separately and eventually get adopted by STAC. |
I support all of the above - any steps we make toward improving provenance are great. |
Cool. 1 field added sounds good. I'm inclined to just put it in 'core', especially since it's not required, and it's no change to the json schema, etc. It's just a particular type of link that we mention as being useful. @hgs-msmith - does that sound ok to you? Or you want it in an extension? Or feel it shouldn't be in at all? Anyone up to make a PR? Would also be great if we can get a catalog that shows it, even just a mini-catalog sample. @omad - can you share your catalogs? Would be psyched to add one or two to our implementation list whenever you're ready, and hopefully they show this. Would be great for stac-browser to also display that link. cc @mojodna |
Elaborating on my earlier comment... Imagine for a moment that an ARD standard exists, and that as a result ARD defines many standard processing chains and that each has a unique identifier (let's call it an ARD-code). Because it's a standard, everyone knows (or can easily look up) the detailed definition of the processing chain via its identifier. For purposes of data discovery, I need only designate the ARD-code(s) as part of my search filter if I care about only receiving results that satisfy my provenance requirements. The obvious analogy is EPSG. If I care about the CRS and I want to define it as a search filter, I need only use the EPSG code(s) rather than define the entire projection/CRS in my search and in my STAC data records. This is what I meant by a "label for a provenance claim". Back to reality for a minute: it will be quite some time before ARD can do this for us. In the meantime, if the group feels there is value in a "derived_from" tag to support data discovery, I'm OK with it. I just don't want to see us go down the path of defining a whole schema for provenance within STAC. |
I think core is fine, it's not even a new field, it's just a type of rel link. I think in core we need to enumerate a list of possible rel links. It doesn't even need to be a complete list, but something for providers to go by so they don't just make it up. I agree with @hgs-msmith's sentiments overall, but for now could really use a standard rel value for creating links when creating derived items. |
Ah, that's a good point @matthewhanson - that list of possible rel link values was one of the things I was hoping for from the sprint, but focused people more on the asset types. Just made #191 for that. And sounds like we've reached good consensus. Definitely agree with @hgs-msmith's points and they're in line with how I was thinking about it. We put in the minimal amount of 'link' relationship in STAC, for data discovery. And hope for someone to do more robust stuff. And I like the idea of the ARD-code. PR welcome, or I'll try to get to it today... |
See #179. We likely should also create an extension that more fully spells out some of the ideas for this field, as well as additions to make the links more useful. Like explain that it could be used to link to non-STAC metadata, etc. But wanted to get this in so people could start to use it.
I came across this thread late as I'm just looking at STAC for a project we have in gestation. Coming from an ISO background I have to say I find the web first approach very refreshing! |
Thanks for the positive words @6footdestiny! I looked at w3c prov a bit and it looked quite promising. I do have questions about how widely used it is; do you have any insight on that? It seems to be a 'group note' status from 2013, and hasn't advanced to 'recommendation'? We definitely do not want to re-invent a full provenance model, and do not see that in scope for STAC. But we wanted to emphasize that it was important to us, so just added the single field, keeping it as minimal as possible. Ideally we find / collaborate on a good provenance model and are able to just reference its use, and replace derived_from with something from there. What I'd love to see is a STAC extension that shows how to use the w3c prov model in JSON with common geospatial constructs. Indeed a lot of the interest from the STAC group is how to represent programmatic processing (generally on cloud environments) as part of the processing, and to track that provenance. From what I've seen, w3c prov examples seem more about people / scientific provenance, though it does seem flexible enough for machine stuff. But I think there are lots of little quirks of geospatial processing, so some fleshing out of how exactly to use the w3c prov for STAC seems necessary. I imagine before long we'll have a sub-group of interested STAC collaborators take on provenance, and they'll look at all the use cases and the candidates out there, and I definitely think w3c prov is the top of the list. |
@cholmes I don't have any specific intel on the adoption rate of Prov, although I know from colleagues in the UK academic sector who were involved in it that there are still active repo updates for the software implementation (Python) (see https://www.software.ac.uk/who-do-we-work/provenance-tool-suite). |
Background
Many applications wish to be able to track detailed information about the source data and production algorithms which have been used to produce the data in this Item.
We need to balance keeping this simple against providing enough detail to be useful.
Tracking Data Sources
A simple way to track source data would be to include additional links in the existing `links` section of `Items`, using new `rel` attributes, and potentially new `type` attributes.
New `rel` names:
- `derived_from`: This would provide a link to the exact `Item` or set of `Items` that this Item used as input data in its production.
  Extras: a `type` attribute on the link. In HTML this is typically a MIME type, but we could simply use descriptive text. We could also use a different attribute name.
- `software`: A link to a Git repository or other useful link to software used when producing this data.
  Options: `href`.
Example
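As a minimal sketch of the links described above (not the example originally posted with this issue), the following Python builds an Item-like dict carrying `derived_from` and `software` links, then reads them back. The URLs, the item id, and the helper names `make_derived_item`, `source_hrefs`, and `items_derived_from` are all hypothetical, and the dict omits the other fields a real Item would need.

```python
import json

# Hypothetical URLs and ids, used only for illustration.
SOURCE_ITEM = "https://example.com/catalog/landsat/scene-001.json"
SOFTWARE_REPO = "https://example.com/git/surface-reflectance-processor"


def make_derived_item(item_id, source_hrefs, software_href=None):
    """Build a minimal Item-like dict whose links record its inputs."""
    # One derived_from link per source Item that fed into this product.
    links = [{"rel": "derived_from", "href": href} for href in source_hrefs]
    # Optional link to the software that produced the data.
    if software_href is not None:
        links.append({"rel": "software", "href": software_href})
    return {"id": item_id, "links": links}


def source_hrefs(item):
    """Return the href of every derived_from link on an item."""
    return [
        link["href"]
        for link in item.get("links", [])
        if link.get("rel") == "derived_from"
    ]


def items_derived_from(items, source_href):
    """Select the items that list source_href as one of their inputs."""
    return [item for item in items if source_href in source_hrefs(item)]


item = make_derived_item("sr-scene-001", [SOURCE_ITEM], SOFTWARE_REPO)
print(json.dumps(item, indent=2))
```

Because the relationship lives entirely in `rel` and `href`, a crawler could follow `derived_from` links without any change to the Item JSON schema, and a search for everything derived from one scene reduces to a filter like `items_derived_from`.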
Storing detailed information about data production (Optional Extension)
In some cases even more detailed information might be desirable. This is a quick placeholder example, but needs a lot more work before being included as a STAC Extension.