
Tracking Item Provenance #179

Closed
omad opened this issue Aug 15, 2018 · 15 comments
Labels: minor (a relatively small change to the spec), prio: must-have (required for release)
Milestone: 0.6.0-RC1

Comments

@omad

omad commented Aug 15, 2018

Background

Many applications want to be able to track detailed information about the source data and production algorithms that were used to produce the data in an Item.

We need to balance keeping this simple against including enough detail to be useful.

Tracking Data Sources

A simple way to track source data would be to include additional links in the existing links section of Items, using new rel attributes, and potentially new type attributes.

New rel names:

  • derived_from
    This would provide a link to the exact Item or set of Items that were used as input data in its production.

    Extras:

    • It can be useful to indicate a bit more information about what the source item is, or how it was used. One way of doing this would be to add a type attribute on the link. In HTML this is typically a MIME type, but we could simply use descriptive text. We could also use a different attribute name.
    • Do we want to provide hints about whether we're linking to STAC Items (ideal!) or some other type of resource?
  • software
    A link to a Git repository, or another useful link to the software used to produce this data.

    Options:

    • Do we need a way to represent a specific version of the software? This could be extra attributes on the link, or simply a more specific href (see the sketch below).
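
      A rough sketch of those two options, an extra attribute versus a pinned href (the "version" attribute name here is hypothetical, not part of any spec):

      { "rel": "software", "href": "https://github.com/GeoscienceAustralia/fc",
        "version": "v0.6.2" }

      { "rel": "software", "href": "https://github.com/GeoscienceAustralia/fc/tree/v0.6.2" }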

Example

{
    "links": [
        { "rel": "self", "href": "https://www.fsa.usda.gov/my-real-home/naip/30087/m_3008718_sw_16_1_20130805.json" },
        { "rel": "root", "href": "https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/catalog.json" },

        { "rel": "derived_from", "href": "https://ga.gov.au/sr/surface-prod-askj2ka/item.json",
          "type": "surface-reflectance" },
        { "rel": "derived_from", "href": "https://ga.gov.au/sr/surface-prod-KKaslkn2/item.json",
          "type": "dsm" },
        { "rel": "software", "href": "https://github.com/GeoscienceAustralia/fc",
          "sha": "v0.6.2" }
    ]
}

Storing detailed information about data production (Optional Extension)

In some cases even more detailed information might be desirable. This is a quick placeholder example; it needs a lot more work before being included as a STAC Extension.

{
    "production_information": {
        "software": {
            "href": "https://github.com/GeoscienceAustralia/fc",
            "sha": "v0.6.2"
        },
        "environment": {
            "machine_uname": "Linux r1561 3.10.0-693.17.1.el6.x86_64",
            "hostname": "r1561",
            "pythonversion": "3.6.5 | (default, Mar 29 2018, 23:19:37) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]"
        }
    }
}
@m-mohr
Collaborator

m-mohr commented Aug 15, 2018

The dataset group was talking about that, too. We have not really thought it out, but were thinking of something like this:

Process graph extension (pg) - Items and Datasets

Element     | Type   | Name                   | Description
----------- | ------ | ---------------------- | -----------
description | string | Description (required) | Detailed multi-line description to fully explain the processing step. CommonMark 0.28 syntax MAY be used for rich text representation.
chain       | object | Process chain          | TODO: Could be similar to the openEO process graph.
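
As a purely illustrative sketch (the field names and the shape of the chain object are invented here, not an agreed schema), such an object on an Item might look like:

{
    "pg:description": "Fractional cover derived from Landsat surface reflectance.",
    "pg:chain": {
        "process_id": "fractional_cover",
        "arguments": {
            "source": { "href": "https://ga.gov.au/sr/surface-prod-askj2ka/item.json" }
        }
    }
}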

@matthewhanson
Collaborator

@omad This is great, I have recently been creating derived items and was doing something similar: adding in links to the items that were used.

I'm not sure "type" is really all that necessary in the links though; the record that is linked to should provide that information, or at least "type" shouldn't be used in this way. I think "name" or "id" could be used to distinguish between the links, but "type" implies a specific type category, and these haven't been defined to include things like "surface-reflectance".

@hgs-msmith
Contributor

I'm a strong advocate for standardizing provenance and expressing it as metadata, but I'm not convinced it belongs in STAC. By design, STAC should be limited to the subset of metadata that services data discovery. It seems unlikely that most clients will need to make provenance (at least detailed provenance) part of the discovery process.

I suggest that this is within the purview of the ARD group. Perhaps they will define a detailed provenance model, and summary labels/levels of provenance claims. Such a provenance label would belong in STAC, but not the entire provenance description.

@simonff

simonff commented Aug 20, 2018

I would like to use the processing chain/provenance standards, once they exist, within the Earth Engine catalog - not just for obvious provenance-displaying purposes (at both dataset and item levels), but also for error representation. Specifically, when a final EE asset is missing due to an error at one of the previous steps that I can track, I'd like to display it to the users (and eventually build some tools for summarizing/fixing the issues). Not sure whether this fits into ARD, STAC, or both - the upshot is that I would like to have a detailed provenance standard for operational purposes.

@cholmes
Contributor

cholmes commented Aug 20, 2018

I agree that in an ideal world someone else standardizes provenance, and we just incorporate it. Especially a detailed provenance model.

But I do feel it's so important to get right as we move to the cloud that I'd like to see us at least have a couple of tags that point the right way. And I think the ARD stuff is moving more slowly than we need it to.

I do agree with the measure of what's in STAC as 'is it part of the discovery process'. But I think being able to do a search for images derived_from a particular Landsat scene is incredibly valuable, and something that has been missing in the geospatial world.

I also think the 'links' section of STAC should be very welcoming, as good links will enable future crawlers to be a lot more effective.

So I do think we should put the data production information in its own repo - draw that line to not be STAC (though happy if people from here collaborate on it, and it's a set of work I very much want to see). But I think the 'derived_from' STAC extension that describes the use of that link makes good sense. And I think I'd include the 'software' link in there too.

@simonff and @omad - I can start a repo for you under RadiantEarth github if you want a neutral place to collaborate on processing chain stuff.

@matthewhanson
Collaborator

I like the idea of a processing chain extension, but in the meantime I think a simple accepted way of defining parents (i.e. derived_from) in the links is a good way to start. Just having the ability to create derived products and easily follow links to the items that were used to generate them is extremely useful.

@m-mohr
Collaborator

m-mohr commented Aug 20, 2018

I also would like to have derived_from added to the spec. Depending on the level of detail this could be an extension/profile or not. Just linking to derived datasets could probably just be in the core, but if we add things like software etc. it sounds more suitable to make it an extension/profile.

The process chain extension may be inspired by openEO as we already have some specification on how to describe process chains in JSON in general.

Other than that, I think it's a good idea to get the provenance out of the STAC repo. It's just such an enormous effort that it should be spec'ed separately and eventually get adopted by STAC.

@simonff

simonff commented Aug 21, 2018

I support all of the above - any steps we make toward improving provenance are great.

@cholmes
Contributor

cholmes commented Aug 21, 2018

Cool. One field added sounds good. I'm inclined to just put it in 'core', especially since it's not required and it's no change to the JSON schema, etc. It's just a particular type of link that we mention as being useful.

@hgs-msmith - does that sound ok to you? Or you want it in an extension? Or feel it shouldn't be in at all?

Anyone up to make a PR? Would also be great if we can get a catalog that shows it, even just a mini-catalog sample. @omad - can you share your catalogs? Would be psyched to add one or two to our implementation list whenever you're ready, and hopefully they show this. Would be great for stac-browser to also display that link. cc @mojodna

@cholmes cholmes added this to the 0.6.0 milestone Aug 21, 2018
@hgs-msmith
Contributor

Elaborating on my earlier comment...

Imagine for a moment that an ARD standard exists, and that as a result ARD defines many standard processing chains and that each has a unique identifier (let's call it an ARD-code). Because it's a standard, everyone knows (or can easily look up) the detailed definition of the processing chain via its identifier. For purposes of data discovery, I need only designate the ARD-code(s) as part of my search filter if I care about only receiving results that satisfy my provenance requirements.

The obvious analogy is EPSG. If I care about the CRS and I want to define it as a search filter, I need only use the EPSG code(s) rather than define the entire projection/CRS in my search and in my STAC data records.

This is what I meant by a "label for a provenance claim".

Back to reality for a minute - it will be quite some time before ARD can do this for us. In the meantime, if the group feels there is value in a "derived_from" tag to support data discovery, I'm OK with it. I just don't want to see us go down the path of defining a whole schema for provenance within STAC.
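
For illustration only (the property name and code value below are invented, not part of any spec), such a label could be a single searchable property on an Item, analogous to an EPSG code:

{
    "properties": {
        "ard:code": "ARD-SR-0001"
    }
}

A search filter would then only need to match on that code, rather than carry (or parse) the full provenance description.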

@matthewhanson
Collaborator

I think core is fine; it's not even a new field, it's just a type of rel link. I think in core we need to enumerate a list of possible rel values. It doesn't even need to be a complete list, just something for providers to go by so they don't make them up.

I agree with @hgs-msmith's sentiments overall, but for now could really use a standard rel value for creating links when creating derived items.

@cholmes
Contributor

cholmes commented Aug 23, 2018

Ah, that's a good point @matthewhanson - that list of possible rel link values was one of the things I was hoping for from the sprint, but we focused people more on the asset types. Just made #191 for that.

And sounds like we've reached good consensus. Definitely agree with @hgs-msmith's points and they're in line with how I was thinking about it. We put in the minimal amount of 'link' relationship in STAC, for data discovery. And hope for someone to do more robust stuff. And I like the idea of the ARD-code.

PR welcome, or I'll try to get to it today...

@cholmes cholmes removed this from the 0.6.0 milestone Aug 23, 2018
@cholmes cholmes added the prio: must-have (required for release) and minor (a relatively small change to the spec) labels Aug 23, 2018
@cholmes cholmes added this to the 0.6.0-RC1 milestone Aug 24, 2018
cholmes added a commit that referenced this issue Oct 2, 2018
See #179 

We likely should also create an extension that more fully spells out some of the ideas for this field, as well as additions to make the links more useful. Like explain that it could be used to link to non-STAC metadata, etc. But wanted to get this in so people could start to use it.
@cholmes cholmes closed this as completed Oct 8, 2018
@6footdestiny

I came across this thread late as I'm just looking at STAC for a project we have in gestation. Coming from an ISO background, I have to say I find the web-first approach very refreshing!
I had a specific question though about provenance modelling (and I fully accept the distinction between discovery and use metadata, and the polluting effect of the latter on the former).
Anyhow, has anyone in the STAC community considered the W3C PROV model for provenance capture? It's done a lot of the hard work and there's a JSON encoding to boot.
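
For reference, a minimal PROV-JSON sketch of the derived_from relationship discussed above (the prefix and entity names are invented here; a real mapping for STAC Items would need much more thought):

{
    "prefix": { "ex": "https://ga.gov.au/sr/" },
    "entity": {
        "ex:fractional-cover-item": {},
        "ex:surface-prod-askj2ka": {}
    },
    "wasDerivedFrom": {
        "_:d1": {
            "prov:generatedEntity": "ex:fractional-cover-item",
            "prov:usedEntity": "ex:surface-prod-askj2ka"
        }
    }
}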

@cholmes
Contributor

cholmes commented Nov 23, 2018

Thanks for the positive words @6footdestiny!

I looked at W3C PROV a bit and it looked quite promising. I do have questions about how widely used it is - do you have any insight on that? It seems to have 'group note' status from 2013, and hasn't advanced to 'recommendation'?

We definitely do not want to re-invent a full provenance model, and do not see that in scope for STAC. But we wanted to emphasize that it was important to us, so we just added the single field, keeping it as minimal as possible. Ideally we find / collaborate on a good provenance model and are able to just reference its use, and replace derived_from with something from there.

What I'd love to see is a STAC extension that shows how to use the W3C PROV model in JSON with common geospatial constructs. Indeed, a lot of the interest from the STAC group is in how to represent programmatic processing (generally in cloud environments) and to track that provenance. From what I've seen, W3C PROV examples seem more about people / scientific provenance, though it does seem flexible enough for machine stuff. But I think there are lots of little quirks of geospatial processing, so some fleshing out of how exactly to use W3C PROV for STAC seems necessary.

I imagine before long we'll have a sub-group of interested STAC collaborators take on provenance, and they'll look at all the use cases and the candidates out there, and I definitely think W3C PROV is at the top of the list.

@6footdestiny

@cholmes I don't have any specific intel on the adoption rate of PROV, although I know from colleagues in the UK academic sector who were involved in it that there are still active repo updates for the software implementations (Python) (see https://www.software.ac.uk/who-do-we-work/provenance-tool-suite).
The PROV model was designed with scientific workflows in mind, albeit at an abstract level to permit wider application across domains and use cases, so I think it's rich enough to support the STAC use case. The only alternative I'm aware of is the ISO 19100-series lineage model, which is perhaps similarly broad, and whilst it hails from the geo domain I don't feel (IMO) that it's that widely adopted. For my sins I sit on the ISO 19165 drafting committee, which is aimed at long-term preservation of geo assets. It's heavily modelled around the full ISO standards stack but is being driven by an EO use case (at least 19165-2, which is an extension of the core 19165 model, is).
So, basically you have W3C or ISO as possibles, or some other 'lightweight' new alternative (in many respects STAC is a new-wave approach to 19115 and the corresponding CSW preoccupations that date almost to pre-web origins and have significant traction in things like INSPIRE (Europe) and GCMD/NAP in the US). Personally, I think the XML-centric and dense formalisms of ISO have proven somewhat counterintuitive to a web-first philosophy and are now suffering from some kickback - hence the Spatial Data on the Web charter. I do concur that reinvention is to be avoided. PROV might be the answer as it has JSON flavours, but ultimately it will fall to how many folks actually adopt and use it voluntarily. ISO has been a struggle in the UK, and it's only legislation that effectively mandates its use that has forced any headway... and even then to most folks it's overkill.
Anyway, happy to contribute. I'm involved in a project that will need to adopt something, so I'm interested to see what evolves...
