Ensure that GPAD (2.0) can be used as a primary input into the pipeline #1443
Comments
Datasets Yaml proposal:
One question I have, though: which datasets are we considering to be the "correct" canonical datasets for a group? Now that we are expecting multiple types of files (gaf or gpad), we need a way of deciding which one is more canonical. For example, in the
We could just go totally one to one: the goa_chicken.gaf gets processed into our canonical goa_chicken.gaf, and the goa_chicken.gpad/gpi gets processed into our goa_chicken.gpad/gpi. By contrast, our current process takes GAF and then produces all types from the one file, so we have a one-to-many mapping. We could also indicate in the yaml file which datasets are the blessed ones to use. Like:
This allows the providers to say that this gaf source should be used, while that gpad source should be used. Or perhaps there should be a flag in the datasets stanzas? Ultimately, we need a way to map to our outputs. And having 3 sources that have
@dougli1sqrd I think for deciding what is canonical, that can be left to policy: I can think of no reason to ever have redundant datasets coming from a resource and we should work to prevent that.
Ah okay, so we should have goa decide either goa_human gaf or goa_human gpad+gpi, and then remove the redundant one? I wonder if we can change the schema to allow only unique
GPI can be there either way; the question is whether the GAF or the GPAD(+GPI) should be the "primary" data source, assuming they aren't both different things. The assumption should be that, unless otherwise marked, all data is processed through the pipeline. I would either comment out or add a tag for "inactive" (don't we have this?) to track things that should not be processed normally.
Yeah, we have a key for Yeah, if we could devise a schema that detected this, that would be lovely. Otherwise we'll need validation logic. If ontobio detects that there is more than one source type for a
Is the issue that there is a single identifier for a dataset, so they collide? Either way, even if it cannot be encoded in a schema checker, it can be enforced; see https://github.com/geneontology/go-site/blob/master/scripts/sanity-check-users-and-groups.py
Essentially yes. The As for enforcing scripts, perfect!
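A minimal sketch of the kind of validation logic discussed above, for illustration only: it flags any dataset that has more than one active primary source type (GAF and GPAD). The entry keys `dataset`, `type`, and `status`, the `PRIMARY_TYPES` set, and the function name are assumptions made for this sketch, not the actual go-site schema or ontobio API; entries are plain dicts so the example does not depend on a YAML parser.

```python
from collections import defaultdict

# Source types treated as "primary" annotation inputs (assumption for this sketch).
PRIMARY_TYPES = {"gaf", "gpad"}


def find_conflicting_datasets(datasets):
    """Return {dataset name: sorted primary types} for every dataset that
    has more than one active primary source type."""
    seen = defaultdict(set)
    for entry in datasets:
        # Entries marked inactive are not processed, so they cannot collide.
        if entry.get("status", "active") != "active":
            continue
        if entry["type"] in PRIMARY_TYPES:
            seen[entry["dataset"]].add(entry["type"])
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


# Hypothetical stanzas, shaped like entries that might appear in a datasets yaml.
example = [
    {"id": "goa_chicken.gaf", "dataset": "goa_chicken", "type": "gaf", "status": "active"},
    {"id": "goa_chicken.gpad", "dataset": "goa_chicken", "type": "gpad", "status": "active"},
    {"id": "goa_chicken.gpi", "dataset": "goa_chicken", "type": "gpi", "status": "active"},
    {"id": "goa_human.gaf", "dataset": "goa_human", "type": "gaf", "status": "active"},
    {"id": "goa_human.gpad", "dataset": "goa_human", "type": "gpad", "status": "inactive"},
]

conflicts = find_conflicting_datasets(example)
print(conflicts)  # {'goa_chicken': ['gaf', 'gpad']}
```

In this sketch the GPI entry never counts as a conflict, and marking one of the two primary sources as inactive is enough to resolve the collision, which matches the "policy plus a sanity-check script" approach suggested above.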
I would propose then that this is an issue of policy:
Do we have cases where there is more than one file?
Yes, which is what started this thread; see @dougli1sqrd's example at the top.
For specifically enabling gpad 2.0, that tracking will be done here: #1453
Things that will have to happen in ontobio:
Bringing in the context of issue #1384 (#1384 (comment)): we would like to move the Paint, etc. mixin process to after we have downloaded and validated all files. However, as we see in my comment in #1384:
Additional notes on this: I foresee a large testing period after the first "working" PR gets done to ensure that this will pan out the way we would like.
@kltm is this still an outstanding problem?
@suzialeksander This is still "open". Our recent work has been on outputs, not inputs.
As we move forward, with the active Noctua imports being the deadline, the pipeline needs to be able to process GPAD 2.0 as a primary input component (GAF being the only other primary data component at this time).
Tied to this, to actually make use of GPAD elsewhere in the pipeline (e.g. produce GAF for AmiGO to consume), we'll also need to be making use of GPI 2.0 files.
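To illustrate why GPI is needed alongside GPAD: a GPAD row identifies the annotated entity only by its ID, while a GAF row also carries entity details such as symbol, type, and taxon. The sketch below joins the two on the entity ID. It is a simplified illustration, not ontobio's implementation; the dict field names, the function name, and the `EX:` identifiers are all hypothetical, and real GPAD/GPI parsing is left out.

```python
def gpad_to_gaf_like(gpad_rows, gpi_rows):
    """Join GPAD-style annotation rows with GPI-style entity rows on the
    entity ID. Returns (merged rows, IDs that had no GPI entry)."""
    gpi_by_id = {row["id"]: row for row in gpi_rows}
    merged, missing = [], []
    for ann in gpad_rows:
        entity = gpi_by_id.get(ann["id"])
        if entity is None:
            # Without a GPI entry there is no symbol/type/taxon to emit.
            missing.append(ann["id"])
            continue
        merged.append({
            **ann,
            "symbol": entity["symbol"],
            "type": entity["type"],
            "taxon": entity["taxon"],
        })
    return merged, missing


# Hypothetical rows, already parsed into dicts.
gpad_rows = [
    {"id": "EX:0001", "term": "GO:0005515", "reference": "PMID:1"},
    {"id": "EX:0002", "term": "GO:0005515", "reference": "PMID:2"},
]
gpi_rows = [
    {"id": "EX:0001", "symbol": "abc1", "type": "protein", "taxon": "NCBITaxon:9031"},
]

merged, missing = gpad_to_gaf_like(gpad_rows, gpi_rows)
print(merged[0]["symbol"])  # abc1
print(missing)              # ['EX:0002']
```

Reporting the unmatched IDs separately, rather than silently dropping them, is one possible design choice; it would let the pipeline surface GPAD annotations whose entity is absent from the accompanying GPI.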
Tagging @dougli1sqrd