better handling of case where same genome has multiple annotation versions #833
Comments
Perhaps it's time for the data model to be updated to not require that chromosome names be unique. This certainly isn't the first time this has been an issue. You've already implemented the usual workaround of prefixing the chromosome names with something to make them unique, but as you noted this can break interoperability with other tools/services. What you prefix a chromosome name with to make it unique depends on your use case. Here it's annotation version. In the pangenome it's accession. For people that don't normally do full yuck naming it's species. So probably the thing to do is use internal IDs to keep things unique. This could be paired with support for arbitrary "metadata" and the ability to format what metadata a track name shows via the config file. What do you think @adf-ncgr? I think it's a nice solution in theory, but loading this metadata via the GFF loader could be a bit painful.
Yes, I think a data model update is the way to go. Most (if not all) of the metadata here is not a property of individual chromosomes or genes but of the genome itself. The GFF loader currently gets organism info via command-line options, not from the GFF per se. But I think we do need uniqueness at some level to support linking-in scenarios, especially for sets of objects.
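To make the internal-ID idea concrete, here is a minimal sketch, not GCV's actual data model. The names `make_track` and `format_label`, the metadata keys, and the template syntax are all hypothetical; the values `glyma`, `ann1`, and `ann2` are placeholders. It shows two tracks sharing one chromosome name while staying distinct internally, with the display label driven by a config-style template.

```python
import itertools

# Hypothetical internal ID source; a real loader would more likely use database keys.
_ids = itertools.count(1)

def make_track(name, **metadata):
    """Create a track record whose internal ID is unique regardless of its name."""
    return {"id": next(_ids), "name": name, "metadata": metadata}

def format_label(track, template):
    """Render a display label from a template, e.g. one read from a config file."""
    fields = {"name": track["name"], **track["metadata"]}
    return template.format(**fields)

# Two annotation versions of the same genome keep the same chromosome name.
a = make_track("Chr01", species="glyma", annotation="ann1")
b = make_track("Chr01", species="glyma", annotation="ann2")

print(a["id"] != b["id"])                      # internal IDs differ
print(format_label(a, "{annotation}.{name}"))  # label chosen by the template
print(format_label(b, "{annotation}.{name}"))
```

Under this sketch the stored name stays usable for link-outs, and only the label template changes per use case (annotation version, accession, or species).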
Right. We would probably need to do something more flexible for arbitrary metadata, like loading a CSV. Not sure if it makes sense to parse these data from that file as well.
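A rough sketch of what loading arbitrary metadata from a CSV could look like, using only Python's standard `csv` module. The column names (`genome`, `species`, `annotation`) and the function `load_metadata` are assumptions for illustration, not an existing loader option.

```python
import csv
import io

def load_metadata(handle, key_column="genome"):
    """Read a CSV into {key: {column: value}}; every other column is free-form metadata."""
    table = {}
    for row in csv.DictReader(handle):
        key = row.pop(key_column)
        table[key] = row
    return table

# Hypothetical file contents; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "genome,species,annotation\n"
    "g1,glyma,ann1\n"
    "g2,glyma,ann2\n"
)
metadata = load_metadata(sample)
print(metadata["g2"]["annotation"])
```

Because the columns are not fixed, new metadata fields can be added to the file without changing the loader.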
Good point! I was thinking of things from the link-out perspective. So, metadata or not, there needs to be a UI-facing way to uniquely identify chromosomes. I guess we could add an accession field and require that the combination of species+accession+chromosome be unique, but this seems like we're just pushing the current workaround into its own designated field...
I think we could imagine that a given "gene build" has an identifier that serves as a kind of container for chromosomes and genes, whose names are unique within the container and unique globally when the container ID is considered as their namespace. It's probably not all that different from the way we specify "full yuck" in LIS, except for the fact that chromosomes would have identities relative to a "gene build".
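The container idea might be sketched roughly as follows. This is only an illustration under assumed names (`ChromosomeKey`, `GeneBuildRegistry`, and the build IDs are hypothetical): a chromosome is keyed by the pair (gene build, name), so names only need to be unique within a build, and the unprefixed name remains available for referencing the same external entity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChromosomeKey:
    build_id: str  # hypothetical ID for one annotation effort (roughly one GFF file)
    name: str      # chromosome name as known to external tools

class GeneBuildRegistry:
    def __init__(self):
        self._lengths = {}

    def add(self, build_id, name, length):
        """Register a chromosome; names must be unique only within their gene build."""
        key = ChromosomeKey(build_id, name)
        if key in self._lengths:
            raise ValueError(f"{name!r} already exists in gene build {build_id!r}")
        self._lengths[key] = length
        return key

    def builds_for(self, name):
        """List the gene builds that annotate the same external chromosome name."""
        return sorted(k.build_id for k in self._lengths if k.name == name)

registry = GeneBuildRegistry()
k1 = registry.add("build_v1", "Chr01", 1000)
k2 = registry.add("build_v2", "Chr01", 1000)  # same name, different namespace
print(k1 != k2, registry.builds_for("Chr01"))
```

Whether the build ID is the right namespace for every use case is exactly the open question in this thread; the sketch only shows that the two annotations stay distinct without renaming the chromosome.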
Forgive my naivete; is the "gene build" something that could be easily gleaned from the GFF/GFA files?
No penance needed. I am just using the term to refer to the result of an annotation effort, so basically it is the GFF file.
Hmm. I definitely see the uniqueness there, but I'm not sure that's the most convenient/logical option for every use case. Will have to ponder it some more.
Maybe we should have a real-time discussion before you ponder too hard; I may not be describing what I'm imagining very well.
Currently, there's no very good way to support this. If you leave the chromosome IDs unchanged, you end up with a single chromosome on which all the annotations are superimposed. When this has occurred in the past, I've just tweaked the chromosome IDs associated with one of the annotations, but as we begin to support genomic region linkouts, this becomes more problematic since the tweaked chromosome IDs don't really exist outside of GCV. We do need to somehow treat them as distinct internally, but they need to be able to reference the same external entity. It could also get confusing to the user if the GCV labels used on the macrosynteny representation and the track labels don't distinguish them.