better handling of case where same genome has multiple annotation versions #833

Open
adf-ncgr opened this issue Mar 2, 2023 · 8 comments


@adf-ncgr
Contributor

adf-ncgr commented Mar 2, 2023

Currently, there's no very good way to support this. If you leave the chromosome ids unchanged, you end up with a chromosome that has all the annotations superimposed. When this has occurred in the past, I've just tweaked the chromosome ids associated with one of the annotations, but as we begin to support genomic region linkouts, this becomes more problematic since the tweaked chromosome ids don't really exist outside of GCV. We do need to somehow treat them as distinct internally, but they need to be able to reference the same external entity. It could also get confusing to the user if the GCV labels used on the macrosynteny representation and the track labels don't distinguish them.

@alancleary
Contributor

Perhaps it's time for the data model to be updated so that chromosome names are not required to be unique. This certainly isn't the first time this has been an issue. You've already implemented the usual workaround of prefixing the chromosome names with something to make them unique, but as you noted this can break interoperability with other tools/services.

What you prefix a chromosome name with to make it unique depends on your use case. Here it's annotation version. In the pangenome it's accession. For people that don't normally do full yuck naming it's species. So probably the thing to do is use internal IDs to keep things unique. This could be paired with support for arbitrary "metadata" and the ability to configure, via the config file, which (meta)data a track name shows.

What do you think @adf-ncgr? I think it's a nice solution in theory but loading these meta data via the GFF loader could be a bit painful.
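To make the internal-ID idea above concrete, here is a minimal sketch in Python. It is not GCV's actual data model: the `Chromosome` class, the `track_label` helper, the `LABEL_TEMPLATE` value, and the metadata keys are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical internal ID generator; a real database would assign these.
_next_id = count(1)


@dataclass
class Chromosome:
    name: str  # external name, no longer required to be unique
    metadata: dict = field(default_factory=dict)  # arbitrary key/value pairs
    internal_id: int = field(default_factory=lambda: next(_next_id))


def track_label(chromosome, template):
    """Build a display label from the name and metadata using a template."""
    return template.format(name=chromosome.name, **chromosome.metadata)


# Two annotation versions of the same genome share the external name.
a = Chromosome("Chr01", {"annotation": "ann1"})
b = Chromosome("Chr01", {"annotation": "ann2"})

# Hypothetical config value controlling which metadata a track name shows.
LABEL_TEMPLATE = "{name} ({annotation})"

print(track_label(a, LABEL_TEMPLATE))
print(track_label(b, LABEL_TEMPLATE))
```

The external name stays untouched for link-outs, while the internal ID keeps the two records distinct and the template decides how they are told apart in the UI.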

@adf-ncgr
Contributor Author

adf-ncgr commented Mar 2, 2023

Yes, I think a data model update is the way to go. Most (if not all) of the metadata here is not a property of individual chromosomes or genes but of the genome itself. The gff loader currently gets organism info via commandline options, not the gff per se. But I think we do need uniqueness at some level to support linking-in scenarios, especially for sets of objects.

@alancleary
Contributor

> The gff loader currently gets organism info via commandline options, not the gff per se.

Right. We would probably need to do something more flexible for arbitrary metadata, like loading a CSV. Not sure if it makes sense to parse these data from that file as well.
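As a rough illustration of the CSV idea, the sketch below reads arbitrary metadata keyed by genome using Python's standard `csv` module. The column names (`genome`, `species`, `annotation`) and the `load_genome_metadata` function are assumptions made up for this example; the existing GFF loader has no such option as far as this thread states.

```python
import csv
import io


def load_genome_metadata(handle, key_column="genome"):
    """Map each value of key_column to a dict of the row's other columns."""
    metadata = {}
    for row in csv.DictReader(handle):
        key = row.pop(key_column)  # remove the key so only metadata remains
        metadata[key] = dict(row)
    return metadata


# Hypothetical file contents; any extra columns become metadata fields.
EXAMPLE_CSV = "genome,species,annotation\ng1,sp1,ann1\ng2,sp1,ann2\n"

metadata = load_genome_metadata(io.StringIO(EXAMPLE_CSV))
print(metadata["g2"]["annotation"])
```

Because the reader takes whatever columns the header declares, new metadata fields would not need new commandline options.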

> I think we do need uniqueness at some level to support linking-in scenarios, especially for sets of objects.

Good point! I was thinking of things from the link-out perspective. So, metadata or not, there needs to be a UI-facing way to uniquely identify chromosomes. I guess we could add an accession field and require that the combination of species+accession+chromosome be unique, but this seems like we're just pushing the current workaround into its own designated field...
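The species+accession+chromosome constraint could be checked along these lines. This is only a sketch with made-up field values and a hypothetical `duplicate_keys` helper, not a description of how GCV validates data.

```python
from collections import Counter


def duplicate_keys(records):
    """Return the (species, accession, chromosome) keys seen more than once."""
    keys = Counter(
        (r["species"], r["accession"], r["chromosome"]) for r in records
    )
    return sorted(key for key, n in keys.items() if n > 1)


# Same genome, two annotation versions, accession left unchanged.
clashing = [
    {"species": "sp1", "accession": "acc1", "chromosome": "Chr01"},
    {"species": "sp1", "accession": "acc1", "chromosome": "Chr01"},
]

# Uniqueness only holds once the annotation version is folded into accession.
distinct = [
    {"species": "sp1", "accession": "acc1.ann1", "chromosome": "Chr01"},
    {"species": "sp1", "accession": "acc1.ann2", "chromosome": "Chr01"},
]

print(duplicate_keys(clashing))
print(duplicate_keys(distinct))
```

The second data set shows the concern raised above: the annotation version has to be worked into the accession value, which is the prefix workaround moved into a different field.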

@adf-ncgr
Contributor Author

adf-ncgr commented Mar 2, 2023

I think we could imagine that a given "gene build" has an identifier that serves as a kind of container for chromosomes and genes, whose names are unique within the container and unique globally when the container id is considered as their namespace. It's probably not all that different from the way we specify "full yuck" in LIS, except for the fact that chromosomes would have identities relative to a "gene build".
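One way to picture the container idea is the small Python sketch below. The `GeneBuild` class and the build ids are hypothetical; the point is only that a name is checked for uniqueness inside its build, and the pair (build id, name) is what is unique globally.

```python
class GeneBuild:
    """A namespace for the chromosomes produced by one annotation effort."""

    def __init__(self, build_id):
        self.build_id = build_id
        self._chromosomes = set()

    def add_chromosome(self, name):
        """Register a name and return its globally unique (build_id, name)."""
        if name in self._chromosomes:
            raise ValueError(f"{name!r} already exists in {self.build_id!r}")
        self._chromosomes.add(name)
        return (self.build_id, name)


# Two gene builds over the same genome reuse the same chromosome name.
build1 = GeneBuild("build.ann1")
build2 = GeneBuild("build.ann2")

id1 = build1.add_chromosome("Chr01")
id2 = build2.add_chromosome("Chr01")

print(id1, id2)
```

The second element of each pair is still the unmodified chromosome name, so it can reference the same external entity in link-outs.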

@alancleary
Contributor

Forgive my naivete; is the "gene build" something that could be easily gleaned from the GFF/GFA files?

@adf-ncgr
Contributor Author

adf-ncgr commented Mar 2, 2023

No penance needed, I am just using the term to refer to the result of an annotation effort, so basically it is the GFF file.

@alancleary
Contributor

Hmm. I definitely see the uniqueness there but I'm not sure that's the most convenient/logical option for every use case. Will have to ponder it some more.

@adf-ncgr
Contributor Author

adf-ncgr commented Mar 2, 2023

Maybe we should have a real-time discussion before you ponder too hard; I may not be describing what I'm imagining very well.

@alancleary alancleary mentioned this issue Aug 17, 2023
20 tasks