better handling of case where same genome has multiple annotation versions #833

adf-ncgr · 2023-03-02T13:55:41Z

Currently, there's no very good way to support this. If you leave the chromosome ids unchanged, you end up with a chromosome that has all annotations superimposed. When this has occurred in the past, I've just tweaked the chromosome ids associated with one of the annotations, but as we begin to support genomic region linkouts, this becomes more problematic since the tweaked chromosome ids don't really exist outside of GCV. We do need to somehow treat them as distinct internally, but they need to be able to reference the same external entity. It could also get confusing to the user if the GCV labels used on the macrosynteny representation and track labels don't distinguish them.

alancleary · 2023-03-02T15:43:47Z

Perhaps it's time for the data model to be updated so that it no longer requires chromosome names to be unique. This certainly isn't the first time this has been an issue. You've already implemented the usual workaround of prefixing the chromosome names with something to make them unique, but as you noted, this can break interoperability with other tools/services.

What you prefix a chromosome name with to make it unique depends on your use case. Here it's annotation version. In the pangenome it's accession. For people that don't normally do full yuck naming it's species. So probably the thing to do is use internal IDs to keep things unique. This could be paired with support for arbitrary "metadata" and the ability to format what (meta) data a track name shows via the config file.

What do you think @adf-ncgr? I think it's a nice solution in theory but loading these meta data via the GFF loader could be a bit painful.
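The internal-ID-plus-metadata idea in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not GCV's actual data model: the `Chromosome` class, the `track_label` helper, the metadata keys, and the `LABEL_TEMPLATE` setting are all hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical internal ID source; a real service would use its database's IDs.
_next_id = count(1)


@dataclass
class Chromosome:
    name: str  # external name; no longer required to be unique
    metadata: dict = field(default_factory=dict)  # arbitrary key/value metadata
    internal_id: int = field(default_factory=lambda: next(_next_id))


class _Blank(dict):
    """Mapping that renders missing metadata keys as empty strings."""

    def __missing__(self, key):
        return ""


def track_label(chrom, template):
    """Build a track label from a config-style template such as '{name} ({annotation})'."""
    values = _Blank(chrom.metadata, name=chrom.name)
    return template.format_map(values)


# Two annotation versions of the same chromosome share an external name
# but stay distinct internally.
a = Chromosome("Chr01", {"annotation": "ann1"})
b = Chromosome("Chr01", {"annotation": "ann2"})

# Assumed to come from the config file.
LABEL_TEMPLATE = "{name} ({annotation})"
print(track_label(a, LABEL_TEMPLATE))
print(track_label(b, LABEL_TEMPLATE))
```

The external name stays untouched for link-outs, while the label shown to the user is assembled from whatever metadata the site chooses to expose.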

adf-ncgr · 2023-03-02T15:56:38Z

Yes, I think a data model update is the way to go. Most (if not all) of the metadata here is not a property of individual chromosomes or genes but of the genome itself. The gff loader currently gets organism info via commandline options, not the gff per se. But I think we do need uniqueness at some level to support linking-in scenarios, especially for sets of objects.

alancleary · 2023-03-02T16:35:55Z

> The gff loader currently gets organism info via commandline options, not the gff per se.

Right. We would probably need to do something more flexible for arbitrary metadata, like loading a CSV. Not sure if it makes sense to parse these data from that file as well.

> I think we do need uniqueness at some level to support linking-in scenarios, especially for sets of objects.

Good point! I was thinking of things from the link-out perspective. So, metadata or not, there needs to be a UI-facing way to uniquely identify chromosomes. I guess we could add an accession field and require that the combination of species+accession+chromosome be unique, but this seems like we're just pushing the current workaround into its own designated field...
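A minimal sketch of the two ideas in this comment, loading metadata from a CSV and requiring species+accession+chromosome to be unique, might look like the following. The column names, the `load_metadata` function, and the example rows are assumptions made up for illustration; nothing here reflects an existing GCV loader.

```python
import csv
import io

# Assumed column names forming the composite key.
KEY_FIELDS = ("species", "accession", "chromosome")


def load_metadata(handle):
    """Read chromosome metadata rows, rejecting duplicate composite keys."""
    seen = {}
    for row in csv.DictReader(handle):
        key = tuple(row[f] for f in KEY_FIELDS)
        if key in seen:
            raise ValueError(f"duplicate chromosome key: {key}")
        seen[key] = row
    return seen


# Made-up example data: same species and chromosome name, different accessions.
EXAMPLE = """species,accession,chromosome,annotation
speciesA,acc1,Chr01,ann1
speciesA,acc2,Chr01,ann1
"""

records = load_metadata(io.StringIO(EXAMPLE))
print(sorted(records))
```

Note that a second annotation version of `speciesA`/`acc1`/`Chr01` would be rejected by this key, so the annotation version would have to be folded into the accession field, which is the concern raised above.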

adf-ncgr · 2023-03-02T17:43:16Z

I think we could imagine that a given "gene build" has an identifier that serves as a kind of container for chromosomes and genes, whose names are unique within the container and unique globally when the container id is considered as their namespace. It's probably not all that different from the way we specify "full yuck" in LIS, except for the fact that chromosomes would have identities relative to a "gene build".
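One way to picture the "gene build" as a namespace is sketched below. The `GeneBuild` and `ChromosomeRef` names and the build identifiers are hypothetical, chosen only to illustrate names that are unique within a container and globally unique once the container id is included.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class ChromosomeRef(NamedTuple):
    """Globally unique reference: the build id acts as the namespace."""

    build_id: str
    name: str  # external chromosome name, shared across builds


@dataclass
class GeneBuild:
    build_id: str  # hypothetical identifier for one annotation effort (one GFF)
    chromosomes: set = field(default_factory=set)

    def add(self, name):
        """Register a chromosome; names must be unique within this build."""
        if name in self.chromosomes:
            raise ValueError(f"{name!r} already exists in {self.build_id!r}")
        self.chromosomes.add(name)
        return ChromosomeRef(self.build_id, name)


# Two annotation versions of the same genome (made-up ids).
build1 = GeneBuild("gnm1.ann1")
build2 = GeneBuild("gnm1.ann2")

ref1 = build1.add("Chr01")
ref2 = build2.add("Chr01")

print(ref1 != ref2)            # distinct inside the application
print(ref1.name == ref2.name)  # same external entity for link-outs
```

Keeping the reference as a (build id, name) pair rather than a concatenated string means the external chromosome name never has to be altered or parsed back out.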

alancleary · 2023-03-02T19:05:10Z

Forgive my naivete; is the "gene build" something that could be easily gleaned from the GFF/GFA files?

adf-ncgr · 2023-03-02T20:11:30Z

No penance needed, I am just using the term to refer to the result of an annotation effort, so basically it is the GFF file.

alancleary · 2023-03-02T20:29:02Z

Hmm. I definitely see the uniqueness there but I'm not sure that's the most convenient/logical option for every use case. Will have to ponder it some more.

adf-ncgr · 2023-03-02T20:37:08Z

Maybe we should have a real-time discussion before you ponder too hard; I may not be describing what I'm imagining very well.

alancleary mentioned this issue Aug 17, 2023

GCV 3 #845

Open

20 tasks

alancleary added the enhancement label Aug 17, 2023
