-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add region input queries #97
base: main
Are you sure you want to change the base?
Conversation
@@ -16,6 +16,8 @@ type Query { | |||
start: Int @deprecated(reason: "Use `by_slice`"), | |||
end: Int @deprecated(reason: "Use `by_slice`") | |||
by_slice: SliceInput): Locus | |||
region(by_name: RegionNameInput!): Region | |||
regions(by_genome_id: GenomeIdInput!): [Region] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I realised we are missing from the CDM — if we allow queries like this — is a field with some kind of ranking information in Region
.
Consider an example in which you are requesting all human chromosomes. You will likely want to display them in a specific order. There is no way for the client to know, for example, that human chromosome X goes after human chromosome 22, unless someone has manually curated the list of top-level regions, and the backend tells the client this order.
cc: @bethflint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a CDM thing or a thing we want to support in our GraphQL service which is not part of CDM? And where does this data currently reside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a CDM thing to me. One needs to somehow represent the information that the first chromosome of Saccharomyces cerevisiae is chromosome I
, or that human karyotype ends in chromosomes 22, X, Y, MT
, in that order.
And where does this data currently reside?
🤷♂️
do you know @ens-st3 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL [homo_sapiens_core_108_38]> select sr.name, sra.value from attrib_type at, seq_region_attrib sra, seq_region sr where at.attrib_type_id = sra.attrib_type_id and sra.seq_region_id = sr.seq_region_id and at.code = 'karyotype_rank' order by sra.value;
+------+-------+
| name | value |
+------+-------+
| 1 | 1 |
| 10 | 10 |
| 11 | 11 |
| 12 | 12 |
| 13 | 13 |
| 14 | 14 |
| 15 | 15 |
| 16 | 16 |
| 17 | 17 |
| 18 | 18 |
| 19 | 19 |
| 2 | 2 |
| 20 | 20 |
| 21 | 21 |
| 22 | 22 |
| X | 23 |
| Y | 24 |
| MT | 25 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+------+-------+
25 rows in set (0.003 sec)
query = { | ||
"type": "Region", | ||
"genome_id": by_genome_id["genome_id"], | ||
"code": "chromosome", # We filter on chromosomes to avoid returning an overwhelming number of regions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, do you have an overwhelming number of regions? Then this is likely another of our "long lists of things" cases, and we should give the client an option to request different kinds of regions. Possibly making it an optional query parameter, in the absence of which Thoas will respond with top-level regions.
By the way, not all organisms in our databases will have top-level regions that are chromosomes. @ens-st3 has chosen an organism (Chacoan peccary, I believe) whose genome has only been sequenced to a very provisional level (primary assemblies, is it?) for inclusion in Thoas as part of the next batch of species, to confirm that we are able to handle organisms like that. The code above won't be able to find top-level regions for such organisms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because it was mentioned that there would potentially be millions of regions for some assemblies. I think a decision needs to be made about how we handle pagination before we can handle those very large numbers.
Do we know how many non-chromosomal regions the assemblies we know we're going to be handling in the short and mid term are going to have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an attribute in the "attrib_type" table called "Top level" with description "Top Level Non-Redundant Sequence Region" which looks like it could be a better candidate to filter on than "chromosome"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL [catagonus_wagneri_core_108_2]> select count() from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------+
| count() |
+----------+
| 1183730 |
+----------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note I've added it to sta-6 just in case we choose to support these now. If there are similar genomes in rapid, which is prob unlikely then we will need to, but if not it's our choice.
Another 'interesting' example is very long top level sequences, eg
MySQL [monodelphis_domestica_core_108_1]> select max(sr.length) from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------------+
| max(sr.length) |
+----------------+
| 748055161 |
+----------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Steve's first query shows that filtering on "toplevel" wouldn't product us from getting back an overwhelming number of regions, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll exclude the constituent parts of assemblies (contigs, clones etc) which will help massively. There should never be any annotation on these - they should only ever be on toplevel regions - and I suggest there's no reason for them to be in Thoas.
Is 1.2M toplevel sequences overwhelming ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if 1.2M would be overwhelming. Users can already overwhelm the service if they want to by sending very deep queries. I agree with @bethflint , I think we should make a decision on pagination. That would give us a way to limit queries, and then we won't have to worry about them being too heavy.
https://www.ebi.ac.uk/panda/jira/browse/EA-1016
https://www.ebi.ac.uk/panda/jira/browse/EA-1018
https://www.ebi.ac.uk/panda/jira/browse/EA-1017
This PR adds region input queries, associated exceptions, and tests.
Mypy, Pylint, and Pytest all pass.
This is a draft because we cannot push it until we have updated the loading scripts populate regions with genome_ids.