
Fetch >1000 documents with a POST query #233

Closed

erikyao opened this issue Jun 7, 2022 · 12 comments

erikyao (Contributor) commented Jun 7, 2022

Originally from @colleenXu:

To find associations between things, we are mostly doing POST queries to BioThings APIs.

For POST queries, we can retrieve <=1000 records per input (think of a batch query of input IDs like the one below). A POST query can also include up to 1000 inputs per batch.

POST to https://mydisease.info/v1/query?fields=disgenet.xrefs,_id&size=1000 with the body:

```json
{
  "q": "7157,7180,7190",
  "scopes": "disgenet.genes_related_to_disease.gene_id"
}
```
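For illustration, the same batch query via Python's requests (not part of the original report; the URL and field names are taken from the example above):

```python
import requests

# Batch POST query: up to 1000 input IDs per request, and at most
# `size` (capped at 1000) matching documents returned per input.
resp = requests.post(
    "https://mydisease.info/v1/query",
    params={"fields": "disgenet.xrefs,_id", "size": 1000},
    json={
        "q": "7157,7180,7190",  # batch of input gene IDs
        "scopes": "disgenet.genes_related_to_disease.gene_id",
    },
)
resp.raise_for_status()
hits = resp.json()  # flat list of per-input matches (plus any "notfound" entries)
```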

My understanding is that there's only one way to change this situation:

  1. CANNOT DO: fetch_all, since that only works for GET queries (and just using GET queries isn't a viable solution, because not being able to batch-query can slow down multi-hop BTE queries quite a bit).
  2. CAN DO: the only way to get >1000 records per input is to adjust the BioThings API settings, which would likely involve lowering the batch-query limit (e.g. 10000 records per input and 100 IDs per batch). This could perhaps be done on a per-API basis (like specific pending APIs?).

Noting that this has been a discussion topic for a while. For now, we've been okay with keeping things at <=1000 records per input, knowing that we are not getting the complete response, because it is difficult to handle a node attached to lots of other entities...

However, this is known to be more of an issue for APIs that keep many separate records for the same basic association X-related_to-Y. This happens with semmeddb (at least 1 record per publication-association) and some multiomics APIs. These are all on the pending API hub.

erikyao changed the title from "Fetch >1000 records with a POST query" to "Fetch >1000 documents with a POST query" on Jun 7, 2022
erikyao (Contributor, Author) commented Jun 7, 2022

From @newgene:

... looks like Colleen means to get >1000 hits for each input from a POST query.

For a particular pending API, I think we can increase that default size limit (each doc is pretty small, not like the mygene or mychem APIs).

erikyao (Contributor, Author) commented Jun 7, 2022

P.S. GET queries can use scroll_id and fetch_all for this purpose. See https://docs.mygene.info/en/latest/doc/query_service.html?highlight=scroll#scrolling-queries

The Python biothings_client is also capable of this.
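For reference, a minimal sketch of this GET-side workaround with the Python client (using the mydisease.info fields from the example above; note that fetch_all handles one query at a time, so there is no batching):

```python
from biothings_client import get_client

md = get_client("disease")

# fetch_all=True follows scroll_id pages under the hood, streaming all
# hits past the 1000-document cap -- but only for a single query.
hits = md.query(
    "disgenet.genes_related_to_disease.gene_id:7157",
    fields="disgenet.xrefs,_id",
    fetch_all=True,
)
for hit in hits:
    print(hit["_id"])
```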

andrewsu (Member) commented Jun 7, 2022

right, my interest is whether it is possible and desirable to generically implement the fetch_all / scroll_id method (or even the size / from method) on the POST endpoint. Is that worth building into the SDK?

erikyao (Contributor, Author) commented Jun 7, 2022

Hi @andrewsu, technically it's possible to implement in the SDK

namespacestd0 (Contributor) commented

"from" and "size" might already be supported on the query POST endpoint:

'from': {'type': int, 'max': 10000, 'alias': 'skip'},

'size': {'type': int, 'max': 1000, 'alias': 'limit'},
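If those kwargs do work on the POST endpoint, paging could look like the sketch below (an assumption-laden illustration, not confirmed API behavior; the URL and field names come from the example above). It also shows the termination problem discussed next: with no total count, the client pages until a short page appears.

```python
import requests

URL = "https://mydisease.info/v1/query"
PAGE_SIZE = 1000

def page_post_query(q, scopes, fields):
    """Yield all hits, paging with from/size until a short page appears."""
    start = 0
    while True:
        resp = requests.post(
            URL,
            params={"fields": fields, "size": PAGE_SIZE, "from": start},
            json={"q": q, "scopes": scopes},
        )
        resp.raise_for_status()
        page = [h for h in resp.json() if not h.get("notfound")]
        yield from page
        if len(page) < PAGE_SIZE:
            break  # no total count available, so a short page is the only stop signal
        start += PAGE_SIZE
```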

colleenXu commented
(Quick comment, I haven't read this thread yet)

Replying to @namespacestd0 and @erikyao:

I think @tokebe said this was a bit hard to implement because it wasn't clear how many records in total to go through (if using from to "scroll" through them).

newgene (Member) commented Jun 7, 2022

Thanks @namespacestd0! You reminded me of this feature you added for #108.

So, on the BTE side, we need to add support for this.

If it helps, we can also increase the max size setting for particular APIs: if we know the docs are small enough, increasing the limit to 5000 or 10000 won't be an issue for the server. This would be a per-API setting.

tokebe (Member) commented Jun 7, 2022

Just a point of clarity on the comment @colleenXu referenced: it's not so much that it would be difficult, but that BTE would be sending wasteful requests. If it doesn't know from the first response how many records there are (and thus how many times it should page), it has to keep paging until it receives nothing back.

newgene (Member) commented Jun 7, 2022

@tokebe @colleenXu I am pretty sure we can figure out a way to pass this number to BTE.

erikyao (Contributor, Author) commented Jun 15, 2022

Decision 06/15: output the total number of documents in a special field
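A sketch of how client-side paging could use that total, avoiding the wasted trailing request (the field name max_total is hypothetical; the issue only records that the total will appear in "a special field"):

```python
import requests

URL = "https://mydisease.info/v1/query"
PAGE_SIZE = 1000

def paged_hits(q, scopes, fields):
    """Page with from/size, stopping as soon as the reported total is reached."""
    start, total = 0, None
    while total is None or start < total:
        resp = requests.post(
            URL,
            params={"fields": fields, "size": PAGE_SIZE, "from": start},
            json={"q": q, "scopes": scopes},
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break  # defensive stop if the server returns nothing
        total = page[0].get("max_total", 0)  # hypothetical name for the "special field"
        yield from page
        start += PAGE_SIZE
```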

erikyao (Contributor, Author) commented Aug 4, 2022

Related: #49

andrewsu (Member) commented
closing as complete per discussion at https://suwulab.slack.com/archives/CC19LHAF2/p1664553072489639
