
Fetch >1000 documents with a POST query #233

Closed

erikyao opened this issue Jun 7, 2022 · 12 comments

erikyao (Contributor) commented Jun 7, 2022

Originally from @colleenXu:

To find associations between things, we are mostly doing POST queries to BioThings APIs.

For POST queries, we can retrieve <=1000 records per input (think of a batch query of input IDs like the one below). A POST query can also include up to 1000 inputs per batch.

POST to https://mydisease.info/v1/query?fields=disgenet.xrefs,_id&size=1000 with the body:

```json
{
  "q": "7157,7180,7190",
  "scopes": "disgenet.genes_related_to_disease.gene_id"
}
```
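For illustration, the same batch query via Python's requests (not part of the original report; the URL and field names are taken from the example above):

```python
import requests

# Batch POST query: up to 1000 input IDs per request, and at most
# `size` (capped at 1000) matching documents returned per input.
resp = requests.post(
    "https://mydisease.info/v1/query",
    params={"fields": "disgenet.xrefs,_id", "size": 1000},
    json={
        "q": "7157,7180,7190",  # batch of input gene IDs
        "scopes": "disgenet.genes_related_to_disease.gene_id",
    },
)
resp.raise_for_status()
hits = resp.json()  # flat list of per-input matches (plus any "notfound" entries)
```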

My understanding is that there's only one way to change this situation:

  1. CANNOT DO: fetch_all, since that only works for GET queries (and just using GET queries isn't a viable solution, because not being able to batch-query can slow down multi-hop BTE queries quite a bit).
  2. CAN DO: the only way to get >1000 records per input is to adjust the BioThings API settings, which would likely involve lowering the batch-query limit (e.g. 10000 records per input and 100 IDs per batch). This could perhaps be done on a per-API basis (like specific pending APIs?).

Noting that this has been a discussion topic for a while. For now, we've been okay with keeping things at <=1000 records per input, knowing that we are not getting the complete response, because it is difficult to handle a node attached to lots of other entities...

However, this is known to be more of an issue for APIs that keep many separate records for the same basic association X-related_to-Y. This happens with semmeddb (at least 1 record per publication-association) and some multiomics APIs. These are all on the pending API hub.

erikyao changed the title from "Fetch >1000 records with a POST query" to "Fetch >1000 documents with a POST query" on Jun 7, 2022
erikyao (Contributor, Author) commented Jun 7, 2022

From @newgene:

... looks like Colleen means to get >1000 hits for each input from a POST query.

For a particular pending API, I think we can increase that default size limit (each doc is pretty small, not like the mygene or mychem APIs).

erikyao (Contributor, Author) commented Jun 7, 2022

P.S. GET queries can use scroll_id and fetch_all for this purpose. See https://docs.mygene.info/en/latest/doc/query_service.html?highlight=scroll#scrolling-queries

The Python biothings_client is also capable of this.
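For reference, a minimal sketch of this GET-side workaround with the Python client (using the mydisease.info fields from the example above; note that fetch_all handles one query at a time, so there is no batching):

```python
from biothings_client import get_client

md = get_client("disease")

# fetch_all=True follows scroll_id pages under the hood, streaming all
# hits past the 1000-document cap -- but only for a single query.
hits = md.query(
    "disgenet.genes_related_to_disease.gene_id:7157",
    fields="disgenet.xrefs,_id",
    fetch_all=True,
)
for hit in hits:
    print(hit["_id"])
```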

andrewsu (Member) commented Jun 7, 2022

right, my interest is whether it is possible and desirable to generically implement the fetch_all / scroll_id method (or even the size / from method) on the POST endpoint. Is that worth building into the SDK?

erikyao (Contributor, Author) commented Jun 7, 2022

Hi @andrewsu, technically it's possible to implement in the SDK

namespacestd0 (Contributor) commented

"from" and "size" might already be supported on the query POST endpoint:

'from': {'type': int, 'max': 10000, 'alias': 'skip'},

'size': {'type': int, 'max': 1000, 'alias': 'limit'},
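If those kwargs do work on the POST endpoint, paging could look like the sketch below (an assumption-laden illustration, not confirmed API behavior; the URL and field names come from the example above). It also shows the termination problem discussed next: with no total count, the client pages until a short page appears.

```python
import requests

URL = "https://mydisease.info/v1/query"
PAGE_SIZE = 1000

def page_post_query(q, scopes, fields):
    """Yield all hits, paging with from/size until a short page appears."""
    start = 0
    while True:
        resp = requests.post(
            URL,
            params={"fields": fields, "size": PAGE_SIZE, "from": start},
            json={"q": q, "scopes": scopes},
        )
        resp.raise_for_status()
        page = [h for h in resp.json() if not h.get("notfound")]
        yield from page
        if len(page) < PAGE_SIZE:
            break  # no total count available, so a short page is the only stop signal
        start += PAGE_SIZE
```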

colleenXu commented
(Quick comment, I haven't read this thread yet)

Replying to @namespacestd0 and @erikyao:

I think @tokebe said this was a bit hard to implement because it wasn't clear how many records in total to go through (if using from to "scroll" through them).

newgene (Member) commented Jun 7, 2022

Thanks @namespacestd0! You reminded me of this feature you added for #108.

So, on the BTE side, we need to add support for this.

If it helps, we can also increase the max size setting for particular APIs: if we know the docs are small enough, increasing the limit to 5000 or 10000 won't be an issue for the server. This would be a per-API setting.

tokebe (Member) commented Jun 7, 2022

Just a point of clarity on the comment @colleenXu referenced: it's not so much that it would be difficult, but that BTE would be sending wasteful requests. If it doesn't know from the first response how many records there are (and thus how many times it should page), it has to keep paging until it receives nothing back.

newgene (Member) commented Jun 7, 2022

@tokebe @colleenXu I am pretty sure we can figure out a way to pass this number to BTE.

erikyao (Contributor, Author) commented Jun 15, 2022

Decision 06/15: output the total number of documents in a special field
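A sketch of how client-side paging could use that total, avoiding the wasted trailing request (the field name max_total is hypothetical; the issue only records that the total will appear in "a special field"):

```python
import requests

URL = "https://mydisease.info/v1/query"
PAGE_SIZE = 1000

def paged_hits(q, scopes, fields):
    """Page with from/size, stopping as soon as the reported total is reached."""
    start, total = 0, None
    while total is None or start < total:
        resp = requests.post(
            URL,
            params={"fields": fields, "size": PAGE_SIZE, "from": start},
            json={"q": q, "scopes": scopes},
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break  # defensive stop if the server returns nothing
        total = page[0].get("max_total", 0)  # hypothetical name for the "special field"
        yield from page
        start += PAGE_SIZE
```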

erikyao (Contributor, Author) commented Aug 4, 2022

Related: #49

andrewsu (Member) commented
closing as complete per discussion at https://suwulab.slack.com/archives/CC19LHAF2/p1664553072489639
