API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results #806

Closed
kheal opened this issue Dec 2, 2024 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@kheal

kheal commented Dec 2, 2024

Describe the bug
When using pagination with the "/nmdcschema/{collection_name}" endpoint (with the functional_annotation_agg collection), every response contains a next_page_token, which causes an infinite loop in fetching scripts.

To Reproduce

import requests
import time

# Get initial results (before next_page_token is given in the results)
collection = "functional_annotation_agg"
fields="was_generated_by"
max_page_size = 5000
filter = '{"was_generated_by":{"$regex":"^nmdc:wfmp"}}'

time_start = time.time()
result_list = []
og_url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&projection={fields}"
resp = requests.get(og_url)
initial_data = resp.json()
results = initial_data.get("resources", [])
i = 0

# append first page of results to an empty list
for result in results:
    result_list.append(result)

# if there are multiple pages of results returned
if initial_data.get("next_page_token"):
    next_page_token = initial_data["next_page_token"]

    while True:
        i = i + max_page_size
        print(str(i) + " records processed")
        url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"
        response = requests.get(url)
        data_next = response.json()

        results = data_next.get("resources", [])
        result_list.extend(results)
        next_page_token = data_next.get("next_page_token")

        if not next_page_token:
            break

print(len(result_list), "records found in", round(time.time() - time_start, 2), "seconds")

Expected behavior

With max_page_size = 0, the script reports 25118 records found in 0.51 seconds.

While pagination may take longer, I expect to retrieve the same number of records before the next_page_token is null and the loop breaks. Instead, if max_page_size = 5000, the loop never seems to break (I got it up to >100,000 records before recording this bug).

While this isn't an immediate blocker, we need some mechanism to fetch these records with pagination to sustain the aggregator.

Acceptance Criteria

  • Pagination on the "/nmdcschema/{collection_name}" endpoint (with the functional_annotation_agg collection) should return the same number of records as a non-paginated request.

Additional context
We use this API call for gathering the IDs of the previously-aggregated workflows (see microbiomedata/nmdc-aggregator#27)

@kheal kheal added the bug Something isn't working label Dec 2, 2024
@kheal
Author

kheal commented Dec 2, 2024

@dwinston @eecavanna - not sure if I've assigned the right folks, just wanted to make sure this was going to get the attention of the appropriate people.

@kheal
Author

kheal commented Dec 2, 2024

I tested this with a different collection and I'm seeing the same results. Homing in a bit, I don't think the next_page_token is actually giving the next page's results in this endpoint.

https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter={"type":"nmdc:MetaproteomicsAnalysis"}&max_page_size=10&projection='

That request returns a next_page_token of nmdc:sys08k2w6w51, but when I try to get the next page's results with this URL:
https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter={"type":"nmdc:MetaproteomicsAnalysis"}&next_page_token=nmdc:sys08k2w6w51&max_page_size=10&projection='

Both requests return the exact same results. Hopefully this helps narrow down the bug. I've updated the title to match.
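The comparison described above can be automated. The following is only a minimal sketch: `same_page` is a hypothetical helper (not part of any NMDC client), and it assumes each response has already been decoded into a dict with the `resources` and `next_page_token` keys seen in this thread.

```python
def same_page(page_a, page_b):
    """Return True if two decoded responses carry identical resources and token."""
    same_resources = page_a.get("resources", []) == page_b.get("resources", [])
    same_token = page_a.get("next_page_token") == page_b.get("next_page_token")
    return same_resources and same_token
```

If this returns True for the first response and for the response to the "next page" request, the server did not advance, and a loop that only stops on a missing token will not terminate.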

@kheal kheal changed the title BUG: API endpoint "/nmdcschema/{collection_name}" gets stuck in infinite loop (with functional_annotation_agg) collection BUG: API endpoint "/nmdcschema/{collection_name}"'s next_page_token doesn't yield next page's results Dec 2, 2024
@eecavanna eecavanna changed the title BUG: API endpoint "/nmdcschema/{collection_name}"'s next_page_token doesn't yield next page's results BUG: API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results Dec 2, 2024
@eecavanna
Collaborator

Thanks for reporting this, and for doing so this thoroughly. This does seem like a bug to me.

I think you have assigned the right people, given our current procedures. In the future, the procedure might change to: assign it to me and I'd do an initial assessment and pull in others as needed, so they can otherwise focus on whatever they've committed to for the sprint; but I'm not aware of that being the standard procedure yet (just a potential one, based upon some recent conversations).

Two things:

  1. I'll look into this now.
  2. The attendees of today's release planning meeting talked about this ticket. We were trying to decide whether to hold up the December 16 release for this bug. A couple of us saw that you wrote "While this isn't an immediate blocker...". It sounds to me like you have a workaround for this bug (the workaround being to not use pagination)—is that correct? If so, while we'll try to fix the bug, we won't hold up the release for it.

@eecavanna
Collaborator

I'll remove @dwinston as an assignee for now and will reassign him with commentary after I do some digging.

@eecavanna
Collaborator

eecavanna commented Dec 2, 2024

Hi @kheal, I think the issue is that the endpoint is expecting the URL query parameter to be named page_token as opposed to next_page_token.

I see it being named next_page_token in the Python snippet in the issue description:

    while True:
        i = i + max_page_size
        print(str(i) + " records processed")
        url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"

I assume the endpoint is not seeing that token and so is always returning the first page of results (and the same next_page_token value).

Will you retry with the parameter being named page_token instead?

- url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"
+ url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&page_token={next_page_token}&projection={fields}"
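For reference, the whole loop with the renamed parameter might look like the sketch below. It is not the aggregator's actual code: `fetch_all`, `get_json`, and the repeated-token guard are illustrative additions, and it uses the standard library rather than `requests` so that it has no dependencies. The `fetch` argument exists only so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api-dev.microbiomedata.org/nmdcschema"


def get_json(url, params):
    """Fetch `url` with URL-encoded query `params` and decode the JSON body."""
    with urlopen(f"{url}?{urlencode(params)}") as resp:
        return json.load(resp)


def fetch_all(collection, filter_json, fields, max_page_size=5000, fetch=get_json):
    """Collect every page of results, sending the token back as `page_token`."""
    params = {
        "filter": filter_json,
        "max_page_size": max_page_size,
        "projection": fields,
    }
    records = []
    seen_tokens = set()
    while True:
        page = fetch(f"{BASE_URL}/{collection}", params)
        records.extend(page.get("resources", []))
        token = page.get("next_page_token")
        if not token:
            return records
        if token in seen_tokens:
            # Guard (an addition of this sketch): a repeated token means the
            # server is not advancing, so stop instead of looping forever.
            raise RuntimeError(f"next_page_token {token!r} was returned twice")
        seen_tokens.add(token)
        # The response field is `next_page_token`, but the request
        # parameter is `page_token`.
        params = {**params, "page_token": token}
```

With the values from the issue description, a call would be `fetch_all("functional_annotation_agg", '{"was_generated_by":{"$regex":"^nmdc:wfmp"}}', "was_generated_by")`. Building the query with `urlencode` also percent-encodes the JSON filter, as in the curl commands later in this thread.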

@eecavanna eecavanna changed the title BUG: API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results Dec 2, 2024
@eecavanna
Collaborator

I confirmed that, when I name the query parameter page_token, the API returns a different set of resources and a different next_page_token value for each "page".

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection='
{"resources":[{"id":"nmdc:wfmp-11-1ky3j817.1", ...}, ...],"next_page_token":"nmdc:sys0dyggry73"}%

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection=&page_token=nmdc:sys0dyggry73'
{"resources":[{"id":"nmdc:wfmp-11-6n1xme37.1", ...}, ...],"next_page_token":"nmdc:sys0jjz9ym68"}%

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection=&page_token=nmdc:sys0jjz9ym68'
{"resources":[{"id":"nmdc:wfmp-11-f6sfn088.1", ...}, ...],"next_page_token":"nmdc:sys06714m695"}%

@eecavanna eecavanna moved this from In Progress to In Review in 2024 - Sprint 51 - December 2 - 13, 2024 Dec 2, 2024
@kheal
Author

kheal commented Dec 2, 2024

User error is my favorite type of bug - sorry for the unnecessary chatter. I'll test and close the issue after fixing the query parameter name. Maybe the real issue is that there is no feedback for sending incorrect parameter keys.

@eecavanna
Collaborator

Thanks for including the reproduction info! I think that really reduced the time involved in identifying the root cause.

Maybe the real issue is that there is no feedback for sending incorrect parameter keys.

I'm on my way to a meeting from 2-3pm. I'll file a ticket about that after the meeting in case one doesn't exist by then.
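Until the API itself gives such feedback, a script could check its own URLs before sending them. This is only a client-side sketch under stated assumptions: `unknown_query_params` and `KNOWN_PARAMS` are hypothetical names, and the set lists only the parameter names that appear in this thread, so the real endpoint may accept others.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names that appear in this thread; the real endpoint may accept others.
KNOWN_PARAMS = {"filter", "max_page_size", "page_token", "projection"}


def unknown_query_params(url, known=KNOWN_PARAMS):
    """Return the sorted query parameter names in `url` that are not in `known`."""
    query = urlsplit(url).query
    # keep_blank_values=True so that an empty parameter like `projection=` is still checked.
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return sorted(set(names) - set(known))
```

Applied to the URL from the original report, this flags `next_page_token` as unrecognized, which is the mistake that caused the repeated first page.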

3 participants