API endpoint `/nmdcschema/{collection_name}`'s `next_page_token` doesn't yield next page's results #806

kheal · 2024-12-02T16:34:29Z

Describe the bug
When using pagination in the "/nmdcschema/{collection_name}" endpoint (with the functional_annotation_agg collection), the results always contain a next_page_token causing an infinite loop in fetching scripts.

To Reproduce

import requests
import time

# Get initial results (before next_page_token is given in the results)
collection = "functional_annotation_agg"
fields="was_generated_by"
max_page_size = 5000
filter = '{"was_generated_by":{"$regex":"^nmdc:wfmp"}}'

time_start = time.time()
result_list = []
og_url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&projection={fields}"
resp = requests.get(og_url)
initial_data = resp.json()
results = initial_data.get("resources", [])
i = 0

# append first page of results to an empty list
for result in results:
    result_list.append(result)

# if there are multiple pages of results returned
if initial_data.get("next_page_token"):
    next_page_token = initial_data["next_page_token"]

    while True:
        i = i + max_page_size
        print(str(i) + " records processed")
        url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"
        response = requests.get(url)
        data_next = response.json()

        results = data_next.get("resources", [])
        result_list.extend(results)
        next_page_token = data_next.get("next_page_token")

        if not next_page_token:
            break

print(len(result_list), "records found in", round(time.time() - time_start, 2), "seconds")

Expected behavior

With max_page_size = 0, the script prints "25118 records found in 0.51 seconds".

While pagination may take longer, I expect to retrieve the same number of records before the next_page_token is null and the loop breaks. Instead, if max_page_size = 5000, the loop never seems to break (I got it up to >100,000 records before recording this bug).

While this isn't an immediate blocker, we need some mechanism to fetch these records with pagination to sustain the aggregator.

Acceptance Criteria

Pagination on the "/nmdcschema/{collection_name}" endpoint with the functional_annotation_agg collection should return the same number of records as a non-paginated request.

Additional context
We use this API call for gathering the IDs of the previously-aggregated workflows (see microbiomedata/nmdc-aggregator#27)


kheal · 2024-12-02T16:35:22Z

@dwinston @eecavanna - not sure if I've assigned the right folks, just wanted to make sure this was going to get the attention of the appropriate people.

kheal · 2024-12-02T17:00:44Z

I tested this with a different collection and I'm seeing the same results. Homing in a bit, I don't think the next_page_token is actually giving the next page's results in this endpoint.

https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter={"type":"nmdc:MetaproteomicsAnalysis"}&max_page_size=10&projection='

returns a next_page_token of nmdc:sys08k2w6w51, but when trying to get the next page's results with this url
https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter={"type":"nmdc:MetaproteomicsAnalysis"}&next_page_token=nmdc:sys08k2w6w51&max_page_size=10&projection='

They return the exact same results. Hopefully this helps narrow down the bug. I've updated the title to match.
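A fetching script can protect itself against this kind of stall by stopping as soon as a page token repeats. The following is only a minimal sketch: `iter_pages` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to be any callable that takes a page token (or `None` for the first page) and returns the decoded JSON response as a dict.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[list]:
    """Yield each page's resources; raise if a page token repeats.

    `fetch_page` is a hypothetical callable (not part of the NMDC API):
    it receives a page token, or None for the first page, and returns
    the decoded JSON response.
    """
    seen_tokens = set()
    token = None
    while True:
        data = fetch_page(token)
        yield data.get("resources", [])
        token = data.get("next_page_token")
        if not token:
            return  # no more pages
        if token in seen_tokens:
            # The same token came back twice, so the server is not advancing.
            raise RuntimeError(f"page token {token!r} repeated; pagination is not advancing")
        seen_tokens.add(token)
```

With a guard like this, a request that keeps returning the first page fails fast instead of accumulating duplicate records.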

eecavanna · 2024-12-02T20:20:53Z

Thanks for reporting this, and for doing so this thoroughly. This does seem like a bug to me.

I think you have assigned the right people, given our current procedures. In the future, the procedure might change to: assign it to me and I'd do an initial assessment and pull in others as needed, so they can otherwise focus on whatever they've committed to for the sprint; but I'm not aware of that being the standard procedure yet (just a potential one, based upon some recent conversations).

Two things:

1. I'll look into this now.
2. The attendees of today's release planning meeting talked about this ticket. We were trying to decide whether to hold up the December 16 release for this bug. A couple of us saw that you wrote "While this isn't an immediate blocker...". It sounds to me like you have a workaround for this bug (the workaround being to not use pagination). Is that correct? If so, while we'll try to fix the bug, we won't hold up the release for it.

eecavanna · 2024-12-02T20:21:23Z

I'll remove @dwinston as an assignee for now and will reassign him with commentary after I do some digging.

eecavanna · 2024-12-02T20:34:51Z

Hi @kheal, I think the issue is that the endpoint is expecting the URL query parameter to be named page_token as opposed to next_page_token.

I see it being named next_page_token in the Python snippet in the issue description:

    while True:
        i = i + max_page_size
        print(str(i) + " records processed")
        url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"

I assume the endpoint is not seeing that token and so is always returning the first page of results (and the same next_page_token value).

Will you retry with the parameter being named page_token instead?

- url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&next_page_token={next_page_token}&projection={fields}"
+ url = f"https://api-dev.microbiomedata.org/nmdcschema/{collection}?&filter={filter}&max_page_size={max_page_size}&page_token={next_page_token}&projection={fields}"

eecavanna · 2024-12-02T20:49:34Z

I confirmed that, when I name the query parameter page_token, the API returns a different set of resources and a different next_page_token value for each "page".

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection='
{"resources":[{"id":"nmdc:wfmp-11-1ky3j817.1", ...}, ...],"next_page_token":"nmdc:sys0dyggry73"}%

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection=&page_token=nmdc:sys0dyggry73'
{"resources":[{"id":"nmdc:wfmp-11-6n1xme37.1", ...}, ...],"next_page_token":"nmdc:sys0jjz9ym68"}%

$ curl -X GET 'https://api-dev.microbiomedata.org/nmdcschema/workflow_execution_set?&filter=%7B%22type%22%3A%22nmdc%3AMetaproteomicsAnalysis%22%7D&max_page_size=10&projection=&page_token=nmdc:sys0jjz9ym68'
{"resources":[{"id":"nmdc:wfmp-11-f6sfn088.1", ...}, ...],"next_page_token":"nmdc:sys06714m695"}%
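For reference, here is a sketch of the full pagination loop with the token sent back as `page_token`. It is an illustration under assumptions, not the aggregator's actual code: `fetch_all` and `get_json` are hypothetical helper names, the standard library's `urllib` stands in for `requests` so the example has no third-party dependencies, and the `fetch` argument exists only so the HTTP call can be swapped out. `urlencode` also takes care of percent-encoding the JSON filter, as in the curl commands above.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

BASE_URL = "https://api-dev.microbiomedata.org/nmdcschema"


def get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fetch_all(collection: str, filter_json: str, fields: str,
              max_page_size: int = 5000,
              fetch: Callable[[str], dict] = get_json) -> list:
    """Collect the resources from every page of a collection query."""
    results = []
    page_token: Optional[str] = None
    while True:
        params = {"filter": filter_json,
                  "max_page_size": max_page_size,
                  "projection": fields}
        if page_token:
            # The response field is `next_page_token`, but the request
            # parameter must be named `page_token`.
            params["page_token"] = page_token
        url = f"{BASE_URL}/{collection}?{urllib.parse.urlencode(params)}"
        data = fetch(url)
        results.extend(data.get("resources", []))
        page_token = data.get("next_page_token")
        if not page_token:
            return results
```

A call such as `fetch_all("functional_annotation_agg", '{"was_generated_by":{"$regex":"^nmdc:wfmp"}}', "was_generated_by")` would then mirror the query from the issue description.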

kheal · 2024-12-02T21:06:11Z

User error is my favorite type of bug - sorry for the unnecessary chatter. I'll test and close the issue after fixing the query parameter name. Maybe the real issue is that there is no feedback for sending incorrect parameter keys.

eecavanna · 2024-12-02T21:58:23Z

Thanks for including the reproduction info! I think that really reduced the time involved in identifying the root cause.

> Maybe the real issue is that there is no feedback for sending incorrect parameter keys.

I'm on my way to a meeting from 2-3pm. I'll file a ticket about that after the meeting in case one doesn't exist by then.
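Until the API reports unrecognized parameter keys itself, a client can check its own query before sending it. This is a hedged sketch only: `KNOWN_PARAMS` and `unknown_params` are hypothetical names, and the set lists just the parameter names that appear in this thread, so the endpoint may well accept others.

```python
# Assumption: only the query parameter names seen in this thread.
# This is not an authoritative list of what the endpoint accepts.
KNOWN_PARAMS = {"filter", "max_page_size", "page_token", "projection"}


def unknown_params(params: dict) -> list:
    """Return the parameter names that are not in KNOWN_PARAMS, sorted."""
    return sorted(set(params) - KNOWN_PARAMS)


# The misnamed key from the original script is flagged:
print(unknown_params({"filter": "{}", "next_page_token": "nmdc:sys08k2w6w51"}))
# prints ['next_page_token']
```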

kheal added the bug label Dec 2, 2024

kheal assigned dwinston and eecavanna Dec 2, 2024

kheal changed the title ~~BUG: API endpoint "/nmdcschema/{collection_name}" gets stuck in infinite loop (with functional_annotation_agg) collection~~ BUG: API endpoint "/nmdcschema/{collection_name}"'s next_page_token doesn't yield next page's results Dec 2, 2024

eecavanna changed the title ~~BUG: API endpoint "/nmdcschema/{collection_name}"'s next_page_token doesn't yield next page's results~~ BUG: API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results Dec 2, 2024

eecavanna unassigned dwinston Dec 2, 2024

eecavanna added this to 2024 - Sprint 51 - December 2 - 13, 2024 Dec 2, 2024

eecavanna moved this to In Progress in 2024 - Sprint 51 - December 2 - 13, 2024 Dec 2, 2024

eecavanna changed the title ~~BUG: API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results~~ API endpoint /nmdcschema/{collection_name}'s next_page_token doesn't yield next page's results Dec 2, 2024

eecavanna moved this from In Progress to In Review in 2024 - Sprint 51 - December 2 - 13, 2024 Dec 2, 2024

kheal closed this as completed Dec 2, 2024

github-project-automation bot moved this from In Review to Done in 2024 - Sprint 51 - December 2 - 13, 2024 Dec 2, 2024
