Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/classes/collections/<lidvid>/members (and deprecated equivalent) hangs #231

Closed
jimmie opened this issue Jan 17, 2023 · 22 comments · Fixed by #246
Closed

/classes/collections/<lidvid>/members (and deprecated equivalent) hangs #231

jimmie opened this issue Jan 17, 2023 · 22 comments · Fixed by #246
Assignees
Labels
B13.1 bug Something isn't working s.high High severity

Comments

@jimmie
Copy link
Contributor

jimmie commented Jan 17, 2023

🐛 Describe the bug

The endpoints /classes/collections//members and /collections//products hang.

📜 To Reproduce

Steps to reproduce the behavior:

  1. curl https://pds.nasa.gov/api/search/1/collections/urn:nasa:pds:orex.ovirs:data_calibrated/products
  2. no results are returned
  3. request eventually times out w/ a gateway error

🕵️ Expected behavior

Expect a response containing the products contained by the collection to be returned, in a timely manner.

📚 Version of Software Used

registry-api 1.1.12

@jimmie jimmie self-assigned this Jan 17, 2023
@jimmie
Copy link
Contributor Author

jimmie commented Jan 17, 2023

This may be related to the latest/all issue - need to retest once provenance script is in place.

@jordanpadams
Copy link
Member

@jimmie @alexdunnjpl I think this may be the same issue that @nutjob4life identified here but let me know if that is not the case and we can create a new ticket. thanks to @nutjob4life for the thorough detailing of the issue he is seeing

@jimmie
Copy link
Contributor Author

jimmie commented Jan 26, 2023

Yeah, great summary @nutjob4life - thank you.

Confirming that after re-indexing and provenance...er...ing, this issue (i.e. the hanging) remains. Cracking open the CloudWatch logs, I'm seeing what looks to be an infinite loop, will need to examine further but it does seem to be a different issue than the issue Sean identified.

@nutjob4life
Copy link
Member

@jimmie okay, filing my issue separately from this one, thanks!

@jimmie
Copy link
Contributor Author

jimmie commented Jan 26, 2023

Good news: The request eventually completes
Bad news: After 10 minutes
Somewhat OK news: It's due to something with this collection (but not necessarily only this collection)

Regarding the last point, I was able to get results in an acceptable (IMHO) amount of time for this collection:

https://pds.nasa.gov/api/search/1/classes/collections/urn:nasa:pds:mars2020_supercam:data_calibrated_audio::4.0/members

@tloubrieu-jpl
Copy link
Member

@alexdunnjpl @jimmie could you provide an estimate on this ticket ? Thanks

@alexdunnjpl
Copy link
Contributor

First steps: point local API at prod OS instance (@jimmie to provide config/creds shortly) and attempt to replicate and enumerate the set of requests associated with the call to find a pattern.

@alexdunnjpl
Copy link
Contributor

alexdunnjpl commented Feb 14, 2023

Crude log filter yields ~230 outgoing requests with identical OS queries

{"bool":{"must":[{"term":{"collection_lidvid":{"value":"urn:nasa:pds:orex.ovirs:data_calibrated::10.0","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}}

Noting this discussion as possibly-relevant

Issue appears to be that queries to OS for products with membership in the given collection are paged at a rate of 10/page, resulting in ~230 requests and a runtime of ~244sec.

~It's currently unclear what there are 2300 of, given that there are over a million products in the collection - possibly hitting some kind of timeout?

Increasing OS page size to 10000 results in runtime reduction to ~13sec and a single OS query request (down from 230).

No data was or is returned, which seems incorrect - shouldn't it return the first 100 results given no pagination queryparams from the user? This data is present in productLidvids but doesn't seem to be written into the response further up the call stack.

@jordanpadams
Copy link
Member

Status: continuing to dig a bit deeper on the performance issues in order to dig deeper into the bug

@alexdunnjpl
Copy link
Contributor

alexdunnjpl commented Feb 14, 2023

Key outstanding questions:

  • Should 100 data be returned in the no-queryparam call? (Yes)
  • Is the default HitIterator page size of 10 problematically-small, or should the page size be increased just in the context of the calls made for particular routes (in this case collections/products and classes/collections/{id}/members )? (Because the default needs to be safe wrt arbitrary OS deployments to ensure they don't crash. ~500-5000 is a sane rough limit, per @al-niessner )
  • What are the ~2200 data being requested by the hit iterator? Too few for all products, too many for the intended 100 products. (Smells like OpenSearch misconfiguration - it could be that the limit is ~2200, in which case it will be impossible to pull >~2200 products, but ALSO will be impossible to fetch beyond the ~2200th row)
  • Should the requests made to OpenSearch be mirrors or at least sane in the context of the pagination-related queryparams provided by the user? (this may already be the case, but haven't confirmed that in the context of the questionable 2200 data being retrieved)

@tloubrieu-jpl
Copy link
Member

This request directly on the opensearch node responds in acceptable time:
https://pds.nasa.gov/api/search-sbnpsi/1/classes/collections/urn:nasa:pds:orex.ovirs:data_calibrated/members

But the one going through CCS, takes longer and reach the time out limit:
https://pds.nasa.gov/api/search/1/classes/collections/urn:nasa:pds:orex.ovirs:data_calibrated/members

@tloubrieu-jpl
Copy link
Member

Next steps:

  • understand why/what are the ~2200 entities are requested at some point in the process (see comment above)
  • confirm the performance issue comes from CCS and decide on priority on this ticket over the setup of a multitenant opensearch.

@alexdunnjpl
Copy link
Contributor

alexdunnjpl commented Feb 15, 2023

Turns out the 2290 entities are actually pages of lidvids with size 500.

2290*500=1145000, corresponding with the 1144754 products having membership in the collection. The page size 500 is not defined anywhere in the API codebase and does not vary with HitIterator.size

@al-niessner I know we rejected "is each hit a list of concatenated lidvids that gets expanded" but that appears to be exactly what's happening:

  1. API instantiates a HitIterator which makes calls to OpenSearch, paged by HitIterator.size
  2. Each of the elements iterated over by HitIterator is an OpenSearch SearchHit, whose "source" is a BytesArray which gets resolved or expanded by SourceLookup.sourceAsMap(this.source) and contains something like
{"product_lidvid" : ["urn:nasa:pds:orex.ovirs:data_calibrated:20190523t225311s377_ovr_scil2_calv2.fits::1.0" ...]}

I'm guessing that 500 is some OpenSearch-internal paging configuration value?

@tloubrieu-jpl If my understanding of all this is correct, I think we're good to

  • Increase the HitIterator page size to 5000 to improve performance by large margin (saves one query and about one second per 5000 (10*500) returned products compared to initial value of 10).
  • Fix the fact that no "sample page" of data is being returned by default, which iirc was the intended functionality of that endpoint when no query parameters are provided for number and starting index of results?

@al-niessner
Copy link
Contributor

@alexdunnjpl @tloubrieu-jpl

The 500 is something we pick. Now when it comes to pagination, the page is based on the product_lidvids not the hit iterator returned. The registry-api should select its page out of 1.1 million not the 2290. Does that make sense? Unfortunately, you still have to load all 2290 because we have to return the total number possible (2290*500) - some of those are not a full 500 and you do not know which.

I agree that setting it to 5000 seems just fine or the users page when doing the second look up. Does it still timeout or return nothing?

@alexdunnjpl
Copy link
Contributor

@al-niessner yeah that all scans, thanks!

No timeout (nor previously - just took ages when Jimmie was testing). It returns nothing when no query params are provides (e.g. /classes/collections/urn:nasa:pds:orex.ovirs:data_calibrated/members) but I assume will perform as expected when begin/limit query params are provided (that's next on my list to confirm for safety's sake).

The outstanding question is whether a sample first page of results should be returned in the no-args case - in the meeting yesterday @tloubrieu-jpl said yes, but I'm wondering if it's intentional (because 100% of non-dev use will need to provide those params to do anything meaningful).

@al-niessner
Copy link
Contributor

@alexdunnjpl @tloubrieu-jpl

Defaults for pagination are always start=0 limit=100 (or something like that) so it should be returning the first 100 lidvids if you did not override limit.

If you provide limit=0 or summary-only then it should not return any results but limit=0 is not or should not be the default.

@al-niessner
Copy link
Contributor

@alexdunnjpl

A while back, September when I was not here, @jimmie had to reinstate summary-only (I have strong dislike of it) because it appeared limit=0 was not working. I think I left it in place after that but may have butchered it up in the process. Ideally, it would be nice to kill summary-only and use limit=0 instead.

@alexdunnjpl
Copy link
Contributor

Gotcha, thanks Al.

@tloubrieu-jpl I'll open a PR for this issue and open a new ticket for the no-data thing.

@gxtchen
Copy link

gxtchen commented Apr 24, 2023

@tloubrieu-jpl How can I prove this is fixed? Is the same data on a test server that I can provide the query doesn't time out?

@tloubrieu-jpl
Copy link
Member

@gxtchen I believe this ticket has been fixed by an update in the OpenSearch configuration in production, so you should be able to validate that this ticket has been fixed on the production server. If not on gamma where I will deploy the latest version of the registry-api today.

@alexdunnjpl can you confirm ?

@alexdunnjpl
Copy link
Contributor

@tloubrieu-jpl @gxtchen from memory and a quick skim, testing will require performing the query against this specific collection, with an updated build of registry-api containing #246

Probably the simplest way, given that I don't believe this bundle/collection is loaded in test, is to deploy on gamma and temporarily change the registry-api configuration (application.properties) to target the prod cluster and allow testing against the prod OS db (read-only).

@tloubrieu-jpl
Copy link
Member

Thanks @alexdunnjpl , @gxtchen the latest version of the API is deployed on gamma and uses the Opensearch production server as a backend, so you can use that for tests.

For example from URL:
https://pds.nasa.gov/api/search/1/classes/collections

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B13.1 bug Something isn't working s.high High severity
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants