Deploy repairkit sweeper to delta and prod #61
Comments
@alexdunnjpl are we just deploying a new registry-sweeper image to prod? Or are there other steps that need to be completed for this task?
@sjoshi-jpl yeah, just a standard ad-hoc redeployment, then checking to make sure it executes successfully in prod. I'll push the image now.
Image is pushed, @sjoshi-jpl to confirm that tasks successfully execute. @sjoshi-jpl do we already have a deployment targeting delta OpenSearch, or just prod?
@alexdunnjpl @tloubrieu-jpl after running tasks multiple times for each domain, here are the findings:
@alexdunnjpl right now we're not running anything for the delta cluster; we could create a task definition with the newly pushed image to test in delta. Does this answer your question?
Some nodes take too long to be processed (img, psi, rms).
Update -
@nutjob4life can you chat with @sjoshi-jpl and try to help debug the 504 issues he is seeing on those 2 registries?
@jordanpadams will do. @sjoshi-jpl, I'll hit you up on Slack.
FYI, met with @sjoshi-jpl to try and debug and brainstorm what's going on here. We decided to use the previous image with ATM and GEO (although these images were untagged in ECR, they thankfully had unique URIs, and the AWS task definition service lets you specify an image by URI) and manually launched the sweepers for those two nodes. They worked fine, so the issue seems to be related to the RepairKit additions. I'm going to be reviewing those commits with a closer eye.
@nutjob4life I'm like... >80% sure that the issue would be resolved by streaming updates through the bulk write call and letting the write function handle flushing the writes, rather than making one bulk write call per doc update. That, however, requires an update to the interface of that function - it really should take an iterable of update objects/dicts, to allow one to throw a lazy/generator expression at it. Minor changes to the other two sweepers will be necessary to reflect such a change, which is why I didn't just do it as a quick addendum to #54. Happy to take that on if that's easier, as I'm waiting on comms for my other high-priority ticket.
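To illustrate the streaming approach described above, here is a minimal sketch assuming the sweepers use the opensearch-py client; the function name, index name, and update shape are illustrative, not the actual sweeper interface:

```python
from opensearchpy import OpenSearch
from opensearchpy.helpers import streaming_bulk


def write_updated_docs(client: OpenSearch, updates, index_name: str):
    """Accept an iterable (e.g. a generator) of (doc_id, partial_doc) pairs and
    let the bulk helper handle batching/flushing, instead of issuing one bulk
    call per document update."""

    def actions():
        for doc_id, partial_doc in updates:
            yield {
                "_op_type": "update",
                "_index": index_name,
                "_id": doc_id,
                "doc": partial_doc,
            }

    # streaming_bulk consumes the generator lazily and flushes in chunks,
    # so the caller never materializes the full update set in memory.
    for ok, item in streaming_bulk(client, actions(), chunk_size=500):
        if not ok:
            print(f"update failed: {item}")
```

A caller could then pass a generator expression of updates straight through, which is the interface change described above.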
@nutjob4life @alexdunnjpl Since last week, the PSA node has been throwing CPU/memory alerts and consuming most of the compute allocated to the task. I increased it from 1 vCPU / 4 GB to 2 vCPU / 16 GB, but memory utilization is still over 95%.
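For context, those CPU/memory values live on the ECS task definition, so bumping them means registering a new task-definition revision. A boto3 sketch follows; the family name, image URI, and region are placeholders, not the actual PDS deployment:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")  # placeholder region

# Register a new revision of a (hypothetical) sweeper task definition with
# more compute; Fargate requires cpu/memory to be set at the task level.
ecs.register_task_definition(
    family="registry-sweepers-psa",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="2048",      # 2 vCPU
    memory="16384",  # 16 GB
    containerDefinitions=[
        {
            "name": "registry-sweepers",
            "image": "<account>.dkr.ecr.us-west-2.amazonaws.com/registry-sweepers:latest",
            "essential": True,
        }
    ],
)
```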
@nutjob4life tried the OpenSearch Python bulk API without success.
Per conversation with the team yesterday, @alexdunnjpl @nutjob4life will be implementing the bulk update changes, after which we will need to re-test all nodes to ensure the issues with ATM, GEO and PSA are resolved.
Update: after testing the bulk update, the ATM node is completing successfully. PSA still needs 4 vCPU and 30 GB RAM to complete.
I've opened DSIO #4457 to enable slow logs and help with further troubleshooting.
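For anyone following along, on a managed AWS OpenSearch domain the slow logs first have to be published to CloudWatch at the domain level (presumably what the DSIO ticket covers); after that they are switched on per index via the settings API. A minimal sketch with a placeholder endpoint, index pattern, and thresholds:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder endpoint

# Enable search slow logs on the registry indices. The index pattern and
# thresholds here are illustrative, not the values used for DSIO #4457.
client.indices.put_settings(
    index="registry*",
    body={
        "index.search.slowlog.threshold.query.warn": "10s",
        "index.search.slowlog.threshold.query.info": "5s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
    },
)
```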
#70 is going to be the solution for this ticket |
Some errors remain; @sjoshi-jpl and @alexdunnjpl will discuss them.
Remaining errors are due to a lack of resources on ECS.
Clarification: ATM/GEO errors are suspected to be due to insufficient ECS instance sizing. @sjoshi-jpl has submitted an SA ticket to resize; the SAs have actioned it, and results should be available by COB today.
GEO errors (and probably ATM's as well - need to confirm) have been narrowed down to the fact that the documents are huge compared to other nodes' - 1000 docs return ~45 MB, so the default page size of 10000 docs causes internal overflows.
There are a few potential options for resolution:
I'll implement option 1 after testing properly against GEO and ATM, and we can re-visit later if additional flexibility is needed.
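To illustrate the direction of the fix, assuming option 1 amounts to a smaller (or configurable) page size, here is a sketch of paging with search_after instead of a single 10000-doc page; the sort field and sizes are placeholders, not the implemented change:

```python
from opensearchpy import OpenSearch


def iterate_docs(client: OpenSearch, index_name: str, page_size: int = 500):
    """Page through an index with search_after and a reduced page size.
    At roughly 45 kB/doc (GEO), 500 docs is ~22 MB per response, versus
    ~450 MB for the default 10000-doc page."""
    # Assumes a unique, sortable keyword field exists as a tiebreaker;
    # adjust to the actual index mapping.
    sort = [{"lidvid": "asc"}]
    search_after = None
    while True:
        body = {"size": page_size, "sort": sort, "query": {"match_all": {}}}
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(index=index_name, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```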
@sjoshi-jpl initial run of the sweepers against a 2M-product database should be on the order of 4 hrs, says my napkin, so expect a period of container execution timeout failures. They should resolve by tomorrow or the next day, though.
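For transparency, the 4-hour figure is just napkin math, i.e. the stated numbers back-derived into an implied throughput rather than a measurement:

```python
# 2M products in ~4 hours implies a sustained rate of roughly
#   2_000_000 docs / (4 * 3600 s) ≈ 139 docs/second
# across fetch + repairkit checks + bulk updates.
total_docs = 2_000_000
target_hours = 4
docs_per_second = total_docs / (target_hours * 3600)
print(f"implied throughput: {docs_per_second:.0f} docs/s")  # ~139 docs/s
```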
💡 Description
Refs NASA-PDS/registry-api#349