
Deploy repairkit sweeper to delta and prod #61

Closed
jordanpadams opened this issue Aug 16, 2023 · 22 comments · Fixed by #77

Labels: B14.0, i&t.skip (Skip I&T of this task/ticket), task

Comments

@jordanpadams (Member)

💡 Description

Refs NASA-PDS/registry-api#349

@jordanpadams added the B14.0, task, and i&t.skip (Skip I&T of this task/ticket) labels on Aug 16, 2023
@jordanpadams changed the title from "Deploy repairkit sweepers to delta and prod" to "Deploy repairkit sweeper to delta and prod" on Aug 16, 2023
@jordanpadams transferred this issue from NASA-PDS/registry-api on Aug 16, 2023
@sjoshi-jpl (Contributor)

@alexdunnjpl are we just deploying a new registry-sweeper image to prod? Or are there other steps that need to be completed for this task?

@alexdunnjpl (Contributor) commented Aug 16, 2023

@sjoshi-jpl yeah, just a standard ad-hoc redeployment, then checking to make sure it executes successfully in prod

I'll push the image now

@alexdunnjpl (Contributor)

Image is pushed, @sjoshi-jpl to confirm that tasks successfully execute.

@sjoshi-jpl do we already have a deployment targeting delta OpenSearch, or just prod?

@sjoshi-jpl (Contributor) commented Aug 17, 2023

@alexdunnjpl @tloubrieu-jpl after running the tasks multiple times for each domain, here are the findings:

  1. ATM and GEO nodes are timing out with 504 Gateway Timeout errors (even after multiple tries).
  2. The IMG node took 1 hr 54 mins to complete.
  3. The SBNPSI and RMS nodes have been running for over 2 hours; neither is getting past the repairkit step.
  4. All other nodes are completing within the 1-hour window without errors.

@alexdunnjpl right now we're not running anything against the delta cluster, but we could create a task definition with the newly pushed image to test in delta. Does this answer your question?

@tloubrieu-jpl (Member)

Some nodes take too long to process (IMG, PSI, RMS).

@sjoshi-jpl (Contributor)

Update:

  1. ATM / GEO are still returning 504 errors. ATM has an issue with a missing ScrollId.
  2. IMG has been running for close to 2 hours.
  3. All other tasks completed in under 1 hour.

@jordanpadams (Member, Author)

@nutjob4life can you chat with @sjoshi-jpl and try to help debug the 504 issues he is seeing on those 2 registries?

@nutjob4life (Member)

@jordanpadams will do. @sjoshi-jpl, I'll hit you up on Slack

@nutjob4life (Member)

FYI, I met with @sjoshi-jpl to debug and brainstorm what's going on here. We decided to roll ATM and GEO back one image (although those images were untagged in ECR, they thankfully still had unique URIs, and the AWS task definition service lets you specify an image by URI) and manually launched the sweepers for those two nodes.

They worked fine, so the issue seems to be related to the RepairKit additions. I'm going to review those commits with a closer eye.
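
(For reference, a minimal boto3 sketch of what pinning a task definition revision to a previous image URI might look like. The family name, container index, and digest URI below are hypothetical; the actual roll-back was done through the ECS console/task definition UI.)

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical digest URI for the previous, now-untagged image in ECR.
PREVIOUS_IMAGE = (
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/registry-sweepers@sha256:..."
)

# Read the current task definition, swap the container image for the previous
# URI, and register the result as a new revision (other settings carried over).
current = ecs.describe_task_definition(taskDefinition="registry-sweepers-atm")["taskDefinition"]
containers = current["containerDefinitions"]
containers[0]["image"] = PREVIOUS_IMAGE

ecs.register_task_definition(
    family=current["family"],
    containerDefinitions=containers,
    requiresCompatibilities=current["requiresCompatibilities"],
    cpu=current["cpu"],
    memory=current["memory"],
    networkMode=current["networkMode"],
    executionRoleArn=current["executionRoleArn"],
    taskRoleArn=current["taskRoleArn"],
)
```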

@alexdunnjpl (Contributor)

@nutjob4life I'm like... >80% sure the issue would be resolved by streaming updates through the bulk write call and letting the write function handle flushing, rather than making one bulk write call per doc update.

That, however, requires an update to that function's interface: it really should take an iterable of update objects/dicts, so callers can throw a lazy/generator expression at it. Minor changes to the other two sweepers will be necessary to reflect such a change, which is why I didn't just do it as a quick addendum to #54.

Happy to take that on if that's easier, since I'm waiting on comms for my other high-priority ticket.
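
A minimal sketch of the proposed interface, assuming the opensearch-py `streaming_bulk` helper; the function name `write_updated_docs` and the update-dict shape are illustrative, not necessarily what the sweepers code uses:

```python
from typing import Iterable, Mapping

from opensearchpy import OpenSearch
from opensearchpy.helpers import streaming_bulk


def write_updated_docs(client: OpenSearch, index: str,
                       updates: Iterable[Mapping]) -> None:
    """Stream partial-document updates through a single bulk pipeline.

    `updates` is any iterable (including a generator) of {"_id": ..., "doc": ...}
    mappings, so callers never buffer the whole set in memory or issue one bulk
    request per document.
    """
    actions = (
        {"_op_type": "update", "_index": index, "_id": u["_id"], "doc": u["doc"]}
        for u in updates
    )
    # streaming_bulk lazily consumes `actions` and flushes writes in chunks.
    for ok, result in streaming_bulk(client, actions, chunk_size=500, raise_on_error=False):
        if not ok:
            print(f"bulk update failed: {result}")


# Usage with a lazy generator of repairkit fixes, e.g.:
# write_updated_docs(client, "registry",
#                    ({"_id": doc_id, "doc": fixes} for doc_id, fixes in repairs))
```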

@sjoshi-jpl (Contributor)

@nutjob4life @alexdunnjpl Since last week, the PSA node has been throwing CPU/memory alerts and consuming most of the compute allocated to the task. I increased it from 1 vCPU / 4 GB to 2 vCPU / 16 GB, but memory utilization is still over 95%.

@tloubrieu-jpl (Member)

@nutjob4life tried the OpenSearch Python bulk API without success.

@sjoshi-jpl (Contributor)

Per yesterday's conversation with the team, @alexdunnjpl and @nutjob4life will be implementing the bulk-update changes, after which we will need to re-test all nodes to ensure the issues with ATM, GEO, and PSA are resolved.

@sjoshi-jpl (Contributor) commented Sep 5, 2023

Update:

After testing the bulk update, the ATM node is completing successfully.

PSA - still needs 4 vCPU and 30 GB RAM to complete.
GEO - running for longer than 3 hours; had to stagger the task to run every 5 hours for it to complete.

@sjoshi-jpl (Contributor)

I've opened DSIO #4457 to enable slow logs and help with further troubleshooting.

@jordanpadams (Member, Author) commented Sep 5, 2023

  • 504 errors are most likely due to query timeouts
  • recommend updating repairkit to add the repairkit version to each product's metadata as part of the repairkit run (see the sketch below)
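
A rough sketch of what that version stamping could look like, building on the streaming update interface sketched earlier; the `repairkit_version` field name and `REPAIRKIT_VERSION` constant are illustrative, not actual registry schema names:

```python
# Illustrative constant; in practice this would come from the sweepers' package version.
REPAIRKIT_VERSION = "1.0.0"


def needs_repair(doc: dict) -> bool:
    # Skip documents already stamped with the current repairkit version.
    return doc.get("repairkit_version") != REPAIRKIT_VERSION


def to_update(doc_id: str, fixes: dict) -> dict:
    # Fold the version stamp into the same partial update as the repairkit fixes,
    # so the stamp costs no extra write.
    return {"_id": doc_id, "doc": {**fixes, "repairkit_version": REPAIRKIT_VERSION}}
```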

@tloubrieu-jpl (Member)

#70 is going to be the solution for this ticket

@tloubrieu-jpl (Member)

Some errors remain; @sjoshi-jpl and @alexdunnjpl will discuss them.

@tloubrieu-jpl (Member)

Remaining errors are due to a lack of resources on ECS.

@alexdunnjpl (Contributor)

Clarification: ATM/GEO errors are suspected to be due to insufficient ECS instance sizing. @sjoshi-jpl has submitted an SA ticket to resize, the SAs have actioned it, and results should be available by COB today.

@alexdunnjpl (Contributor)

GEO errors (and probably ATM's too - need to confirm) have been narrowed down to the fact that its documents are huge compared to other nodes': a page of 1000 docs returns ~45 MB, so the default scroll page size of 10000 docs causes internal overflows.

[2023-09-19T16:29:46,799][WARN ][r.suppressed             ] [2a6f484c833c0bd8c7f96d4b9c4475f6] path: __PATH__ params: {size=10000, scroll=10m, index=registry, _source_excludes=, _source_includes=}
java.lang.ArithmeticException: integer overflow
	at __PATH__(Math.java:909)
	at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618)
	at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)
	at org.opensearch.common.bytes.BytesArray.<init>(BytesArray.java:50)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:86)
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
	at org.opensearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:518)
	at org.opensearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:50)
	at org.opensearch.rest.action.RestActionListener.onResponse(RestActionListener.java:60)
	at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:110)
	at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:104)
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:103)
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:97)
	at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionListener.onResponse(PerformanceAnalyzerActionListener.java:76)
	at org.opensearch.action.support.TimeoutTaskCancellationUtility$TimeoutRunnableListener.onResponse(TimeoutTaskCancellationUtility.java:106)
	at org.opensearch.action.ActionListener$5.onResponse(ActionListener.java:262)
	at org.opensearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:574)
	at org.opensearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:132)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:377)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:371)
	at org.opensearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:243)
	at org.opensearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:125)
	at org.opensearch.action.search.CountedCollector.countDown(CountedCollector.java:64)
	at org.opensearch.action.search.ArraySearchPhaseResults.consumeResult(ArraySearchPhaseResults.java:59)
	at org.opensearch.action.search.CountedCollector.onResult(CountedCollector.java:72)
	at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:195)
	at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:190)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:58)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
	at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:413)
	at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:658)
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:306)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1207)
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:266)
	at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:258)
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:146)
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1533)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1282)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1329)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at __PATH__(Thread.java:829)

There are a few potential options for resolution:

  1. Size the repairkit scroll page according to the big-document nodes' constraints (see the sketch at the end of this comment). In theory this slows down all sweepers, but it shouldn't be an issue since it only affects work done on products harvested since the last sweepers run. Near-zero implementation effort/time.

  2. Incorporate dynamic page sizing into the retry backoff. This would improve resilience, but it introduces some potential for future confusion when a dev thinks 10k-doc pages are being requested while the code is dynamically adjusting the size under the hood and chaining pages together. This shouldn't be a first resort, imho.

  3. Add MAX_FULL_DOC_REQUEST_COUNT (or similar) as an env var or CLI argument which, if present, constrains the page size for the relevant sweepers. This allows a more targeted constraint than the first option, but adds a little complexity and requires some dev effort to do cleanly, which might not be justified by the theoretical benefit over the first option.

I'll implement option 1 after testing properly against GEO and ATM, and we can revisit later if additional flexibility is needed.
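
For illustration, a rough sketch of options 1 and 3 on the query side, assuming opensearch-py scroll paging. `REPAIRKIT_PAGE_SIZE_DEFAULT` is an illustrative name and `MAX_FULL_DOC_REQUEST_COUNT` is the env var floated in option 3; neither is an existing registry-sweepers setting:

```python
import os
from typing import Iterator

from opensearchpy import OpenSearch

# Option 1: a fixed page size small enough for the big-document nodes
# (~45 MB per 1000 docs), instead of the default 10000.
REPAIRKIT_PAGE_SIZE_DEFAULT = 1000


def resolve_page_size() -> int:
    # Option 3: an optional env-var override for targeted per-deployment tuning.
    return int(os.environ.get("MAX_FULL_DOC_REQUEST_COUNT", REPAIRKIT_PAGE_SIZE_DEFAULT))


def scroll_all_docs(client: OpenSearch, index: str = "registry") -> Iterator[dict]:
    """Yield full documents page by page using the scroll API."""
    page_size = resolve_page_size()
    resp = client.search(index=index, scroll="10m", size=page_size,
                         body={"query": {"match_all": {}}})
    scroll_id = resp.get("_scroll_id")
    hits = resp["hits"]["hits"]
    while hits:
        yield from hits
        resp = client.scroll(scroll_id=scroll_id, scroll="10m")
        scroll_id = resp.get("_scroll_id")
        hits = resp["hits"]["hits"]
    if scroll_id:
        client.clear_scroll(scroll_id=scroll_id)
```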

@alexdunnjpl (Contributor)

@sjoshi-jpl initial run of the sweepers against a 2M-product database should be on the order of 4hrs, says my napkin, so expect a period of container execution timeout failures. They should resolve by tomorrow or the next day, though.
