
Deploy repairkit sweeper to delta and prod #61

Closed
jordanpadams opened this issue Aug 16, 2023 · 22 comments · Fixed by #77

Labels: B14.0, i&t.skip (Skip I&T of this task/ticket), task

Comments

@jordanpadams (Member)

💡 Description

Refs NASA-PDS/registry-api#349

@jordanpadams added the B14.0, task, and i&t.skip (Skip I&T of this task/ticket) labels on Aug 16, 2023
@jordanpadams changed the title from "Deploy repairkit sweepers to delta and prod" to "Deploy repairkit sweeper to delta and prod" on Aug 16, 2023
@jordanpadams transferred this issue from NASA-PDS/registry-api on Aug 16, 2023
@sjoshi-jpl (Contributor)

@alexdunnjpl are we just deploying a new registry-sweeper image to prod? Or are there other steps that need to be completed for this task?

@alexdunnjpl (Contributor) commented Aug 16, 2023

@sjoshi-jpl yeah, just a standard ad-hoc redeployment, then checking to make sure it executes successfully in prod

I'll push the image now

@alexdunnjpl (Contributor)

Image is pushed, @sjoshi-jpl to confirm that tasks successfully execute.

@sjoshi-jpl do we already have a deployment targeting delta OpenSearch, or just prod?

@sjoshi-jpl (Contributor) commented Aug 17, 2023

@alexdunnjpl @tloubrieu-jpl after running the tasks multiple times for each domain, here are the findings:

  1. ATM and GEO nodes are timing out with 504 Gateway Timeout errors (even after multiple tries).
  2. The IMG node took 1 hr 54 mins to complete.
  3. The SBNPSI and RMS nodes have been running for over 2 hours; neither is getting past the repairkit step.
  4. All other nodes are completing within the 1-hour window without errors.

@alexdunnjpl right now we're not running anything against the delta cluster, but we could create a task definition with the newly pushed image to test in delta. Does this answer your question?

@tloubrieu-jpl (Member)

Some nodes take too long to process (IMG, PSI, RMS).

@sjoshi-jpl (Contributor)

Update:

  1. ATM / GEO are still returning 504 errors. ATM has an issue with a missing ScrollId.
  2. IMG has been running for close to 2 hours.
  3. All other tasks completed in under 1 hour.

@jordanpadams (Member, Author)

@nutjob4life can you chat with @sjoshi-jpl and try to help debug the 504 issues he is seeing on those 2 registries?

@nutjob4life (Member)

@jordanpadams will do. @sjoshi-jpl, I'll hit you up on Slack

@nutjob4life (Member)

FYI, I met with @sjoshi-jpl to debug and brainstorm what's going on here. We decided to roll ATM and GEO back one image (although those images were untagged in ECR, they thankfully still had unique URIs, and the AWS task definition service lets you specify an image by URI) and manually launched the sweepers for those two nodes.

They worked fine, so the issue seems to be related to the RepairKit additions. I'm going to review those commits with a closer eye.
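
(For reference, a minimal boto3 sketch of what pinning a task definition revision to a previous image URI might look like. The family name, container index, and digest URI below are hypothetical; the actual roll-back was done through the ECS console/task definition UI.)

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical digest URI for the previous, now-untagged image in ECR.
PREVIOUS_IMAGE = (
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/registry-sweepers@sha256:..."
)

# Read the current task definition, swap the container image for the previous
# URI, and register the result as a new revision (other settings carried over).
current = ecs.describe_task_definition(taskDefinition="registry-sweepers-atm")["taskDefinition"]
containers = current["containerDefinitions"]
containers[0]["image"] = PREVIOUS_IMAGE

ecs.register_task_definition(
    family=current["family"],
    containerDefinitions=containers,
    requiresCompatibilities=current["requiresCompatibilities"],
    cpu=current["cpu"],
    memory=current["memory"],
    networkMode=current["networkMode"],
    executionRoleArn=current["executionRoleArn"],
    taskRoleArn=current["taskRoleArn"],
)
```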

@alexdunnjpl (Contributor)

@nutjob4life I'm like... >80% sure the issue would be resolved by streaming updates through the bulk write call and letting the write function handle flushing, rather than making one bulk write call per doc update.

That, however, requires an update to that function's interface: it really should take an iterable of update objects/dicts, so callers can throw a lazy/generator expression at it. Minor changes to the other two sweepers will be necessary to reflect such a change, which is why I didn't just do it as a quick addendum to #54.

Happy to take that on if that's easier, since I'm waiting on comms for my other high-priority ticket.
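
A minimal sketch of the proposed interface, assuming the opensearch-py `streaming_bulk` helper; the function name `write_updated_docs` and the update-dict shape are illustrative, not necessarily what the sweepers code uses:

```python
from typing import Iterable, Mapping

from opensearchpy import OpenSearch
from opensearchpy.helpers import streaming_bulk


def write_updated_docs(client: OpenSearch, index: str,
                       updates: Iterable[Mapping]) -> None:
    """Stream partial-document updates through a single bulk pipeline.

    `updates` is any iterable (including a generator) of {"_id": ..., "doc": ...}
    mappings, so callers never buffer the whole set in memory or issue one bulk
    request per document.
    """
    actions = (
        {"_op_type": "update", "_index": index, "_id": u["_id"], "doc": u["doc"]}
        for u in updates
    )
    # streaming_bulk lazily consumes `actions` and flushes writes in chunks.
    for ok, result in streaming_bulk(client, actions, chunk_size=500, raise_on_error=False):
        if not ok:
            print(f"bulk update failed: {result}")


# Usage with a lazy generator of repairkit fixes, e.g.:
# write_updated_docs(client, "registry",
#                    ({"_id": doc_id, "doc": fixes} for doc_id, fixes in repairs))
```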

@sjoshi-jpl (Contributor)

@nutjob4life @alexdunnjpl Since last week, the PSA node has been throwing CPU/memory alerts and consuming most of the compute allocated to the task. I increased it from 1 vCPU / 4 GB to 2 vCPU / 16 GB, but memory utilization is still over 95%.

@tloubrieu-jpl (Member)

@nutjob4life tried the OpenSearch Python bulk API without success.

@sjoshi-jpl (Contributor)

Per yesterday's conversation with the team, @alexdunnjpl and @nutjob4life will be implementing the bulk-update changes, after which we will need to re-test all nodes to ensure the issues with ATM, GEO, and PSA are resolved.

@sjoshi-jpl (Contributor) commented Sep 5, 2023

Update:

After testing the bulk update, the ATM node is completing successfully.

PSA - still needs 4 vCPU and 30 GB RAM to complete.
GEO - running for longer than 3 hours; had to stagger the task to run every 5 hours for it to complete.

@sjoshi-jpl (Contributor)

I've opened DSIO #4457 to enable slow logs and help with further troubleshooting.

@jordanpadams (Member, Author) commented Sep 5, 2023

  • 504 errors are most likely due to query timeouts
  • recommend updating repairkit to add the repairkit version to each product's metadata as part of the repairkit run (see the sketch below)
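
A rough sketch of what that version stamping could look like, building on the streaming update interface sketched earlier; the `repairkit_version` field name and `REPAIRKIT_VERSION` constant are illustrative, not actual registry schema names:

```python
# Illustrative constant; in practice this would come from the sweepers' package version.
REPAIRKIT_VERSION = "1.0.0"


def needs_repair(doc: dict) -> bool:
    # Skip documents already stamped with the current repairkit version.
    return doc.get("repairkit_version") != REPAIRKIT_VERSION


def to_update(doc_id: str, fixes: dict) -> dict:
    # Fold the version stamp into the same partial update as the repairkit fixes,
    # so the stamp costs no extra write.
    return {"_id": doc_id, "doc": {**fixes, "repairkit_version": REPAIRKIT_VERSION}}
```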

@tloubrieu-jpl (Member)

#70 is going to be the solution for this ticket

@tloubrieu-jpl (Member)

Some errors remain; @sjoshi-jpl and @alexdunnjpl will discuss them.

@tloubrieu-jpl (Member)

Remaining errors are due to a lack of resources on ECS.

@alexdunnjpl (Contributor)

Clarification: ATM/GEO errors are suspected to be due to insufficient ECS instance sizing. @sjoshi-jpl has submitted an SA ticket to resize, the SAs have actioned it, and results should be available by COB today.

@alexdunnjpl (Contributor)

GEO errors (and probably ATM's too - need to confirm) have been narrowed down to the fact that its documents are huge compared to other nodes': a page of 1000 docs returns ~45 MB, so the default scroll page size of 10000 docs causes internal overflows.

[2023-09-19T16:29:46,799][WARN ][r.suppressed             ] [2a6f484c833c0bd8c7f96d4b9c4475f6] path: __PATH__ params: {size=10000, scroll=10m, index=registry, _source_excludes=, _source_includes=}
java.lang.ArithmeticException: integer overflow
	at __PATH__(Math.java:909)
	at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618)
	at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)
	at org.opensearch.common.bytes.BytesArray.<init>(BytesArray.java:50)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:86)
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
	at org.opensearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:518)
	at org.opensearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:50)
	at org.opensearch.rest.action.RestActionListener.onResponse(RestActionListener.java:60)
	at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:110)
	at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:104)
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:103)
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:97)
	at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionListener.onResponse(PerformanceAnalyzerActionListener.java:76)
	at org.opensearch.action.support.TimeoutTaskCancellationUtility$TimeoutRunnableListener.onResponse(TimeoutTaskCancellationUtility.java:106)
	at org.opensearch.action.ActionListener$5.onResponse(ActionListener.java:262)
	at org.opensearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:574)
	at org.opensearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:132)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:377)
	at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:371)
	at org.opensearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:243)
	at org.opensearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:125)
	at org.opensearch.action.search.CountedCollector.countDown(CountedCollector.java:64)
	at org.opensearch.action.search.ArraySearchPhaseResults.consumeResult(ArraySearchPhaseResults.java:59)
	at org.opensearch.action.search.CountedCollector.onResult(CountedCollector.java:72)
	at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:195)
	at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:190)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:58)
	at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
	at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:413)
	at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:658)
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:306)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1207)
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:266)
	at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:258)
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:146)
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1533)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1282)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1329)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at __PATH__(Thread.java:829)

There are a few potential options for resolution:

  1. Size the repairkit scroll page according to the big-document nodes' constraints (see the sketch at the end of this comment). In theory this slows down all sweepers, but it shouldn't be an issue since it only affects work done on products harvested since the last sweepers run. Near-zero implementation effort/time.

  2. Incorporate dynamic page sizing into the retry backoff. This would improve resilience, but it introduces some potential for future confusion when a dev thinks 10k-doc pages are being requested while the code is dynamically adjusting the size under the hood and chaining pages together. This shouldn't be a first resort, imho.

  3. Add MAX_FULL_DOC_REQUEST_COUNT (or similar) as an env var or CLI argument which, if present, constrains the page size for the relevant sweepers. This allows a more targeted constraint than the first option, but adds a little complexity and requires some dev effort to do cleanly, which might not be justified by the theoretical benefit over the first option.

I'll implement option 1 after testing properly against GEO and ATM, and we can revisit later if additional flexibility is needed.
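
For illustration, a rough sketch of options 1 and 3 on the query side, assuming opensearch-py scroll paging. `REPAIRKIT_PAGE_SIZE_DEFAULT` is an illustrative name and `MAX_FULL_DOC_REQUEST_COUNT` is the env var floated in option 3; neither is an existing registry-sweepers setting:

```python
import os
from typing import Iterator

from opensearchpy import OpenSearch

# Option 1: a fixed page size small enough for the big-document nodes
# (~45 MB per 1000 docs), instead of the default 10000.
REPAIRKIT_PAGE_SIZE_DEFAULT = 1000


def resolve_page_size() -> int:
    # Option 3: an optional env-var override for targeted per-deployment tuning.
    return int(os.environ.get("MAX_FULL_DOC_REQUEST_COUNT", REPAIRKIT_PAGE_SIZE_DEFAULT))


def scroll_all_docs(client: OpenSearch, index: str = "registry") -> Iterator[dict]:
    """Yield full documents page by page using the scroll API."""
    page_size = resolve_page_size()
    resp = client.search(index=index, scroll="10m", size=page_size,
                         body={"query": {"match_all": {}}})
    scroll_id = resp.get("_scroll_id")
    hits = resp["hits"]["hits"]
    while hits:
        yield from hits
        resp = client.scroll(scroll_id=scroll_id, scroll="10m")
        scroll_id = resp.get("_scroll_id")
        hits = resp["hits"]["hits"]
    if scroll_id:
        client.clear_scroll(scroll_id=scroll_id)
```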

@alexdunnjpl (Contributor)

@sjoshi-jpl initial run of the sweepers against a 2M-product database should be on the order of 4hrs, says my napkin, so expect a period of container execution timeout failures. They should resolve by tomorrow or the next day, though.
