Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failures: 80_data_frame_jobs_crud #47943

Closed
nknize opened this issue Oct 11, 2019 · 11 comments
Closed

CI Failures: 80_data_frame_jobs_crud #47943

nknize opened this issue Oct 11, 2019 · 11 comments
Assignees
Labels
:ml/Transform Transform >test-failure Triaged test failures from CI

Comments

@nknize
Copy link
Contributor

nknize commented Oct 11, 2019

These tests have failed for quite a while on 7.x:

https://groups.google.com/a/elastic.co/forum/#!searchin/build-elasticsearch/80_data_frame_jobs_crud%7Csort:date

Don't think we can mute yml tests though?

@nknize nknize added >test-failure Triaged test failures from CI :ml/Transform Transform 7x labels Oct 11, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@benwtrent
Copy link
Member

@nknize These failures were fixed with 0ca53bd

Looking at the latest one, it still did not yet have this commit in it. If they fail again after this commit has been included, then something else is up :/

@nknize
Copy link
Contributor Author

nknize commented Oct 11, 2019

@nknize
Copy link
Contributor Author

nknize commented Oct 11, 2019

Looks like those 5 failures are all on the 7.x branch w/ varying BWC_VERSION values. Will check if these CI runs contain that commit

@benwtrent
Copy link
Member

@nknize for sure. I will mute the tests

@benwtrent
Copy link
Member

benwtrent commented Oct 11, 2019

_data_frame/transforms/old-simple-continuous-transform/_stops times out after the previous _start call fails due to a missing index.

This shows TWO bugs, the predicate for _stop was not satisfied even though the task was indeed cancelled and removed, we are failing to create a checkpoint in a mixed cluster.

The first bug has probably a couple of causes:

The error trace (when running in a mixed 7.3.0, 7.5.0 cluster):

[o.e.x.t.t.ClientTransformIndexer] [v7.3.0-0] [old-simple-continuous-transform] failed to create checkpoint.
»  org.elasticsearch.index.IndexNotFoundException: no such index [.transform-internal-003]
»       at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:194) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteWriteIndex(IndexNameExpressionResolver.java:304) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$ConcreteIndices.resolveIfAbsent(TransportBulkAction.java:652) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.addFailureIfIndexIsUnavailable(TransportBulkAction.java:606) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:449) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:632) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$1.lambda$onResponse$0(TransportBulkAction.java:246) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
»       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
»       at java.lang.Thread.run(Thread.java:834) [?:?]
» WARN ][o.e.x.t.t.ClientTransformIndexer] [v7.3.0-0] [old-simple-continuous-transform] transform encountered an exception:
»  java.lang.RuntimeException: Failed to create checkpoint due to no such index [.transform-internal-003]
»       at org.elasticsearch.xpack.transform.transforms.ClientTransformIndexer.lambda$createCheckpoint$15(ClientTransformIndexer.java:502) [transform-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:79) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportSingleItemBulkWriteAction.lambda$wrapBulkResponse$0(TransportSingleItemBulkWriteAction.java:64) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:502) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:632) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$1.lambda$onResponse$0(TransportBulkAction.java:246) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
»       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
»       at java.lang.Thread.run(Thread.java:834) [?:?]
»  Caused by: org.elasticsearch.index.IndexNotFoundException: no such index [.transform-internal-003]
»       at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:194) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteWriteIndex(IndexNameExpressionResolver.java:304) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$ConcreteIndices.resolveIfAbsent(TransportBulkAction.java:652) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.addFailureIfIndexIsUnavailable(TransportBulkAction.java:606) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:449) ~[elasticsearch-7.5.0-SNAPSHOT.jar:7.5.0-SNAPSHOT]
»       ... 7 more

This occurred in the oneThirdUpgrade test.

Don't let the trace fool you, this 7.3.0-0 was upgraded to 7.5.0 and joined the cluster. The current master node was a 7.3.0 node.

The transform task was then assigned to the new 7.5.0 node and it attempted to create a new checkpoint since it is attempting to _start with no previous checkpoint data.

@benwtrent
Copy link
Member

The task is DEFINITELY cancelled and removed as I see these logs later:

[2019-10-11T15:15:48,379][INFO ][o.e.x.t.t.TransformTask  ] [v7.3.0-0] [old-simple-continuous-transform] received cancellation request for transform, state: [STARTED].
[2019-10-11T15:15:48,382][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [v7.3.0-0] Received ban for the parent [r1mDHgeDQ-u1owgSdVIdjA:73] on the node [r1mDHgeDQ-u1owgSdVIdjA], reason: [task has been removed, cancelling locally]
[2019-10-11T15:15:48,387][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [v7.3.0-0] Sending remove ban for tasks with the parent [r1mDHgeDQ-u1owgSdVIdjA:73] to the node [r1mDHgeDQ-u1owgSdVIdjA]
[2019-10-11T15:15:48,387][DEBUG][o.e.c.c.JoinHelper       ] [v7.3.0-0] successfully joined {v7.3.0-2}{NeeUDuSbR3--jpYZGJPaXA}{3-mX42K4R_qhwhfUy_sgCA}{127.0.0.1}{127.0.0.1:50966}{dim}{testattr=test, ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={v7.3.0-0}{r1mDHgeDQ-u1owgSdVIdjA}{37A3FxaURO6E0pbW0ToOeQ}{127.0.0.1}{127.0.0.1:51077}{dilm}{ml.machine_memory=17179869184, upgraded=true, xpack.installed=true, testattr=test, ml.max_open_jobs=20}, optionalJoin=Optional.empty}
[2019-10-11T15:15:48,389][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [v7.3.0-0] Removing ban for the parent [r1mDHgeDQ-u1owgSdVIdjA:73] on the node [r1mDHgeDQ-u1owgSdVIdjA]
[2019-10-11T15:15:48,389][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [v7.3.0-0] Sending remove ban for tasks with the parent [r1mDHgeDQ-u1owgSdVIdjA:73] to the node [NeeUDuSbR3--jpYZGJPaXA]
[2019-10-11T15:15:48,395][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [v7.3.0-0] Sending remove ban for tasks with the parent [r1mDHgeDQ-u1owgSdVIdjA:73] to the node [kifkXVqZQDilOZAH5T3Ctw]

Also, there MAY be another funky issue as I see this in the logs too (from the old nodes):

t io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-10-11T15:16:16,024][WARN ][o.e.a.b.TransportShardBulkAction] [v7.3.0-1] [[.transform-notifications-000001][0]] failed to perform indices:data/write/bulk[s] on replica [.transform-notifications-000001][0], node[r1mDHgeDQ-u1owgSdVIdjA], [R], s[STARTED], a[id=Zw3q_n_UQNSeoNzsSiTyTg]
org.elasticsearch.transport.NodeDisconnectedException: [v7.3.0-0][127.0.0.1:51077][indices:data/write/bulk[s][r]] disconnected
[2019-10-11T15:16:16,028][WARN ][o.e.a.b.TransportShardBulkAction] [v7.3.0-1] [[.transform-notifications-000001][0]] failed to perform indices:data/write/bulk[s] on replica [.transform-notifications-000001][0], node[r1mDHgeDQ-u1owgSdVIdjA], [R], s[STARTED], a[id=Zw3q_n_UQNSeoNzsSiTyTg]
org.elasticsearch.transport.NodeDisconnectedException: [v7.3.0-0][127.0.0.1:51077][indices:data/write/bulk[s][r]] disconnected

@hendrikmuhs
Copy link

Don't let the trace fool you, this 7.3.0-0 was upgraded to 7.5.0 and joined the cluster. The current master node was a 7.3.0 node.

This explains the missing index: The Template upgrader only runs on the master node, if the master node was 7.3 it doesn't know about the new template.

The issue however exists already between (7.2, 7.3) and 7.4, because in 7.4 we changed the index name, too. But some tests do exist or are blacklisted for older versions, so maybe we were "lucky".

The situation is odd, I see 2 solutions:

A do not run the task on a node newer than master (a flavor of that could be a minimum_compat_version check, assuming we do not introduce a new internal index with every minor)
B create the template by the non-master node, e.g. at the start of the transform

@droberts195
Copy link
Contributor

create the template by the non-master node, e.g. at the start of the transform

#46553 was supposed to do this. There must be some situation where it doesn’t do it quickly enough or its attempt to create the template fails.

hendrikmuhs pushed a commit that referenced this issue Oct 15, 2019
add alias for backwards compatibility with 7.4

relates #47943
hendrikmuhs pushed a commit that referenced this issue Oct 21, 2019
blacklist transform tests in mixed cluster prior 7.4, see #48019

fixes #47943
hendrikmuhs pushed a commit that referenced this issue Oct 21, 2019
blacklist transform tests in mixed cluster prior 7.4, see #48019

fixes #47943
@hendrikmuhs
Copy link

fixed with #48247

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml/Transform Transform >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

5 participants