[CI] TransformIT.testStopWaitForCheckpoint #70416

Closed
davidkyle opened this issue Mar 15, 2021 · 2 comments · Fixed by #70461
Assignees
Labels
:ml/Transform Transform Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@davidkyle
Member

Build scan:
https://gradle-enterprise.elastic.co/s/dharwo34gzsem

Repro line:

./gradlew ':x-pack:plugin:transform:qa:multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.transform.integration.TransformIT.testStopWaitForCheckpoint" \
  -Dtests.seed=733ADC0CFE2A526 \
  -Dtests.security.manager=true \
  -Dtests.locale=und \
  -Dtests.timezone=America/Indiana/Vincennes \
  -Druntime.java=11 \
  -Dtests.fips.enabled=true

Reproduces locally?:
No

Applicable branches:
7.x

Failure history:
Rare

Failure excerpt:

ElasticsearchStatusException[Elasticsearch exception [type=status_exception, reason=Failed to update transform task [transform-wait-for-checkpoint] state value should_stop_at_checkpoint from [true] to [true]]]

I noticed in the logs

java.lang.RuntimeException: Failed to persist transform statistics for transform [transform-wait-for-checkpoint]
	at org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager.lambda$putOrUpdateTransformStoredDoc$14(IndexBasedTransformConfigManager.java:528) [transform-7.13.0-SNAPSHOT.jar:7.13.0-SNAPSHOT]

....
Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [data_frame_transform_state_and_stats-transform-wait-for-checkpoint]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [11] and primary term [1]

Log files uploaded here
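
The VersionConflictEngineException above is what a conditional write reports when the document has moved on since the writer last read it. As a rough illustration only, here is a toy in-memory model of a seqNo-checked write. The DocStore class and its methods are hypothetical and are not Elasticsearch code, and primary terms are left out for brevity.

```java
// Toy model of seqNo-based optimistic concurrency control.
// All names here are hypothetical; this is not Elasticsearch code.
class SeqNoConflictSketch {

    /** Thrown when a write's required seqNo does not match the stored one. */
    static class VersionConflict extends RuntimeException {
        VersionConflict(String message) {
            super(message);
        }
    }

    /** Single-document store that assigns a new seqNo on every successful write. */
    static class DocStore {
        private long seqNo = -1; // -1 means no document has been written yet
        private String source;

        /** Unconditional write; returns the newly assigned seqNo. */
        synchronized long index(String newSource) {
            source = newSource;
            return ++seqNo;
        }

        /** Conditional write; succeeds only if the caller saw the latest seqNo. */
        synchronized long indexIfSeqNo(String newSource, long requiredSeqNo) {
            if (requiredSeqNo != seqNo) {
                throw new VersionConflict("version conflict, required seqNo [" + requiredSeqNo
                    + "]. current document has seqNo [" + seqNo + "]");
            }
            source = newSource;
            return ++seqNo;
        }

        synchronized String source() {
            return source;
        }
    }

    public static void main(String[] args) {
        DocStore store = new DocStore();
        long seen = store.index("initial state");   // both writers read this seqNo
        store.indexIfSeqNo("writer A", seen);       // writer A wins and bumps the seqNo
        try {
            store.indexIfSeqNo("writer B", seen);   // writer B still holds the old seqNo
        } catch (VersionConflict e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the failure above, the required seqNo was 9 while the document was already at 11, which suggests other writes to the state document landed in between.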

@davidkyle davidkyle added >test-failure Triaged test failures from CI :ml/Transform Transform labels Mar 15, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@hendrikmuhs hendrikmuhs self-assigned this Mar 16, 2021
@hendrikmuhs

Another tricky one. From the logs, I think a request for should_stop_at_checkpoint came in while the indexer was shutting down. I recently added safeguards for this situation to the trigger method; it seems this is another place that needs them.

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Mar 16, 2021
… stops

if stopAtCheckpoint has been called in between. This change also fixes a
logging problem and ensures a timeout error gets logged.

fixes elastic#70416
hendrikmuhs pushed a commit that referenced this issue Mar 22, 2021
shouldStopAtCheckpoint tells the transform to stop at the next checkpoint. If
this API is called while a checkpoint is finishing, it can cause a race condition
in state persistence. This is similar to #69551, but in a different place.

With this change, _stop?shouldStopAtCheckpoint=true does not call doSaveState
if the indexer is shutting down, but it still ensures the job stops after the
indexer has shut down. The change also fixes a logging problem and adds error
handling for a timeout during _stop?shouldStopAtCheckpoint=true. Some
logic has been moved from the task to the indexer.

fixes #70416
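
The guard described in this commit message might look roughly like the sketch below. It is an illustration under assumptions, not the code from the fix: the Indexer class, its state enum, and every method name are hypothetical, and the stateSaves counter merely stands in for calls to doSaveState.

```java
// Hypothetical sketch of skipping state persistence while the indexer shuts down.
// Names are illustrative and do not match the actual transform code.
class StopAtCheckpointSketch {

    enum IndexerState { STARTED, INDEXING, STOPPING, STOPPED }

    static class Indexer {
        private IndexerState state = IndexerState.STARTED;
        private boolean shouldStopAtCheckpoint = false;
        private boolean stopRequestedAfterShutdown = false;
        private int stateSaves = 0; // stands in for calls to doSaveState

        synchronized void setState(IndexerState newState) {
            state = newState;
        }

        /**
         * Handles a stop-at-checkpoint request.
         * Returns true if state was persisted, false if persistence was skipped.
         */
        synchronized boolean setStopAtCheckpoint(boolean value) {
            if (state == IndexerState.STOPPING || state == IndexerState.STOPPED) {
                // Shutdown persists its own final state, so a second write here
                // could race with it. Only remember that a stop was requested.
                stopRequestedAfterShutdown = value;
                return false;
            }
            shouldStopAtCheckpoint = value;
            stateSaves++; // the only path that persists state
            return true;
        }

        synchronized boolean shouldStopAtCheckpoint() {
            return shouldStopAtCheckpoint;
        }

        synchronized boolean stopRequestedAfterShutdown() {
            return stopRequestedAfterShutdown;
        }

        synchronized int stateSaves() {
            return stateSaves;
        }
    }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        indexer.setState(IndexerState.STOPPING);
        boolean persisted = indexer.setStopAtCheckpoint(true);
        System.out.println("persisted=" + persisted
            + ", stopRequestedAfterShutdown=" + indexer.stopRequestedAfterShutdown());
    }
}
```

The design point is that the request is never dropped: when persistence is skipped, the flag is still recorded so the job can be stopped once shutdown has completed.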
hendrikmuhs pushed a commit that referenced this issue Apr 6, 2021
…71343)

fixes #70416