[ML] fixes testWatchdog test verifying matcher is interrupted on timeout #62391

Merged
merged 5 commits into elastic:master from test/ml-fix-watchdog-match-test
Sep 16, 2020

Conversation

benwtrent
Member

@benwtrent benwtrent commented Sep 15, 2020

Constructing the timeout checker first and then registering the watcher allows a race condition in the test.

The timeout could be reached before the matcher is added. To prevent the matcher from never being interrupted, a new timedOut value is added to the watcher thread entry. Then, when a new matcher is registered, if the thread has already timed out, we interrupt the matcher immediately.
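As a rough illustration of the idea described above, here is a minimal, deterministic sketch. It is not the Elasticsearch code: `InterruptibleMatcher`, `WatchdogEntrySketch`, `onTimeout` and the field names are hypothetical stand-ins, and the real watchdog entry may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a regex matcher that supports interruption.
final class InterruptibleMatcher {
    private volatile boolean interrupted;

    void interrupt() {
        interrupted = true;
    }

    boolean isInterrupted() {
        return interrupted;
    }
}

// Hypothetical per-thread watchdog entry (illustrative names only).
final class WatchdogEntrySketch {
    private final List<InterruptibleMatcher> matchers = new ArrayList<>();
    private boolean timedOut;

    // Called by the timer once the timeout elapses: remember the timeout
    // and interrupt every matcher registered so far.
    synchronized void onTimeout() {
        timedOut = true;
        for (InterruptibleMatcher matcher : matchers) {
            matcher.interrupt();
        }
    }

    // Called by the worker thread before matching starts. A matcher that
    // arrives after the timeout is interrupted immediately, so it cannot
    // be missed by the timer.
    synchronized void register(InterruptibleMatcher matcher) {
        matchers.add(matcher);
        if (timedOut) {
            matcher.interrupt();
        }
    }
}
```

Both methods take the same lock, so the timer either sees the matcher in the list or the registration sees the timedOut flag. There is no interleaving in which the matcher escapes interruption.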

closes #48861

@benwtrent benwtrent added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.10.0 labels Sep 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

@elasticmachine update branch

@droberts195
Contributor

I think the problem you've identified could affect the production code as well as the test code. But you've fixed the test by introducing a new code path that works well for the test. So the production code is left with a flaw while the test starts working.

It seems like a solution that should work both for testing and production is to say that if a matcher is registered after a timeout then it's instantly interrupted, i.e. change:

        @Override
        public void register(Matcher matcher) {
            WatchDogEntry value = registry.get(Thread.currentThread());
            if (value != null) {
                boolean wasFalse = value.registered.compareAndSet(false, true);
                assert wasFalse;
                value.matchers.add(matcher);
            }
        }

to something like:

        @Override
        public void register(Matcher matcher) {
            WatchDogEntry value = registry.get(Thread.currentThread());
            if (value != null) {
                synchronized (TimeoutChecker.this) {
                    boolean wasFalse = value.registered.compareAndSet(false, true);
                    assert wasFalse;
                    value.matchers.add(matcher);
                    if (TimeoutChecker.this.timeoutExceeded) {
                        matcher.interrupt();
                    }
                }
            }
        }

Then the test can stay more-or-less as it was, but the timeout should be a random value between 10 and 500 to increase the chance of testing both code paths.
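To make the two code paths concrete, here is a self-contained sketch of such a test. Everything in it is an assumption for illustration: `CheckerSketch` and `FakeMatcher` are simplified stand-ins for the real classes, the 10 to 500 range is treated as milliseconds, and the random registration delay is my own addition so that either side can win the race.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class RegisterAfterTimeoutSketch {

    /** Hypothetical stand-in for an interruptible regex matcher. */
    static final class FakeMatcher {
        private volatile boolean interrupted;
        void interrupt() { interrupted = true; }
        boolean isInterrupted() { return interrupted; }
    }

    /** Simplified checker: one lock guards both the flag and the matcher list. */
    static final class CheckerSketch {
        private final List<FakeMatcher> matchers = new ArrayList<>();
        private boolean timeoutExceeded;

        CheckerSketch(ScheduledExecutorService scheduler, long timeoutMillis) {
            scheduler.schedule(this::setTimeoutExceeded, timeoutMillis, TimeUnit.MILLISECONDS);
        }

        private synchronized void setTimeoutExceeded() {
            timeoutExceeded = true;
            matchers.forEach(FakeMatcher::interrupt);
        }

        synchronized void register(FakeMatcher matcher) {
            matchers.add(matcher);
            if (timeoutExceeded) {
                matcher.interrupt(); // registered too late: interrupt right away
            }
        }
    }

    /** Returns true if the matcher ends up interrupted, whichever side won the race. */
    static boolean runOnce(long timeoutMillis, long registerDelayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            CheckerSketch checker = new CheckerSketch(scheduler, timeoutMillis);
            Thread.sleep(registerDelayMillis);
            FakeMatcher matcher = new FakeMatcher();
            checker.register(matcher);
            // Poll for up to five seconds for the interrupt to arrive.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (!matcher.isInterrupted() && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
            return matcher.isInterrupted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long timeout = ThreadLocalRandom.current().nextLong(10, 501);
        long delay = ThreadLocalRandom.current().nextLong(0, 2 * timeout);
        System.out.println("timeout=" + timeout + " delay=" + delay
            + " interrupted=" + runOnce(timeout, delay));
    }
}
```

With a delay shorter than the timeout, the timer interrupts an already registered matcher. With a longer delay, the registration itself performs the interrupt. In both cases `runOnce` should return true.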

What do you think?

@benwtrent
Member Author

> It seems like a solution that should work both for testing and production is to say that if a matcher is registered after a timeout then it's instantly interrupted,

Definitely! I will make the change

@benwtrent
Member Author

run elasticsearch-ci/packaging-sample-windows

Contributor

@droberts195 droberts195 left a comment


LGTM

@benwtrent benwtrent merged commit 4bbd150 into elastic:master Sep 16, 2020
@benwtrent benwtrent deleted the test/ml-fix-watchdog-match-test branch September 16, 2020 12:36
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Sep 16, 2020
…out (elastic#62391)

benwtrent added a commit that referenced this pull request Sep 16, 2020
…out (#62391) (#62447)

Labels
:ml Machine learning >test Issues or PRs that are addressing/adding tests v7.10.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] TimeoutCheckerTests.testWatchdog failing regularly