e2e: TestRule is flaky. #3425

Closed
bwplotka opened this issue Nov 9, 2020 · 8 comments

Comments

bwplotka (Member) commented Nov 9, 2020

Skipped it for now. See: https://github.com/thanos-io/thanos/pull/3410/checks?check_run_id=1374890903

stale bot commented Jan 9, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jan 9, 2021
stale bot commented Jan 23, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jan 23, 2021
onprem (Member) commented Apr 8, 2021

Looks like we are still experiencing this.

onprem reopened this Apr 8, 2021
onprem removed the stale label Apr 8, 2021
idoqo (Contributor) commented Apr 20, 2021

Part of the logs from https://github.com/thanos-io/thanos/pull/4057/checks?check_run_id=2342647539

level=error ts=2021-04-14T11:49:21.001349623Z msg="function failed. Retrying in next tick" err="unexpected result size, expected 3; result 2: ALERTS{alertname=\"TestAlert_WarnOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400960.999]\nALERTS{alertname=\"TestAlert_AbortOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400960.999]"
level=error ts=2021-04-14T11:49:22.00008212Z msg="function failed. Retrying in next tick" err="unexpected result size, expected 3; result 2: ALERTS{alertname=\"TestAlert_WarnOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400961.998]\nALERTS{alertname=\"TestAlert_AbortOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400961.998]"
level=error ts=2021-04-14T11:49:23.000440617Z msg="function failed. Retrying in next tick" err="unexpected result size, expected 3; result 2: ALERTS{alertname=\"TestAlert_WarnOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400962.998]\nALERTS{alertname=\"TestAlert_AbortOnPartialResponse\", alertstate=\"firing\", replica=\"1\", severity=\"page\"} => 1 @[1618400962.998]"
=== CONT  TestRule
Error:     rule_test.go:328: query_test.go:437:
        
         unexpected error: unexpected result size, expected 3; result 2: ALERTS{alertname="TestAlert_WarnOnPartialResponse", alertstate="firing", replica="1", severity="page"} => 1 @[1618400962.998]
        ALERTS{alertname="TestAlert_AbortOnPartialResponse", alertstate="firing", replica="1", severity="page"} => 1 @[1618400962.998]
        
11:49:23 Killing querier-1
11:49:23 rule-1: level=error name=rule-1 ts=2021-04-14T11:49:23.874065566Z caller=rule.go:799 component=rules err="read query instant response: perform POST request against http://e2e_test_rule-querier-1:8080/api/v1/query: Post \"http://e2e_test_rule-querier-1:8080/api/v1/query\": read tcp 172.26.0.4:54624->172.26.0.5:8080: read: connection reset by peer" query=absent(some_metric)
11:49:23 rule-1: level=warn name=rule-1 ts=2021-04-14T11:49:23.874122466Z caller=manager.go:598 component=rules group=example_warn msg="Evaluating rule failed" rule="alert: TestAlert_WarnOnPartialResponse\nexpr: absent(some_metric)\nlabels:\n  severity: page\nannotations:\n  summary: I always complain and allow partial response in query.\n" err="no query API server reachable"
11:49:23 rule-1: level=error name=rule-1 ts=2021-04-14T11:49:23.922053646Z caller=rule.go:799 component=rules err="read query instant response: perform POST request against http://e2e_test_rule-querier-1:8080/api/v1/query: Post \"http://e2e_test_rule-querier-1:8080/api/v1/query\": dial tcp 172.26.0.5:8080: connect: connection refused" query=absent(some_metric)
11:49:23 rule-1: level=warn name=rule-1 ts=2021-04-14T11:49:23.922100746Z caller=manager.go:598 component=rules group=example_abort msg="Evaluating rule failed" rule="alert: TestAlert_AbortOnPartialResponse\nexpr: absent(some_metric)\nlabels:\n  severity: page\nannotations:\n  summary: I always complain, but I don't allow partial response in query.\n" err="no query API server reachable"
11:49:23 rule-1: level=error name=rule-1 ts=2021-04-14T11:49:23.965971529Z caller=rule.go:799 component=rules err="read query instant response: perform POST request against http://e2e_test_rule-querier-1:8080/api/v1/query: Post \"http://e2e_test_rule-querier-1:8080/api/v1/query\": dial tcp 172.26.0.5:8080: connect: connection refused" query=absent(some_metric)
11:49:23 rule-1: level=warn name=rule-1 ts=2021-04-14T11:49:23.966393128Z caller=manager.go:598 component=rules group=example_warn msg="Evaluating rule failed" rule="alert: TestAlert_WarnOnPartialResponse\nexpr: absent(some_metric)\nlabels:\n  severity: page\nannotations:\n  summary: I always complain and allow partial response in query.\n" err="no query API server reachable"
11:49:24 rule-1: level=error name=rule-1 ts=2021-04-14T11:49:24.102304474Z caller=rule.go:799 component=rules err="read query instant response: perform POST request against http://e2e_test_rule-querier-1:8080/api/v1/query: Post \"http://e2e_test_rule-querier-1:8080/api/v1/query\": dial tcp: lookup e2e_test_rule-querier-1 on 127.0.0.11:53: server misbehaving" query=absent(some_metric)
11:49:24 rule-1: level=warn name=rule-1 ts=2021-04-14T11:49:24.102342774Z caller=manager.go:598 component=rules group=example_warn msg="Evaluating rule failed" rule="alert: TestAlert_WarnOnPartialResponse\nexpr: absent(some_metric)\nlabels:\n  severity: page\nannotations:\n  summary: I always complain and allow partial response in query.\n" err="no query API server reachable"
11:49:24 Killing rule-1
11:49:24 Killing alertmanager-2
11:49:24 Killing alertmanager-1
--- FAIL: TestRule (181.92s)
    --- PASS: TestRule/no_query_configured (0.01s)
    --- PASS: TestRule/attach_query (10.67s)
FAIL
FAIL	github.com/thanos-io/thanos/test/e2e	272.768s
?   	github.com/thanos-io/thanos/test/e2e/e2ethanos	[no test files]
FAIL
Error: make: *** [Makefile:243: test-e2e] Error 1
Error: Process completed with exit code 2.

idoqo (Contributor) commented May 9, 2021

Occasionally, the flakiness shows up because the ruler is filtered out of the list of matching Store APIs for a given query, with the error No StoreAPIs matched for this query. Here is part of the log from https://github.com/thanos-io/thanos/pull/4057/checks?check_run_id=2402575476

 17:01:08 rule-1: level=warn name=rule-1 ts=2021-04-21T17:01:08.764235995Z caller=manager.go:598 component=rules group=example_warn msg="Evaluating rule failed" rule="alert: TestAlert_WarnOnPartialResponse\nexpr: absent(some_metric)\nlabels:\n  severity: page\nannotations:\n  summary: I always complain and allow partial response in query.\n" err="no query API server reachable"
17:01:08 querier-1: level=warn name=querier-1 ts=2021-04-21T17:01:08.770314135Z caller=proxy.go:279 component=proxy request="min_time:1619024168769 max_time:1619024468769 matchers:<name:\"__name__\" value:\"ALERTS\" > aggregates:COUNT aggregates:SUM " err="No StoreAPIs matched for this query" stores="store Addr: e2e_test_rule-rule-1:9091 LabelSets: {replica=\"1\"} Mint: 9223372036854775807 Maxt: 9223372036854775807 filtered out: does not have data within this time period: [1619024168769,1619024468769]. Store time ranges: [9223372036854775807,9223372036854775807]"

Sometimes this only shows up as a warning in the logs; other times the test fails completely as a result.

Not sure what it means, but it looks like the time range advertised by the ruler is, in this case, far outside the time frame the querier expects. For example, here is the output for the two timestamps above:

$ date -d @1619024168769
-----------------------
Wednesday, April 21, 2021 4:56:08.769 PM

vs

$ date -d @9223372036854775807
------------------------------
date: time ‘9223372036854775807’ is out of range
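
For context, the min_time/max_time values in the Store API are Unix-epoch milliseconds, and 9223372036854775807 is math.MaxInt64, so it reads like a "no data advertised yet" sentinel rather than a real timestamp. A minimal, self-contained Go sketch decoding the two values from the log (assuming the millisecond interpretation; this is illustrative helper code, not from the Thanos repo):

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	queryMinTime := int64(1619024168769)       // min_time from the querier log, in milliseconds
	storeMinTime := int64(9223372036854775807) // the ruler's advertised MinTime

	// Prints 2021-04-21 16:56:08.769 +0000 UTC, matching the query window above.
	fmt.Println(time.UnixMilli(queryMinTime).UTC())

	// Prints true: the advertised bound is math.MaxInt64, i.e. not a usable timestamp.
	fmt.Println(storeMinTime == math.MaxInt64)
}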

Commenting out the if statement below removes the error locally.

if mint > storeMaxTime || maxt < storeMinTime {

I wonder if it's possible to modify the time range the ruler (or any StoreAPI) advertises.
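
To make the filtering concrete, here is a simplified, self-contained sketch of the overlap check quoted above (it mirrors the proxy.go condition but is not the exact Thanos implementation): as long as the ruler advertises MinTime = MaxTime = math.MaxInt64, the maxt < storeMinTime branch is always true, so the ruler is dropped from the fan-out for any real query window.

package main

import (
	"fmt"
	"math"
)

// storeMatches mirrors the quoted condition: a store is skipped when its advertised
// [storeMinTime, storeMaxTime] range does not overlap the query's [mint, maxt].
func storeMatches(mint, maxt, storeMinTime, storeMaxTime int64) bool {
	if mint > storeMaxTime || maxt < storeMinTime {
		return false // filtered out: does not have data within this time period
	}
	return true
}

func main() {
	mint, maxt := int64(1619024168769), int64(1619024468769) // query window from the log above

	// A ruler with no advertised data yet: both bounds are math.MaxInt64,
	// so maxt < storeMinTime holds and the store never matches.
	fmt.Println(storeMatches(mint, maxt, math.MaxInt64, math.MaxInt64)) // false
}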

stale bot commented Jul 11, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jul 11, 2021
idoqo (Contributor) commented Jul 12, 2021

@yeya24 I think #4057 fixes this, so we can close it now.

stale bot removed the stale label Jul 12, 2021
bwplotka (Member, Author) commented
Let's close this and reopen if it's still happening. Thanks @idoqo!
