roachtest: import/nodeShutdown/worker failed #81353
I'm going to leave the release blocker on this for now. This test starts a job, kills a node, and asserts that the job still runs to completion. In this test run, the job was created using gateway node n2 and was adopted by n4. The test runner killed n3.
The job then failed because of n3's breaker being open on n4:
I don't think this needs to block the release, but it is something we should continue to look into.
roachtest.import/nodeShutdown/worker failed with artifacts on release-22.1 @ 055c8b6bfb804ac6ddbfe1937ddd49d9b2e5eac1:
We would like to retry circuit breaker open errors. In fact, jobs.IsPermanentBulkJobError already looks like it would return false for breaker open errors. But there are actually two circuit breaker packages we use:

github.com/cockroachdb/circuitbreaker
github.com/cockroachdb/cockroach/pkg/util/circuit

Both define ErrBreakerOpen. IsPermanentBulkJobError would only catch errors from one of these packages. Now, we test for both. As a result, ErrBreakerOpen errors emerging from the nodedialer will now be retried.

Fixes cockroachdb#89159
Fixes cockroachdb#85111
Fixes cockroachdb#81353

I may be a bit optimistic that this will fully fix those failures. Success of the job still requires that the retry of the job is successful.

Release note (bug fix): Fix bug that resulted in some retriable errors not being retried during IMPORT.
89261: memo: fix zigzag join stats and costs r=msirek a=msirek

Fixes https://github.com/cockroachlabs/support/issues/1821

Non-covering zigzag join can have a selectivity estimate orders of magnitude lower than competing plans, causing its cost to be underestimated. This can make the optimizer choose zigzag join when there are many qualified rows, which is known to perform poorly. Also, the per-row cost of zigzag join is underestimated so that even if selectivity estimates are accurate, the optimizer may still plan a query using a slow zigzag join.

The selectivity issue is due to a difference between how `buildSelect` and `buildZigZagJoin` in the `statisticsBuilder` treat constraints (a filtered Select from the base table should have the same selectivity as the zigzag join). In `buildSelect`, `filterRelExpr` builds a filtered histogram via `applyFilters` with new `DistinctCount`s, then calculates selectivity on the constrained columns, taking into account which `histCols` already adjusted `DistinctCount`.

```
numUnappliedConjuncts, constrainedCols, histCols :=
    sb.applyFilters(filters, e, relProps, false /* skipOrTermAccounting */)
...
corr := sb.correlationFromMultiColDistinctCounts(constrainedCols, e, s)
s.ApplySelectivity(sb.selectivityFromConstrainedCols(constrainedCols, histCols, e, s, corr))
```

In `buildZigZagJoin`, `applyFilters` is also called, but the information about which columns adjusted stats is not considered:

```
multiColSelectivity, _ := sb.selectivityFromMultiColDistinctCounts(constrainedCols, zigzag, s)
s.ApplySelectivity(multiColSelectivity)
```

The solution is to update `buildZigZagJoin` to match the logic in `filterRelExpr`. This can't be done for zigzag join on inverted indexes because the constraints aren't pushed into the ON clause. Validating zigzag join stats on inverted indexes is left for future work.
The costing issue is simply that seek costs are using `seqIOCostFactor` instead of `randIOCostFactor` like lookup join and inverted join use:

```
cost := memo.Cost(rowCount) * (2*(cpuCostFactor+seqIOCostFactor) + scanCost + filterPerRow)
```

Every time zigzag join zigs or zags and starts a new scan, that initial read is like a random IO and incurs some startup overhead. In fact, profiling has shown it to be quite expensive. The solution is to make the seek cost be at least on par with lookup join by replacing `seqIOCostFactor` with `randIOCostFactor + lookupJoinRetrieveRowCost`. Further fine-tuning may be needed. It may be possible to speed up zigzag join by trying a point lookup to find a match in the other index before starting a new scan. This improvement and refinement of costs could be done simultaneously.

Release note (bug fix): This patch fixes optimizer selectivity and cost estimates of zigzag join in order to prevent query plans from using it when it would perform poorly (when many rows are qualified).

89354: jobs,sql/importer: retry circuit breaker open errors r=dt a=stevendanna

We would like to retry circuit breaker open errors. In fact, jobs.IsPermanentBulkJobError already looks like it would return false for breaker open errors. But there are actually two circuit breaker packages we use:

github.com/cockroachdb/circuitbreaker
github.com/cockroachdb/cockroach/pkg/util/circuit

Both define ErrBreakerOpen. IsPermanentBulkJobError would only catch errors from one of these packages. Now, we test for both. As a result, ErrBreakerOpen errors emerging from the nodedialer will now be retried.

Fixes #89159
Fixes #85111
Fixes #81353

I may be a bit optimistic that this will fully fix those failures. Success of the job still requires that the retry of the job is successful.

Release note (bug fix): Fix bug that resulted in some retriable errors not being retried during IMPORT.
Co-authored-by: Mark Sirek <[email protected]> Co-authored-by: Steven Danna <[email protected]>
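For the zigzag join costing change in 89261 above, the effect of the substitution can be sketched with plain floats. This is only an illustration under stated assumptions: the constant values below are placeholders chosen for the example, not values taken from the optimizer, and `zigzagCostOld` and `zigzagCostNew` are hypothetical helper names. Only the shape of the formula and the substitution of `randIOCostFactor + lookupJoinRetrieveRowCost` for `seqIOCostFactor` come from the commit message.

```go
package main

import "fmt"

// Placeholder cost constants (assumed values, for illustration only).
const (
	cpuCostFactor             = 0.01
	seqIOCostFactor           = 1.0
	randIOCostFactor          = 4.0
	lookupJoinRetrieveRowCost = 2.0
)

// zigzagCostOld mirrors the formula quoted in the commit message, which
// charges each of the two seeks per row at the sequential IO rate.
func zigzagCostOld(rowCount, scanCost, filterPerRow float64) float64 {
	return rowCount * (2*(cpuCostFactor+seqIOCostFactor) + scanCost + filterPerRow)
}

// zigzagCostNew applies the described substitution: each seek is charged
// randIOCostFactor + lookupJoinRetrieveRowCost instead of seqIOCostFactor.
func zigzagCostNew(rowCount, scanCost, filterPerRow float64) float64 {
	seekCost := randIOCostFactor + lookupJoinRetrieveRowCost
	return rowCount * (2*(cpuCostFactor+seekCost) + scanCost + filterPerRow)
}

func main() {
	rowCount, scanCost, filterPerRow := 1000.0, 0.5, 0.01
	fmt.Printf("old cost: %.2f\n", zigzagCostOld(rowCount, scanCost, filterPerRow))
	fmt.Printf("new cost: %.2f\n", zigzagCostNew(rowCount, scanCost, filterPerRow))
}
```

With any positive row count the new cost exceeds the old one by `rowCount * 2 * (randIOCostFactor + lookupJoinRetrieveRowCost - seqIOCostFactor)`, so the penalty grows linearly with the number of qualified rows, which is the case where zigzag join is said to perform poorly.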
roachtest.import/nodeShutdown/worker failed with artifacts on release-22.1 @ f6578028f0b029cbb623bd278c57c6bef85c4835:
Jira issue: CRDB-15229