Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/metamorphic: TestMeta failed #1459

Closed
cockroach-teamcity opened this issue Jan 20, 2022 · 8 comments · Fixed by #1463
Closed

internal/metamorphic: TestMeta failed #1459

cockroach-teamcity opened this issue Jan 20, 2022 · 8 comments · Fixed by #1463

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 20, 2022

internal/metamorphic.TestMeta failed with artifacts on master @ 7f1a70dc4fb56a501489a453f72f37333e9bccd5:

        // INFO: [JOB 443] sstable deleted 000904
        // INFO: [JOB 443] sstable deleted 000905
        // INFO: [JOB 443] sstable deleted 000906
        // INFO: [JOB 443] sstable deleted 000907
        // INFO: [JOB 443] sstable deleted 000914
        // INFO: [JOB 443] sstable deleted 000915
        // INFO: [JOB 443] sstable deleted 000916
        // INFO: [JOB 443] sstable deleted 000917
        // INFO: [JOB 443] sstable deleted 000918
        // INFO: [JOB 443] sstable deleted 000919
        // INFO: [JOB 443] sstable deleted 000920
        // INFO: [JOB 443] sstable deleted 000921
        // INFO: [JOB 443] sstable deleted 000922
        // INFO: [JOB 443] sstable deleted 000923
        // INFO: [JOB 443] sstable deleted 000924
        // INFO: [JOB 443] sstable deleted 000925
        // INFO: [JOB 443] sstable deleted 000926
        // INFO: [JOB 443] sstable deleted 000927
        // INFO: [JOB 443] sstable deleted 000928
        // INFO: [JOB 443] sstable deleted 000929
        // INFO: [JOB 443] sstable deleted 000930
        // INFO: [JOB 443] sstable deleted 000931
        // INFO: [JOB 443] sstable deleted 000932
        // INFO: [JOB 443] sstable deleted 000933
        // INFO: [JOB 443] sstable deleted 000934
        // INFO: [JOB 443] sstable deleted 000937
        // INFO: [JOB 443] sstable deleted 000938
        // INFO: [JOB 451] compacting(default) L5 [000997] (20 K) + L6 [000809] (1.5 K)
        // INFO: [JOB 449] WAL deleted 000982
        // INFO: [JOB 449] sstable deleted 000976
        // INFO: [JOB 449] sstable deleted 000978
        // INFO: [JOB 449] sstable deleted 000980
        // INFO: [JOB 449] sstable deleted 000981
        // INFO: [JOB 449] sstable deleted 000983
        // INFO: [JOB 449] sstable deleted 000984
        // INFO: [JOB 449] sstable deleted 000985
        // INFO: [JOB 449] sstable deleted 000986
        // INFO: [JOB 449] sstable deleted 000990
        // INFO: [JOB 449] sstable deleted 000991
        // INFO: [JOB 449] sstable deleted 000992
        // INFO: [JOB 449] sstable deleted 000993
        // INFO: [JOB 449] sstable deleted 000994
        // INFO: [JOB 449] sstable deleted 000995
        // INFO: [JOB 449] sstable deleted 000996
        // INFO: [JOB 451] compacting: sstable created 000998
        // INFO: [JOB 451] compacted(default) L5 [000997] (20 K) + L6 [000809] (1.5 K) -> L6 [000998] (15 K), in 0.0s (0.0s total), output rate 19 M/s
        // FATAL: checker failed with error: found InternalKey zzqwavxgrec@12#4092,SINGLEDEL in L0.2: fileNum=000831 and InternalKey zzqwavxgrec@12#4627,DEL in %!s(<nil>)
        db.Compact("heacptnep@12", "kiicbzwtpe@16") // <nil> #16137
        // INFO: [JOB 451] sstable deleted 000997
    --- FAIL: TestMeta/random-016 (0.79s)
Help

To reproduce, try:

 go test -mod=vendor -tags 'invariants' -timeout 0 -test.v -run TestMeta$ ./internal/metamorphic -keep -seed 1642659109724527380 -ops "uniform:5000-25000"

This test on roachdash | Improve this report!

@jbowens
Copy link
Collaborator

jbowens commented Jan 20, 2022

Reproduced locally with

 go test -mod=vendor -tags 'invariants' -timeout 0 -test.v -run TestMeta$ ./internal/metamorphic -keep -seed 1642659109724527380 -ops "uniform:5000-25000"

@jbowens
Copy link
Collaborator

jbowens commented Jan 20, 2022

The panic message looks like it's describing a sequence number inversion bug. Could be a bug in #1432?

@nicktrav
Copy link
Contributor

Confirmed that without #1432 this doesn't happen.

If it helps, I used the following test helper to just run the single variant, which speeds up debugging:

func Test(t *testing.T) {
       runDir := "./_meta/220120-074952.888/random-016" // will be different
       historyPath := "/tmp/hist"
       testMetaRun(t, runDir, 1642659109724527380, historyPath)
}

nicktrav added a commit to nicktrav/pebble that referenced this issue Jan 20, 2022
Currently, if the level checker encounters a violation an the iterator
for the last level has been closed, the debug string providing
information on the level and the file is lost.

Save the debug string for potential use after the iterator is closed.

```
// Before
checker failed with error: found InternalKey zzqwavxgrec@12#4092,SINGLEDEL in L0.2: fileNum=000830 and InternalKey zzqwavxgrec@12#4627,DEL in %!s(<nil>)

// After
checker failed with error: found InternalKey zzqwavxgrec@12#4092,SINGLEDEL in L0.2: fileNum=000830 and InternalKey zzqwavxgrec@12#4627,DEL in L6: fileNum=000997
```

Related to cockroachdb#1459.
@bananabrick
Copy link
Contributor

bananabrick commented Jan 20, 2022

Hi, found another test failure on one of my branches:

--- FAIL: TestMeta (50.01s)
    --- FAIL: TestMeta/random-003 (0.10s)
        meta_test.go:312: 
            ===== SEED =====
            1642705178537754228
            ===== ERR =====
            exit status 1
            ===== OUT =====
            Seed: 0
            checker failed with error: found InternalKey ytug@4#286,DEL in L0.0: fileNum=000044 and InternalKey ytug@4#287,DEL in L6: fileNum=000062
            
            ===== OPTIONS =====
            [Version]
              pebble_version=0.1
            
            [Options]
              bytes_per_sync=2097152
              cache_size=16
              cleaner=delete
              compaction_debt_concurrency=1073741824
              comparer=pebble.internal.testkeys
              delete_range_flush_delay=0s
              disable_wal=false
              flush_split_bytes=2

Don't know if it's the same error, but looking.

@jbowens
Copy link
Collaborator

jbowens commented Jan 20, 2022

This manifest dump clearly shows the sequence number inversion on the key zzqwavxgrec@12:

--- L0.2 ---
  000831:782<#4092-#4092>[zzqwavxgrec@12#4092,SINGLEDEL-zzqwavxgrec@12#4092,SINGLEDEL]
--- L0.1 ---
  000608:793<#2894-#2894>[zzqwavxgrec@12#2894,SET-zzqwavxgrec@12#2894,SET]
--- L0.0 ---
  000913:790<#4370-#4370>[abddymplk@20#4370,MERGE-abddymplk@20#4370,MERGE]
  000942:783<#4439-#4439>[abvukibeofb@13#4439,SET-abvukibeofb@13#4439,SET]
  000538:801<#2526-#2657>[zzqwavxgrec@12#2657,SINGLEDEL-zzqwavxgrec@12#2526,SET]
--- L1 ---
--- L2 ---
--- L3 ---
--- L4 ---
--- L5 ---
--- L6 ---
  000998:15602<#0-#4668>[acutc@6#0,SET-zzqwavxgrec@12#4627,DEL]

The key descended the LSM improperly in an L0->L5 compaction:

1373/7
  next-file-num: 998
  last-seq-num:  4668
  deleted:       L0 000976
  deleted:       L0 000978
  deleted:       L0 000980
  deleted:       L0 000981
  deleted:       L0 000983
  deleted:       L0 000984
  deleted:       L0 000985
  deleted:       L0 000986
  deleted:       L0 000990
  deleted:       L0 000991
  deleted:       L0 000992
  deleted:       L0 000993
  deleted:       L0 000994
  deleted:       L0 000995
  deleted:       L0 000996
  deleted:       L5 000974
  added:         L5 000997:20701<#3934-#4668>[acutc@6#4227,SET-zzqwavxgrec@12#4627,DEL] (2022-01-20T18:15:11Z)

The version edit for the flush that introduced added the key to L0:

973/6
  log-num:       989
  next-file-num: 997
  last-seq-num:  4668
  added:         L0 000990:1247<#4613-#4667>[aiinjp@20#4667,SET-fcklu@5#72057594037927935,RANGEDEL] (2022-01-20T18:15:11Z)
  added:         L0 000991:893<#4632-#4668>[fcklu@5#4668,MERGE-glpw@1#72057594037927935,RANGEDEL] (2022-01-20T18:15:11Z)
  added:         L0 000992:1002<#4632-#4662>[glpw@1#4662,RANGEDEL-mlgxnog@19#72057594037927935,RANGEDEL] (2022-01-20T18:15:11Z)
  added:         L0 000993:915<#4640-#4662>[mlgxnog@19#4662,RANGEDEL-nwnmqtyvjt@5#72057594037927935,RANGEDEL] (2022-01-20T18:15:11Z)
  added:         L0 000994:1446<#4610-#4666>[nwnmqtyvjt@5#4662,RANGEDEL-wmkrrxp@6#72057594037927935,RANGEDEL] (2022-01-20T18:15:11Z)
  added:         L0 000995:1050<#4611-#4664>[wmkrrxp@6#4657,MERGE-yyquzcd@21#4624,SET] (2022-01-20T18:15:11Z)
  added:         L0 000996:803<#4627-#4636>[zslykqao@12#4636,SINGLEDEL-zzqwavxgrec@12#4627,DEL] (2022-01-20T18:15:11Z)

@jbowens
Copy link
Collaborator

jbowens commented Jan 20, 2022

Instrumented to collect the verbose L0Sublevels debug info (L0Sublevels).describe(true):

file count: 20, sublevels: 5, intervals: 30
flush split keys(23): [abddymplk@20, abvukibeofb@13, ajoqjxr@16, dusu@10, duyeldgvnll@21, dygfdczcax@15, ewqqtp@15, fcklu@5, fhcykuix@5, glpw@1, kiati@10, mgksrvk@15, mlgxnog@19, nirnrarzktp@12, nwnmqtyvjt@5, vtocgpw@18, wmkrrxp@6, yyquzcd@21, zaowx@3, zaygjmy@1, zjyqka@1, znnoar@20, zzqwavxgrec@12]
0.4: file count: 5, bytes: 5503, width (mean, max): 3.2, 6, interval range: [4, 19]
	000990:aiinjp@20#4667,1-fcklu@5#72057594037927935,15
	000991:fcklu@5#4668,2-glpw@1#72057594037927935,15
	000992:glpw@1#4662,15-mlgxnog@19#72057594037927935,15
	000993:mlgxnog@19#4662,15-nwnmqtyvjt@5#72057594037927935,15
	000994:nwnmqtyvjt@5#4662,15-wmkrrxp@6#72057594037927935,15
0.3: file count: 3, bytes: 2716, width (mean, max): 4.7, 11, interval range: [8, 28]
	000981:dygfdczcax@15#4609,0-vtocgpw@18#4609,0
	000995:wmkrrxp@6#4657,2-yyquzcd@21#4624,1
	000996:zslykqao@12#4636,7-zzqwavxgrec@12#4627,0
0.2: file count: 4, bytes: 3548, width (mean, max): 2.2, 5, interval range: [11, 28]
	000984:fhcykuix@5#4601,2-kiati@10#4595,2
	000980:mgksrvk@15#4598,0-mgksrvk@15#4598,0
	000985:nirnrarzktp@12#4600,2-zaowx@3#4602,1
	000831:zzqwavxgrec@12#4092,7-zzqwavxgrec@12#4092,7
0.1: file count: 3, bytes: 2552, width (mean, max): 5.3, 14, interval range: [6, 28]
	000983:dusu@10#4603,1-duyeldgvnll@21#4605,1
	000976:ewqqtp@15#4591,15-zaygjmy@1#72057594037927935,15
	000608:zzqwavxgrec@12#2894,1-zzqwavxgrec@12#2894,1
0.0: file count: 5, bytes: 5408, width (mean, max): 4.6, 19, interval range: [0, 28]
	000913:abddymplk@20#4370,2-abddymplk@20#4370,2
	000942:abvukibeofb@13#4439,1-abvukibeofb@13#4439,1
	000978:ajoqjxr@16#4578,2-zjyqka@1#4544,0
	000986:znnoar@20#4604,7-znnoar@20#4604,7
	000538:zzqwavxgrec@12#2657,7-zzqwavxgrec@12#2526,1
compacting file count: 0, base compacting intervals: none

nicktrav added a commit that referenced this issue Jan 20, 2022
Currently, if the level checker encounters a violation an the iterator
for the last level has been closed, the debug string providing
information on the level and the file is lost.

Save the debug string for potential use after the iterator is closed.

```
// Before
checker failed with error: found InternalKey zzqwavxgrec@12#4092,SINGLEDEL in L0.2: fileNum=000830 and InternalKey zzqwavxgrec@12#4627,DEL in %!s(<nil>)

// After
checker failed with error: found InternalKey zzqwavxgrec@12#4092,SINGLEDEL in L0.2: fileNum=000830 and InternalKey zzqwavxgrec@12#4627,DEL in L6: fileNum=000997
```

Related to #1459.
@jbowens
Copy link
Collaborator

jbowens commented Jan 20, 2022

Fix is incoming. Taking the opportunity to convert TestOverlaps into a datadriven test.

jbowens added a commit to jbowens/pebble that referenced this issue Jan 20, 2022
When Version.Overlaps is called for L0, the overlap window is
iteratively expanded until stable. In cockroachdb#1432, the Overlaps function was
adjusted to allow specifying that the end bound should be considered
exclusive. However, cockroachdb#1432 failed to update the exclusivity of the end
bound when the range widened. This improperly excluded files with
largest keys that exactly equaled the new widened end bound.

This commit also transforms the TestOverlaps test into a datadriven
test, introducing a few helpers for parsing the DebugString output of a
Version.

Fix cockroachdb#1459.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jan 21, 2022
74318: tracing: add /debug/tracez rendering the active spans  r=andreimatei a=andreimatei

`/debug/tracez` lets users take a snapshot of the active spans registry
and render the new snapshot, or one of the previously taken snapshots.
The Tracer can hold up to 10 snapshots in memory.

It looks like this:
![Screenshot from 2022-01-04 19-03-39](https://user-images.githubusercontent.com/377201/148140272-306658d5-5b9c-4f2a-b59c-28df9c5ed10c.png)


When visualizing a snapshot, the page lets you do a number of things:
1. List all the spans.
2. See the (current) stack trace for each span's goroutine (if the
   goroutine was still running at the time when the snapshot was
   captured). Stack traces can be toggled visible/hidden.
3. Sort the spans by name or start time.
4. Filter the span according to text search. The search works across
   the name and stack trace.
5. Go from a span to the full trace containing that span.   

For the table Javascript providing sorting and filtering, this patch
embeds the library from https://listjs.com/ .

Limitations:
- for now, only the registry of the local node is snapshotted. In the
  fuiture I'll collect info from all nodes.
- for now, the relationships between different spans are not represented
  in any way. I'll work on the ability to go from a span to the whole
  trace that the span is part of.
- for now, tags and structured and unstructured log messages that a span
  might have are not displayed in any way.

At the moment, span creation is not enabled in production by default
(i.e. the Tracer is put in `TracingModeOnDemand` by default, instead of
the required `TracingModeActiveSpansRegistry`). This patch does not change
that, so in order to benefit from /debug/tracez in all its glory, one
has to run with `COCKROACH_REAL_SPANS=1` for now. Not for long, though.

Release note: None

74867: sql: Support CREATE DATABASE WITH OWNER r=Fenil-P a=Fenil-P

fixes #67817

Release note (sql change): Allow users to specify the owner when creating a database. 
			                      Similar to postgresql: CREATE DATABASE name [ [ WITH ] [ OWNER [=] user_name ]



74871: sql: add a tracing tag with the txn ID r=andreimatei a=andreimatei

This patch adds the txn's ID as a tag to the tracing span representing a
SQL txn. I'm creating a UI to explore the current spans, and this ID
will make it easy to navigate between a query/request blocking on a lock
held by some other txn, and the activity of that other txn.

Release note: None

75114: sql: directly specify columns in TableReader r=RaduBerinde a=RaduBerinde

~Note: the first commit is #74922.~

The internal columns of the TableReader (as well as the row fetcher)
are all the columns of the table, with only a subset of values
actually produced. This design decision has been carried over way past
the point where it makes sense (I admit, it's questionable whether it
ever made sense). For one, "all the columns" is ambiguous (does it
contain non-public columns? does it include system columns?) leading
to various flags and inherent fragility. Second, it relies on the
execution engine to figure out (based on the PostProcessSpec) which
columns are actually needed, which the optimizer already figures out
for us now.

This commit changes the TableReader spec and the interface of
row.Fetcher to always produce a given specific set of column IDs. The
diagram for table readers now specifies the columns by name.

The JoinReader, InvertedJoiner, ZigzagJoiner are not changed in this
commit (but they should be cleaned up as well).

Release note: None


75175: colfetcher: fix the bytes read statistic collection r=yuzefovich a=yuzefovich

During 21.2 release we adjusted the `cFetcher` to be `Close`d eagerly
when it is returning the zero-length batch. This was done in order to
release some references in order for the memory to be GCed sooner;
additionally, the `cFetcher` started being used for the index join where
the fetcher is restarted from scratch for every batch of spans, so it
seemed reasonable to close it automatically.

However, that eager closure broke "bytes read" statistic collection
since the `row.KVFetcher` was responsible for providing it, and we were
zeroing it out. This commit fixes this problem by the `cFetcher`
memorizing the number of bytes it has read in `Close`. Some care needs
to be taken to not double-count the bytes read in the index join, so
a couple of helper methods have been introduced.

Additionally this commit applies the same eager-close optimization to
the `cFetcher` when the last batch is returned (which makes it so that
if we've just exhausted all KVs, we close the fetcher - previously, we
would set the zero length on the batch and might never get into
`stateFinished`).

Fixes: #75128.

Release note (bug fix): Previously, CockroachDB could incorrectly report
`KV bytes read` statistic in `EXPLAIN ANALYZE` output. The bug is
present only in 21.2.x versions.

75215: cmd/github-post: fix Pebble metamorphic reproduction command r=jbowens a=jbowens

When posting a github issue for a Pebble metamorphic test failure, include the
correct `-ops` flag.

Discovered because cockroachdb/pebble#1459 contained a
reproduction command that contained too few ops to reproduce the issue.

Release note: none

75228: logictestccl: skip flaky TestCCLLogic/fakedist-metadata/partitioning_enum r=mgartner a=mgartner

Informs #75227

Release note: None

75237: cli,rpc: don't check the active cluster version in the CLI r=andreimatei a=knz

This commit removes a code path that would tickle an assertion failure
if we were to later fix the context propagation in the RPC heartbeat
method (see PR #71243): there's no "active cluster version" in the CLI
and so we can't compare it in a client interceptor.

Release note: None

75254: scripts: add `dev generate --mirror` to `bump-pebble.sh` script r=jbowens a=nicktrav

CI now expects that dependencies are mirrored to cloud storage and will
fail if the TODO for mirroring the repo is left unaddressed in the
`DEPS.bzl` file.

Add a mirroring step to the `bump-pebble.sh` script.

Release note: none

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Fenil Patel <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
jbowens added a commit to jbowens/pebble that referenced this issue Jan 21, 2022
When Version.Overlaps is called for L0, the overlap window is
iteratively expanded until stable. In cockroachdb#1432, the Overlaps function was
adjusted to allow specifying that the end bound should be considered
exclusive. However, cockroachdb#1432 failed to update the exclusivity of the end
bound when the range widened. This improperly excluded files with
largest keys that exactly equaled the new widened end bound.

This commit also transforms the TestOverlaps test into a datadriven
test, introducing a few helpers for parsing the DebugString output of a
Version.

Fix cockroachdb#1459.
jbowens added a commit that referenced this issue Jan 21, 2022
When Version.Overlaps is called for L0, the overlap window is
iteratively expanded until stable. In #1432, the Overlaps function was
adjusted to allow specifying that the end bound should be considered
exclusive. However, #1432 failed to update the exclusivity of the end
bound when the range widened. This improperly excluded files with
largest keys that exactly equaled the new widened end bound.

This commit also transforms the TestOverlaps test into a datadriven
test, introducing a few helpers for parsing the DebugString output of a
Version.

Fix #1459.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants