
Sequence table GC is slower than user writes, and GC conflicts with user writes #50194

Closed
yangxuanjia opened this issue Jun 15, 2020 · 15 comments · Fixed by #51184
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community

Comments

@yangxuanjia

Describe the problem
I see that GC uses a reverse iterator in the newest version, which is wonderful for sequence tables.
I merged this PR into v19.1, and it works very well; OOM no longer occurs.
But I found another problem: the GC default timeout is 1 minute, and the replica is only enqueued every 7-10 minutes.
The sequence table receives 1k-2k updates per second, but GC can only remove the sequence table's MVCC keys at about 500/s, so a 1-minute run removes about 30k MVCC keys. The result is that GC is slower than user writes when a table uses a sequence (see the back-of-the-envelope sketch after the stats below).

MVCC GC Bytes Age | 7d 20h 53m 37s 450ms | 7d 20h 53m 37s 450ms | 7d 20h 53m 37s 450ms
-- | -- | -- | --
26.0 B / 1 count | 26.0 B / 1 count | 26.0 B / 1 count
454.2 MiB / 1 count | 454.2 MiB / 1 count | 454.2 MiB / 1 count
340.7 MiB / 39692528 count | 340.7 MiB / 39692528 count | 340.7 MiB / 39692528 count
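
To make the gap concrete, here is a small back-of-the-envelope sketch using the numbers above (the 1500/s write rate and the 8-minute enqueue interval are assumptions picked from within the ranges I observed, not measurements from code):

package main

import "fmt"

func main() {
	// Assumed rates, taken from the description above.
	const (
		writeRatePerSec   = 1500.0   // 1k-2k sequence updates per second
		gcRatePerSec      = 500.0    // observed GC throughput on MVCC versions
		gcRunSeconds      = 60.0     // GC queue timeout: 1 minute per run
		gcIntervalSeconds = 8 * 60.0 // replica re-enqueued every 7-10 minutes
	)

	written := writeRatePerSec * gcIntervalSeconds // new versions added between GC runs
	collected := gcRatePerSec * gcRunSeconds       // versions removed per GC run

	// With these numbers the backlog grows by roughly 690k versions per cycle,
	// so GC can never catch up with user writes.
	fmt.Printf("written: %.0f, collected: %.0f, backlog growth: %.0f\n",
		written, collected, written-collected)
}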

I changed the GC queue's default timeout to be dynamic, hoping to GC the sequence table's MVCC keys more frequently:

func makeQueueGcTimeoutFunc() queueProcessTimeoutFunc {
	return func(cs *cluster.Settings, r replicaInQueue) time.Duration {
		minimumTimeout := queueGuaranteedProcessingTimeBudget.Get(&cs.SV)
		// NB: In production code this type assertion will always succeed.
		// Some tests set up a fake implementation of replicaInQueue, in which
		// case we fall back to the configured minimum timeout.
		repl, ok := r.(interface{ GetMVCCStats() enginepb.MVCCStats })
		if !ok {
			return minimumTimeout
		}
		stat := repl.GetMVCCStats()
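		// Heuristic: a sequence's range holds a single live key with many MVCC
		// versions. Scale the timeout with the version count, assuming roughly
		// 500 versions GC'd per second as observed above, and only use it when
		// it exceeds 60 seconds; otherwise fall back to minimumTimeout below.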
		if stat.LiveCount == 1 && stat.KeyCount == 1 && stat.ValCount > 1 {
			timeout := time.Duration(stat.ValCount/500+1) * time.Second
			if timeout.Seconds() > 60 {
				return timeout
			}
		}
		return minimumTimeout
	}
}

But I found there is a conflict between GC of the sequence table's MVCC keys and user writes when the primary key is auto-incremented: insert performance drops sharply while GC is working on the sequence table's MVCC keys.
Is there any way to avoid this conflict?

To Reproduce

  1. The table's primary key is auto-incremented from a sequence.
  2. Insert 60 million records into the table (see the sketch below).
  3. The sequence table accumulates about 40 million MVCC versions.
  4. Change the GC queue's default timeout for the sequence from 1 minute to 1 hour.
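
A minimal Go sketch of the reproduction, under my own assumptions (the connection string, table name, sequence name, and payload column are placeholders; the GC timeout change is the snippet shown earlier):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol.
)

func main() {
	// Placeholder connection string; adjust host, port, and user for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Auto-increment primary key backed by a sequence, as in the report.
	for _, stmt := range []string{
		`CREATE SEQUENCE IF NOT EXISTS t_seq`,
		`CREATE TABLE IF NOT EXISTS t (id INT PRIMARY KEY DEFAULT nextval('t_seq'), payload STRING)`,
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}

	// Every insert bumps the sequence, leaving another MVCC version behind on
	// the sequence's single key. The report used ~60 million rows.
	for i := 0; i < 60_000_000; i++ {
		if _, err := db.Exec(`INSERT INTO t (payload) VALUES ('x')`); err != nil {
			log.Fatal(err)
		}
	}
}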

Expected behavior
You will find that insert performance drops sharply while GC is working on the sequence table's MVCC keys.

Additional data / screenshots
Screenshot from 2020-06-15 10-21-04
Screenshot from 2020-06-15 10-21-21

Environment:

  • CockroachDB version [e.g. 19.1.x]
  • Server OS: [e.g. Linux/Distrib]
  • Client app [e.g. cockroach sql, JDBC, ...]
@blathers-crl

blathers-crl bot commented Jun 15, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @mattcrdb (member of the technical support engineering team)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-oncall labels Jun 15, 2020
@yangxuanjia
Author

@ajwerner
Could GC directly delete the historical keys (with their timestamps) from RocksDB, to avoid conflicting with the intent key in the sequence table?

@yangxuanjia
Author

yangxuanjia commented Jun 15, 2020

@ajwerner
I have a big question about MVCCGarbageCollect() in mvcc.go.
If a sequence table key has 40 million historical versions, gc_queue uses a reverse iterator to find deletable keys in order to avoid OOM, so why does MVCCGarbageCollect use a forward iterator to find the meta key and then iterate from the meta key over 40 million historical versions to find the oldest ones?
I think this is too slow.
Could we, after seeking to the meta key and handling it, follow these steps instead? (A rough sketch of this idea appears after the function below.)

  1. Seek in reverse to gc_key.
  2. Call Prev() to reach the previous key, and record prevNanos for MVCCStats.
  3. Seek to gc_key again, and iterate from gc_key down to the oldest version, clearing as we go.

By the way, this is not the main problem I care about, because I don't think it is the cause of the conflict.

// MVCCGarbageCollect creates an iterator on the ReadWriter. In parallel
// it iterates through the keys listed for garbage collection by the
// keys slice. The iterator is seeked in turn to each listed
// key, clearing all values with timestamps <= to expiration. The
// timestamp parameter is used to compute the intent age on GC.
func MVCCGarbageCollect(
	ctx context.Context,
	rw ReadWriter,
	ms *enginepb.MVCCStats,
	keys []roachpb.GCRequest_GCKey,
	timestamp hlc.Timestamp,
) error {
	// We're allowed to use a prefix iterator because we always Seek() the
	// iterator when handling a new user key.
	iter := rw.NewIterator(IterOptions{Prefix: true})
	defer iter.Close()

	var count int64
	defer func(begin time.Time) {
		log.Eventf(ctx, "done with GC evaluation for %d keys at %.2f keys/sec. Deleted %d entries",
			len(keys), float64(len(keys))*1e9/float64(timeutil.Since(begin)), count)
	}(timeutil.Now())

	// Iterate through specified GC keys.
	meta := &enginepb.MVCCMetadata{}
	for _, gcKey := range keys {
		encKey := MakeMVCCMetadataKey(gcKey.Key)
		ok, metaKeySize, metaValSize, err := mvccGetMetadata(iter, encKey, meta)
		if err != nil {
			return err
		}
		if !ok {
			continue
		}
		inlinedValue := meta.IsInline()
		implicitMeta := iter.UnsafeKey().IsValue()
		// First, check whether all values of the key are being deleted.
		//
		// Note that we naively can't terminate GC'ing keys loop early if we
		// enter this branch, as it will update the stats under the provision
		// that the (implicit or explicit) meta key (and thus all versions) are
		// being removed. We had this faulty functionality at some point; it
		// should no longer be necessary since the higher levels already make
		// sure each individual GCRequest does bounded work.
		if hlc.Timestamp(meta.Timestamp).LessEq(gcKey.Timestamp) {
			// For version keys, don't allow GC'ing the meta key if it's
			// not marked deleted. However, for inline values we allow it;
			// they are internal and GCing them directly saves the extra
			// deletion step.
			if !meta.Deleted && !inlinedValue {
				return errors.Errorf("request to GC non-deleted, latest value of %q", gcKey.Key)
			}
			if meta.Txn != nil {
				return errors.Errorf("request to GC intent at %q", gcKey.Key)
			}
			if ms != nil {
				if inlinedValue {
					updateStatsForInline(ms, gcKey.Key, metaKeySize, metaValSize, 0, 0)
					ms.AgeTo(timestamp.WallTime)
				} else {
					ms.Add(updateStatsOnGC(gcKey.Key, metaKeySize, metaValSize, meta, meta.Timestamp.WallTime))
				}
			}
			if !implicitMeta {
				if err := rw.Clear(iter.UnsafeKey()); err != nil {
					return err
				}
				count++
			}
		}

		if !implicitMeta {
			// The iter is pointing at an MVCCMetadata, advance to the next entry.
			iter.Next()
		}

		// TODO(tschottdorf): Can't we just Seek() to a key with timestamp
		// gcKey.Timestamp to avoid potentially cycling through a large prefix
		// of versions we can't GC? The batching mechanism in the GC queue sends
		// requests susceptible to that happening when there are lots of versions.
		// A minor complication there will be that we need to know the first non-
		// deletable value's timestamp (for prevNanos).

		// Now, iterate through all values, GC'ing ones which have expired.
		// For GCBytesAge, this requires keeping track of the previous key's
		// timestamp (prevNanos). See ComputeStatsGo for a more easily digested
		// and better commented version of this logic.

		prevNanos := timestamp.WallTime
		for ; ; iter.Next() {
			if ok, err := iter.Valid(); err != nil {
				return err
			} else if !ok {
				break
			}
			unsafeIterKey := iter.UnsafeKey()
			if !unsafeIterKey.Key.Equal(encKey.Key) {
				break
			}
			if !unsafeIterKey.IsValue() {
				break
			}
			if unsafeIterKey.Timestamp.LessEq(gcKey.Timestamp) {
				if ms != nil {
					// FIXME: use prevNanos instead of unsafeIterKey.Timestamp, except
					// when it's a deletion.
					valSize := int64(len(iter.UnsafeValue()))

					// A non-deletion becomes non-live when its newer neighbor shows up.
					// A deletion tombstone becomes non-live right when it is created.
					fromNS := prevNanos
					if valSize == 0 {
						fromNS = unsafeIterKey.Timestamp.WallTime
					}

					ms.Add(updateStatsOnGC(gcKey.Key, MVCCVersionTimestampSize,
						valSize, nil, fromNS))
				}
				count++
				if err := rw.Clear(unsafeIterKey); err != nil {
					return err
				}
			}
			prevNanos = unsafeIterKey.Timestamp.WallTime
		}
	}

	return nil
}
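
For illustration only, here is a rough sketch of the seek-to-timestamp idea from the TODO above, written as if it lived next to MVCCGarbageCollect in pkg/storage/mvcc.go. The function name is made up, it assumes the iterator exposes SeekGE(MVCCKey) and the usual MVCC ordering (key ascending, timestamp descending), and the meta-key handling and stats bookkeeping are omitted; it is not the actual implementation.

// gcVersionsFromTimestamp is a sketch: instead of scanning forward through
// every live version, seek straight to the newest version at or below
// gcKey.Timestamp and clear everything from there down.
func gcVersionsFromTimestamp(
	rw ReadWriter, iter Iterator, gcKey roachpb.GCRequest_GCKey,
) (int64, error) {
	iter.SeekGE(MVCCKey{Key: gcKey.Key, Timestamp: gcKey.Timestamp})

	// NB: a real implementation still needs the first non-deletable version's
	// timestamp (prevNanos) for GCBytesAge, e.g. by stepping back once with
	// Prev(); that bookkeeping is intentionally left out of this sketch.

	var count int64
	for ; ; iter.Next() {
		if ok, err := iter.Valid(); err != nil {
			return count, err
		} else if !ok {
			break
		}
		unsafeKey := iter.UnsafeKey()
		if !unsafeKey.Key.Equal(gcKey.Key) || !unsafeKey.IsValue() {
			break
		}
		// Everything from here down has timestamp <= gcKey.Timestamp and is
		// eligible for removal.
		if err := rw.Clear(unsafeKey); err != nil {
			return count, err
		}
		count++
	}
	return count, nil
}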

@yangxuanjia
Author

@ajwerner
When the GC default timeout is 1 min and the reverse iterator is used, I see lots of warning logs.
At first I thought the reverse iterator sent too many GCRequests and caused the handling to time out, but after reading MVCCGarbageCollect(), I now think the forward iteration from the meta key may be what causes the 1-minute timeout.

W200615 07:02:31.458359 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.468351 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.478365 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.488780 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.498806 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.508799 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.518754 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.529027 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.539274 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.549416 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.559833 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.569881 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.579906 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.589872 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.599858 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded
W200615 07:02:31.609971 25598635 storage/gc/gc.go:329  [n1,gc,s1,r16451/1:/{Table/78-Max}] failed to GC a batch of keys: aborted before acquiring latches: context deadline exceeded

@yangxuanjia
Author

yangxuanjia commented Jun 15, 2020

@ajwerner
I use v19.1, and the code in cmd_gc.go looks like this:

func declareKeysGC(
	desc roachpb.RangeDescriptor, header roachpb.Header, req roachpb.Request, spans *spanset.SpanSet,
) {
	// Intentionally don't call DefaultDeclareKeys: the key range in the header
	// is usually the whole range (pending resolution of #7880).
	gcr := req.(*roachpb.GCRequest)
	for _, key := range gcr.Keys {
		spans.Add(spanset.SpanReadWrite, roachpb.Span{Key: key.Key})
	}
	// Be smart here about blocking on the threshold keys. The GC queue can send an empty
	// request first to bump the thresholds, and then another one that actually does work
	// but can avoid declaring these keys below.
	if gcr.Threshold != (hlc.Timestamp{}) {
		spans.Add(spanset.SpanReadWrite, roachpb.Span{Key: keys.RangeLastGCKey(header.RangeID)})
	}
	if gcr.TxnSpanGCThreshold != (hlc.Timestamp{}) {
		spans.Add(spanset.SpanReadWrite, roachpb.Span{
			// TODO(bdarnell): since this must be checked by all
			// reads, this should be factored out into a separate
			// waiter which blocks only those reads far enough in the
			// past to be affected by the in-flight GCRequest (i.e.
			// normally none). This means this key would be special
			// cased and would not acquire latches.
			Key: keys.RangeTxnSpanGCThresholdKey(header.RangeID),
		})
	}
	spans.Add(spanset.SpanReadOnly, roachpb.Span{Key: keys.RangeDescriptorKey(desc.StartKey)})
}

But the newest (v20.1) code looks like this:

func declareKeysGC(
	desc *roachpb.RangeDescriptor,
	header roachpb.Header,
	req roachpb.Request,
	latchSpans, _ *spanset.SpanSet,
) {
	// Intentionally don't call DefaultDeclareKeys: the key range in the header
	// is usually the whole range (pending resolution of #7880).
	gcr := req.(*roachpb.GCRequest)
	for _, key := range gcr.Keys {
		if keys.IsLocal(key.Key) {
			latchSpans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{Key: key.Key})
		} else {
			latchSpans.AddMVCC(spanset.SpanReadWrite, roachpb.Span{Key: key.Key}, header.Timestamp)
		}
	}
	// Be smart here about blocking on the threshold keys. The GC queue can send an empty
	// request first to bump the thresholds, and then another one that actually does work
	// but can avoid declaring these keys below.
	if gcr.Threshold != (hlc.Timestamp{}) {
		latchSpans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{Key: keys.RangeLastGCKey(header.RangeID)})
	}
	latchSpans.AddNonMVCC(spanset.SpanReadOnly, roachpb.Span{Key: keys.RangeDescriptorKey(desc.StartKey)})
}

I'm not sure whether this code is the cause of the conflict: v19.1 does not declare the span with the key's timestamp, which could cause the conflict and hurt user insert performance.

Why does AddMVCC use header.Timestamp (which is the replica clock's now()), and not key.Timestamp?

gcr := req.(*roachpb.GCRequest)
for _, key := range gcr.Keys {
	if keys.IsLocal(key.Key) {
		latchSpans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{Key: key.Key})
	} else {
		latchSpans.AddMVCC(spanset.SpanReadWrite, roachpb.Span{Key: key.Key}, header.Timestamp)
	}
}

@ajwerner
Contributor

Relates to #42780.

@yangxuanjia
Author

yangxuanjia commented Jun 16, 2020

@ajwerner
GC on a replica can make the range hang. I tried adding a sleep during GC of the sequence table, and the sequence's range hung: I could not select from the sequence table, and could not insert into the data table that uses the sequence. GC does not use a transaction, but it does go through raft.
I think that if a GC raft log entry is applied too slowly, it holds up the range's subsequent raft log entries.
So the big drop in user insert performance is not a latch conflict; it is only the serialized handling of the raft log that causes the problem.
Can GC run asynchronously?

@ajwerner
Contributor

ajwerner commented Jul 7, 2020

I think you are correct that the implementation of MVCCGarbageCollect is probably too slow because it has to iterate over all of the live versions of the key. I think that addressing this TODO would go a long way:

cockroach/pkg/storage/mvcc.go

Lines 3225 to 3230 in a72689b

// TODO(tschottdorf): Can't we just Seek() to a key with timestamp
// gcKey.Timestamp to avoid potentially cycling through a large prefix
// of versions we can't GC? The batching mechanism in the GC queue sends
// requests susceptible to that happening when there are lots of versions.
// A minor complication there will be that we need to know the first non-
// deletable value's timestamp (for prevNanos).

I'm going to move this to the storage team's backlog.

@ajwerner ajwerner added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jul 7, 2020
@ajwerner
Contributor

ajwerner commented Jul 8, 2020

FWIW I got nerd-sniped by this. A couple of fixes are inbound that should make this not a problem anymore.

  1. As you noted, with the batching and reverse iteration we were doing N^2 work, hence not being able to keep up.
  2. When there's lots of versions, we can issue clear-range operations to the storage engine and wipe them out rapidly, let's do that.
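
Not the actual change, but to illustrate what point 2 means at the storage-engine level: Pebble can drop a contiguous span of versions with a single range tombstone instead of one point deletion per version. A minimal standalone sketch against an in-memory Pebble store (the key layout and names are made up for illustration):

package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory Pebble instance, for illustration only.
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stand-ins for many MVCC versions of a single sequence key.
	for i := 0; i < 5; i++ {
		key := []byte(fmt.Sprintf("seq/ts%04d", i))
		if err := db.Set(key, []byte("v"), pebble.Sync); err != nil {
			log.Fatal(err)
		}
	}

	// One range tombstone wipes out the whole span of old versions in a single
	// write, instead of one Clear per version as MVCCGarbageCollect does today.
	// The end key is exclusive, so the newest version ("seq/ts0004") survives.
	if err := db.DeleteRange([]byte("seq/ts0000"), []byte("seq/ts0004"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
}

This is also the kind of operation the next comment is cautious about, since many small range tombstones have historically hurt read performance.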

@petermattis
Collaborator

When there's lots of versions, we can issue clear-range operations to the storage engine and wipe them out rapidly, let's do that.

We need to be careful about this. Historically, adding many range tombstones was very bad for performance. I think we resolved most (all?) of those issues, but I'm still nervous about embracing using range tombstones below the level of a Range.

@ajwerner
Contributor

ajwerner commented Jul 8, 2020

When there's lots of versions, we can issue clear-range operations to the storage engine and wipe them out rapidly, let's do that.

We need to be careful about this. Historically, adding many range tombstones was very bad for performance. I think we resolved most (all?) of those issues, but I'm still nervous about embracing using range tombstones below the level of a Range.

Ack. I typed up the change in two parts. I'll split it up. Addressing 1. is certainly worth it. I can understand being hesitant about 2.

@ajwerner
Contributor

ajwerner commented Jul 8, 2020

We can litigate the value of 2. with more caution and scrutiny.

@petermattis
Collaborator

We can litigate the value of 2. with more caution and scrutiny.

Agreed.

@ajwerner
Contributor

I’m leaving this open for more discussion about #51230.

@tbg
Member

tbg commented Mar 4, 2021

#61455 now tracks #51230.
