-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: add a configuration to enable GC of system.rangelog #30913
Conversation
b809190
to
1ad4130
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! @tschottdorf, are you the right person to review?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 137 at r1 (raw file):
"if non zero, range log entries older than this duration are deleted periodically "+ "based on storage.rangelog.gc_interval", 0,
I think we could probably default this to something like 30d. @bdarnell do you have thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 137 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I think we could probably default this to something like 30d. @bdarnell do you have thoughts?
Yeah, I think this should be on by default. 30d seems reasonable (or maybe the threshold should be based on the size of the log, to cover cases when the log blows up suddenly due to rebalancer thrashing?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be running on every server and there's no need for that. Can we run this only on a specific node? Say check for which node is the leaseholder for the rangelog and only try the delete there.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad this is getting done. I'll echo @BramGruneir's comment.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 89 at r1 (raw file):
const ( // defaultRangeLogGCInterval is the default interval to run gc on rangelog. defaultRangeLogGCInteval = 30 * time.Minute
s/Inteval/Interval/g
(you've got it spelled right in the comment!)
1ad4130
to
8fcfd6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it to run on lease holder of range 1.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 89 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/Inteval/Interval/g
(you've got it spelled right in the comment!)
Done.
pkg/server/server.go, line 137 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yeah, I think this should be on by default. 30d seems reasonable (or maybe the threshold should be based on the size of the log, to cover cases when the log blows up suddenly due to rebalancer thrashing?)
Done.
8fcfd6b
to
2acfad2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a coarse review. This already looks pretty good! A little bit of work is left to do to make sure this doesn't cause problems. Thanks so far @mvijaykarthik!
Reviewed 5 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 41 at r2 (raw file):
"github.com/cockroachdb/cmux"
Remove this newline.
pkg/server/server.go, line 131 at r2 (raw file):
// RangeLogTTL is the TTL for rows in system.rangelog. If non zero, range log // entries are periodically garbage collectd. The period is given by
collected
pkg/server/server.go, line 144 at r2 (raw file):
RangeLogGCInterval = settings.RegisterDurationSetting( "server.rangelog.gc_interval", "interval for running gc on rangelog. If storage.rangelog.ttl is non zero, "+
Shouldn't this simply be a function of server.rangelog.ttl
(i.e. not a cluster var)? I think it should be computed as min(10*time.Minute, RangeLogTTL)
. We don't want to want 30 days to make an attempt or deployments which see restarts every <30 days will never GC the range log. We also can't just blindly run at startup because leaseholdership is often not known then (on top of generating silly errors that might end up in the logs).
pkg/server/server.go, line 1071 at r2 (raw file):
// It returns the node currently responsible for performing GC, the number of // rows affected and error (if any). func (s *Server) gcRangeLog(
please extract all of this new code into a new file server_rangelog_gc.go
.
pkg/server/server.go, line 1074 at r2 (raw file):
ctx context.Context, cutoffTimestamp time.Time, ) (roachpb.NodeID, int, error) { const selectLeaseHolderStmt = `SELECT lease_holder from crdb_internal.ranges where range_id=1`
Careful, last I checked this was really expensive -- it sends an RPC to all nodes, collects information about all ranges, queries all leaseholders in a loop, and then throws it all away just to give you your boolean -- @changangela, I think you worked on this recently in #30846. What I say above is still mostly true, except for the loop to get the leaseholder, right? That would be way too expensive.
I think you can use
repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1))
repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now())
for a much cheaper check.
pkg/server/server.go, line 1075 at r2 (raw file):
) (roachpb.NodeID, int, error) { const selectLeaseHolderStmt = `SELECT lease_holder from crdb_internal.ranges where range_id=1` const deleteStmt = `DELETE FROM system.rangelog WHERE timestamp <= $1 LIMIT 1000`
This is a queue-like workload and as such we should give it both a lower and upper bound so that the query reads
DELETE FROM system.rangelog WHERE timestamp <= $1 AND timestamp >= $2 LIMIT 1000
and in each iteration update the $2 with the last returned result (and you return the final $2 from the method so that it can be passed in again for the next run).
When the loop runs for the first time, I think it's OK if it uses a $2 of zero (hard to avoid).
Also make sure that you exercise the case in which you insert north of 1000 entries to be GC'ed, if you haven't already.
pkg/server/server.go, line 1081 at r2 (raw file):
Rangelog
RangeLog (for both knobs).
pkg/server/server.go, line 1136 at r2 (raw file):
func (s *Server) startRangeLogGC(ctx context.Context) { s.stopper.RunWorker(ctx, func(ctx context.Context) { intervalChangeCh := make(chan time.Duration)
(This is a comment on this whole method)
Unfortunately it's really difficult to write a bug-free and robust implementation of a timer loop that's sensitive to cluster settings in the "optimal" way. Could you structure this exactly like #25227 which had a much much simpler loop (and also went through a couple of iterations in review)? You can start a server that overrides the default to a low value.
Here you have the problem that the cluster setting has an "on" default (30d), so naively it won't work. Instead, you can just override the default for the GC duration to something low by passing a suitable *cluster.Settings
into TestServerArgs. Just look for a test that passes a non-nil *cluster.Settings
into a TestServerArgs
for inspiration. You can set the value with myClusterVar.Set(&st.SV, 3*time.Millisecond)
.
pkg/storage/log_test.go, line 401 at r2 (raw file):
} func TestLogGC(t *testing.T) {
I'd prefer if this lived close to the code that you're testing, in pkg/server
. Any reason to not do that?
pkg/storage/log_test.go, line 518 at r2 (raw file):
a.Equal(2000, rangeLogRowCount()) // GC everything
Stylistic nit: add .
to every sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 1074 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Careful, last I checked this was really expensive -- it sends an RPC to all nodes, collects information about all ranges, queries all leaseholders in a loop, and then throws it all away just to give you your boolean -- @changangela, I think you worked on this recently in #30846. What I say above is still mostly true, except for the loop to get the leaseholder, right? That would be way too expensive.
I think you can use
repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1)) repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now())for a much cheaper check.
Since the ranges
virtual table is now just a view over ranges_no_leases
, the optimizer should be able to apply the filter first and then do the RPC request for only the one leaseholder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 1074 at r2 (raw file):
Previously, changangela (Angela Chang) wrote…
Since the
ranges
virtual table is now just a view overranges_no_leases
, the optimizer should be able to apply the filter first and then do the RPC request for only the one leaseholder.
Oh, but yeah unfortunately it still populates the entirety of the ranges_no_leases
virtual table.
010799c
to
9318ee1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also added gc for system.eventlog as it was a very minor change to also include this.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 41 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Remove this newline.
Done.
pkg/server/server.go, line 131 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
collected
Done.
pkg/server/server.go, line 144 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Shouldn't this simply be a function of
server.rangelog.ttl
(i.e. not a cluster var)? I think it should be computed asmin(10*time.Minute, RangeLogTTL)
. We don't want to want 30 days to make an attempt or deployments which see restarts every <30 days will never GC the range log. We also can't just blindly run at startup because leaseholdership is often not known then (on top of generating silly errors that might end up in the logs).
I have kept this as 10m. If people accidentally set a low TTL, the gc loop may hamper system performance. Mentioned this in the documentation of this setting.
pkg/server/server.go, line 1071 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
please extract all of this new code into a new file
server_rangelog_gc.go
.
Done.
pkg/server/server.go, line 1074 at r2 (raw file):
Previously, changangela (Angela Chang) wrote…
Oh, but yeah unfortunately it still populates the entirety of the
ranges_no_leases
virtual table.
Done. But it's not in the same transaction though. It shouldn't matter correctness wise, and should be fine performance wise as long as the lease doesn't bounce around.
pkg/server/server.go, line 1075 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
This is a queue-like workload and as such we should give it both a lower and upper bound so that the query reads
DELETE FROM system.rangelog WHERE timestamp <= $1 AND timestamp >= $2 LIMIT 1000and in each iteration update the $2 with the last returned result (and you return the final $2 from the method so that it can be passed in again for the next run).
When the loop runs for the first time, I think it's OK if it uses a $2 of zero (hard to avoid).
Also make sure that you exercise the case in which you insert north of 1000 entries to be GC'ed, if you haven't already.
Done. A test case is present with 2000 rows.
pkg/server/server.go, line 1081 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Rangelog
RangeLog (for both knobs).
Done.
pkg/server/server.go, line 1136 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
(This is a comment on this whole method)
Unfortunately it's really difficult to write a bug-free and robust implementation of a timer loop that's sensitive to cluster settings in the "optimal" way. Could you structure this exactly like #25227 which had a much much simpler loop (and also went through a couple of iterations in review)? You can start a server that overrides the default to a low value.
Here you have the problem that the cluster setting has an "on" default (30d), so naively it won't work. Instead, you can just override the default for the GC duration to something low by passing a suitable
*cluster.Settings
into TestServerArgs. Just look for a test that passes a non-nil*cluster.Settings
into aTestServerArgs
for inspiration. You can set the value withmyClusterVar.Set(&st.SV, 3*time.Millisecond)
.
Done. Changed to ticker. Tweaking this interval ideally shouldn't be required.
pkg/storage/log_test.go, line 401 at r2 (raw file):
I'm using the below to insert test rangelog events.
store.LogReplicaChangeTest(
It not usable from server package. That's why I added it here
pkg/storage/log_test.go, line 518 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Stylistic nit: add
.
to every sentence.
Done.
4fa9659
to
759bb39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good already. Some more comments but getting close!
Reviewed 6 of 6 files at r3.
Dismissed @petermattis from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server.go, line 144 at r2 (raw file):
Previously, mvijaykarthik wrote…
I have kept this as 10m. If people accidentally set a low TTL, the gc loop may hamper system performance. Mentioned this in the documentation of this setting.
SGTM.
pkg/server/server_systemlog_gc.go, line 45 at r3 (raw file):
repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1)) if err != nil { return 0, err
You want to handle the case in which there isn't a replica for r1 more gracefully to avoid spurious errors.
pkg/server/server_systemlog_gc.go, line 48 at r3 (raw file):
} if repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now()) {
I'd make that an early return, so you can de-dent all the code below. Same comment as above.
pkg/server/server_systemlog_gc.go, line 56 at r3 (raw file):
// A timestamp lower-bound is used in the delete query to avoid hitting // tombstones. // This lower bound is periodically updated after every successful delete
Sorry if that wasn't clear, but I was also expecting you to return this bound from this method. A large enough range tombstone minefield will cause a noticeable CPU spike and affect unrelated requests' p99. Ideally at some point range tombstones are less of a problem, but they are one today. The only case in which we'll eat the tombstones is when the server first starts and runs this loop.
pkg/server/server_systemlog_gc.go, line 118 at r3 (raw file):
s.stopper.RunWorker(ctx, func(ctx context.Context) { period := 10 * time.Minute
make this a const somewhere so that you can also pretty-print it into the description of the cluster setting (where you refer to the 10min interval).
pkg/server/server_systemlog_gc.go, line 120 at r3 (raw file):
period := 10 * time.Minute if storeKnobs, ok := s.cfg.TestingKnobs.Store.(*storage.StoreTestingKnobs); ok && storeKnobs.SystemLogsGCPeriod != 0 { period = storeKnobs.SystemLogsGCPeriod
Ah, I see. That's why you were able to hard-code the 10min interval. Works for me.
pkg/server/server_systemlog_gc.go, line 144 at r3 (raw file):
if rowsAffected > 0 { log.Infof(ctx, "garbage collected %d rows from %s", rowsAffected, table)
Make this V(1)
. In a sufficiently busy cluster this will log every 10min, and that's not useful.
pkg/storage/log_test.go, line 401 at r2 (raw file):
Previously, mvijaykarthik wrote…
I'm using the below to insert test rangelog events.
store.LogReplicaChangeTest(
It not usable from server package. That's why I added it here
I'd prefer if you took what you needed from func (s *Store) insertRangeLogEvent(
to insert dummy events.
pkg/storage/log_test.go, line 578 at r3 (raw file):
) tc.setting.Override(&s.ClusterSettings().SV, time.Nanosecond)
Please don't use Override
after the server has been started, this is flaky since your custom value can be overridden by a Gossip update. Use SET CLUSTER SETTING
instead. (I gave you bad advice here, sorry about that).
759bb39
to
9c19c03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/server_systemlog_gc.go, line 45 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
You want to handle the case in which there isn't a replica for r1 more gracefully to avoid spurious errors.
Done.
pkg/server/server_systemlog_gc.go, line 48 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
I'd make that an early return, so you can de-dent all the code below. Same comment as above.
Done.
pkg/server/server_systemlog_gc.go, line 56 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Sorry if that wasn't clear, but I was also expecting you to return this bound from this method. A large enough range tombstone minefield will cause a noticeable CPU spike and affect unrelated requests' p99. Ideally at some point range tombstones are less of a problem, but they are one today. The only case in which we'll eat the tombstones is when the server first starts and runs this loop.
Done. I think it will also hit tombstones if the lease holder for range 1 changes.
pkg/server/server_systemlog_gc.go, line 118 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
make this a const somewhere so that you can also pretty-print it into the description of the cluster setting (where you refer to the 10min interval).
Done.
pkg/server/server_systemlog_gc.go, line 144 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Make this
V(1)
. In a sufficiently busy cluster this will log every 10min, and that's not useful.
Done.
pkg/storage/log_test.go, line 401 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
I'd prefer if you took what you needed from
func (s *Store) insertRangeLogEvent(
to insert dummy events.
Done.
pkg/storage/log_test.go, line 578 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Please don't use
Override
after the server has been started, this is flaky since your custom value can be overridden by a Gossip update. UseSET CLUSTER SETTING
instead. (I gave you bad advice here, sorry about that).
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after my comments (none of which are substantial) are addressed. Thanks @mvijaykarthik and good work!
Reviewed 6 of 6 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/server/server.go, line 41 at r4 (raw file):
"github.com/cockroachdb/cmux"
stray newline.
pkg/server/server_systemlog_gc.go, line 44 at r4 (raw file):
"server.rangelog.ttl", fmt.Sprintf( "if non zero, range log entries older than this duration are deleted periodically (%v period).",
are deleted every %s
pkg/server/server_systemlog_gc.go, line 55 at r4 (raw file):
"server.eventlog.ttl", fmt.Sprintf( "if non zero, event log entries older than this duration are deleted periodically (%v period).",
are deleted every %s
pkg/server/server_systemlog_gc.go, line 68 at r4 (raw file):
// performs gc. // The system log table is expected to have a "timestamp" column. // It returns the highest timestamp gcd, number of rows affected and error
If it didn't GC anything, it should return timestampUpperBound
, so I would just say that it returns a lower bound to be used in the next iteration.
Note that there's also a race here: if the ttl is very low, it's possible that events could be inserted at timestamps below the cutoff and so they would be missed.
I think that's OK. Perhaps worth a comment to suggest not setting the TTL lower than, say, a day.
pkg/server/server_systemlog_gc.go, line 71 at r4 (raw file):
// (if any). func (s *Server) gcSystemLog( ctx context.Context, table string, timestampLowerBound, timestampUpperBound time.Time,
just for paranoia, let's make sure that if timestampLowerBound >= timestampUpperBound
, this method works. I think it does (won't remove anything and might regress the cutoff, but that's ok) but please also take a look.
pkg/server/server_systemlog_gc.go, line 77 at r4 (raw file):
if err != nil { if log.V(1) { log.Warning(ctx, "unable to get descriptor for range 1")
Don't want to log the error? Also this would be Infof. In a large cluster most nodes won't have a replica of r1 and will see RangeNotFoundError here.
pkg/server/server_systemlog_gc.go, line 79 at r4 (raw file):
log.Warning(ctx, "unable to get descriptor for range 1") } return time.Time{}, 0, nil
You want to return the old timestamp bound to the caller.
pkg/server/server_systemlog_gc.go, line 86 at r4 (raw file):
log.Warningf(ctx, "not running gc on system.%v as node is not lease holder of range 1", table) } return time.Time{}, 0, nil
ditto
pkg/server/server_systemlog_gc.go, line 149 at r4 (raw file):
is the timestamp below which events are gc'ed.
pkg/server/server_systemlog_gc_test.go, line 80 at r4 (raw file):
txn.OrigTimestamp().GoTime(), testRangeID, 1, // storeID.
optional nit: no dot here (inline comment). Go style guide 🤷♂️
pkg/server/server_systemlog_gc_test.go, line 104 at r4 (raw file):
// GC up to maxTs1. tm, rowsGCd, err := ts.GCSystemLog(ctx, table, timeutil.Unix(0, 0), maxTs1)
nit: maxTS1, maxTS2, ... (Golang style guide wants abbreviations capitalized, like HTTP
not Http
).
9c19c03
to
2407789
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/server/server.go, line 41 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
stray newline.
Ah sorry. My editor auto-formats imports and it keeps adding this newline every time I edit this file.
pkg/server/server_systemlog_gc.go, line 44 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
are deleted every %s
Done.
pkg/server/server_systemlog_gc.go, line 55 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
are deleted every %s
Done.
pkg/server/server_systemlog_gc.go, line 68 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
If it didn't GC anything, it should return
timestampUpperBound
, so I would just say that it returns a lower bound to be used in the next iteration.Note that there's also a race here: if the ttl is very low, it's possible that events could be inserted at timestamps below the cutoff and so they would be missed.
I think that's OK. Perhaps worth a comment to suggest not setting the TTL lower than, say, a day.
Done.
pkg/server/server_systemlog_gc.go, line 71 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
just for paranoia, let's make sure that if
timestampLowerBound >= timestampUpperBound
, this method works. I think it does (won't remove anything and might regress the cutoff, but that's ok) but please also take a look.
I added a test case for equality case. I'll also add one for >
case.
pkg/server/server_systemlog_gc.go, line 77 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Don't want to log the error? Also this would be Infof. In a large cluster most nodes won't have a replica of r1 and will see RangeNotFoundError here.
Done.
pkg/server/server_systemlog_gc.go, line 79 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
You want to return the old timestamp bound to the caller.
Done.
pkg/server/server_systemlog_gc.go, line 86 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
ditto
Done.
pkg/server/server_systemlog_gc.go, line 149 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
is the timestamp below which events are gc'ed.
Done.
pkg/server/server_systemlog_gc_test.go, line 80 at r4 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
optional nit: no dot here (inline comment). Go style guide 🤷♂️
Done.
2407789
to
ba2ad24
Compare
54d5e80
to
afed3b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mvijaykarthik! I had a few edits that were so minor that I went ahead patched them into your branch, I hope that's ok.
bors r+
Reviewed 6 of 6 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Build failed |
system.rangelog table currently grows unboundedly. The rate of growth is slow (as long as there is no replica rebalancing thrashing), but it can still become a problem in long running clusters. This commit adds cluster settings to specify interval and TTL for rows in system.rangelog. By default, TTL of system.rangelog is set to 30 days, and that for system.eventlog to 90 days. Fixes cockroachdb#21260. Release note (sql change): the range log and system events logs will automatically purge records older than 30 and 90 days, respectively. This can be adjusted via the server.rangelog.ttl and server.eventlog.ttl cluster settings.
Overlooked a silly lint error. Second time's the charm. bors r+ |
Thanks @tschottdorf ! |
Build failed |
Flaked on #31246 |
Just triggered the merge for the PR that fixes that, so: bors r+ |
30913: server: add a configuration to enable GC of system.rangelog r=tschottdorf a=mvijaykarthik system.rangelog table currently grows unboundedly. The rate of growth is slow (as long as there is no replica rebalancing thrashing), but it can still become a problem in long running clusters. This commit adds cluster settings to specify interval and TTL for rows in system.rangelog. By default, TTL of system.rangelog is set to 30 days. Fixes #21260 Release note: Add configuration to enable GC of system.rangelog 31239: sql: attempt to deflake distsql physical planner tests r=tschottdorf a=jordanlewis Make sure the range cache is populated before verifying things about it. This seems like a hack, but otherwise I think these will just keep flaking. Closes #25808. Closes #31235. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Jordan Lewis <[email protected]>
Build succeeded |
system.rangelog table currently grows unboundedly. The rate of
growth is slow (as long as there is no replica rebalancing
thrashing), but it can still become a problem in long running
clusters.
This commit adds cluster settings to specify interval and TTL
for rows in system.rangelog.
By default, TTL of system.rangelog is set to 30 days.
Fixes #21260
Release note: Add configuration to enable GC of system.rangelog