Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dbnode] Cache the results of the commitlog bootstapper between runs #2635

Closed
wants to merge 2 commits into from

Conversation

ryanhall07
Copy link
Collaborator

What this PR does / why we need it:
A recent refactoring of cold writes ( #2508) introduced a regression
that results in the commit log being read twice while bootstrapping. The referenced PR changed the
commitlog source to read all time ranges, not just the requested time ranges. This was necessary since
the commitlog may have cold writes that were never commmited to a fileset. Another bootstrapper (filesystem or peer)
might report a ShardTimeRange as fulfilled, but it could be missing recent cold writes from a commit log.

The bootstrapper is split into 2 runs, one for the cold time ranges and one for the recent warm time range. Since the commitlog source
now reads the full time range for both runs, it resulted in 2 full reads of the commit log/snapshots.

This commit introduces a simple fix that caches the results of the commitlog source between runs. Other options
were considered, such as adding an option to disable a specific bootstrapper. This change seemed the most ideal since it
isolates the special logic of the commitlog bootstrapper to a single file.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:
NONE

Does this PR require updating code package or user-facing documentation?:
NONE

A recent refactoring of cold writes ( #2508) introduced a regression
that results in the commit log being read twice while bootstrapping. The referenced PR changed the
commitlog source to read all time ranges, not just the requested time ranges. This was necessary since
the commitlog may have cold writes that were never commmited to a fileset. Another bootstrapper (filesystem or peer)
might report a ShardTimeRange as fulfilled, but it could be missing recent cold writes from a commit log.

The bootstrapper is split into 2 runs, one for the cold time ranges and one for the recent warm time range. Since the commitlog source
now reads the full time range for both runs, it resulted in 2 full reads of the commit log/snapshots.

This commit introduces a simple fix that caches the results of the commitlog source between runs. Other options
were considered, such as adding an option to disable a specific bootstrapper. This change seemed the most ideal since it
isolates the special logic of the commitlog bootstrapper to a single file.
@CLAassistant
Copy link

CLAassistant commented Sep 14, 2020

CLA assistant check
All committers have signed the CLA.

@@ -72,6 +72,14 @@ type commitLogSource struct {
newReaderFn newReaderFn

metrics commitLogSourceMetrics

// Cache the results so the commit log is not read twice. This source is special since it reads the entire
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the commitlog bootstrapper vs this source

Comment on lines 186 to 190
// bail early if this source already ran before.
if s.bootstrapResults.Results != nil {
s.log.Info("the entire range of the commit has already been read. returning previous results")
return s.bootstrapResults, nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this before the span is started? Currently we'll have a lot of 0 length spans if the cache is hit

// but it could be missing writes only available in the commit log.
//
// The bootstrapper is single threaded so we don't need a mutex for this.
bootstrapResults bootstrap.NamespaceResults
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the lifecycle of this? Any chance a downstream consumer will release the results in this map between te two reads?

@@ -175,6 +183,12 @@ func (s *commitLogSource) Read(
ctx, span, _ := ctx.StartSampledTraceSpan(tracepoint.BootstrapperCommitLogSourceRead)
defer span.Finish()

// bail early if this source already ran before.
if s.bootstrapResults.Results != nil {
s.log.Info("the entire range of the commit has already been read. returning previous results")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe make this a debug log since this will get logged every time?

Metadata: ns.namespace.Metadata,
Shards: ns.namespace.Shards,
DataResult: dataResult,
IndexResult: indexResult,
})
}

return bootstrapResult, nil
return s.bootstrapResults, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make sure these bootstrap results won't get loaded twice?

@ryanhall07
Copy link
Collaborator Author

closing this in favor of #2645

attempting to skip the commit log bootstrapper in one of the passes does not work since you still want to read the snapshot files for the requested time ranges in each pass.

@ryanhall07 ryanhall07 closed this Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants