Fix (almost) infinite loop in Fileset writer when previous fileset encountered an error writing out index files #2058

Merged: 4 commits merged into master from ra/fix-writer-infinite-loop on Dec 5, 2019

Conversation

@richardartoul (Contributor) commented on Dec 4, 2019

This PR fixes a bug in the fileset writer that would trigger a (near) infinite loop when the writer was reused after the previous set of files encountered an error trying to write out their index files. The PR includes a regression test that verifies the fix and ensures the issue doesn't crop up again.

The bug was caused by the following sequence of events:

  1. For some reason (root cause still pending), writing out a set of fileset files encountered an error during the call to Close(), likely because duplicate series IDs had been written into the fileset file, which causes the writer to error out when it tries to write its index-related files.

  2. The writer is reused for writing out an entirely different set of fileset files. Normally this is fine since the call to Open() performs an implicit reset of the writer's state; however, Open() has a bug where it does not reset w.indexEntries, the slice of all the time series that were written into the file. So now we're writing out filesets for files X while still holding on to indexEntries from files Y.

  3. The fileset that we're writing happens not to have any time series for the current block start. This is a normal scenario that can happen when the M3DB nodes are not receiving any writes, or briefly after topology changes when a flush may occur for a shard that was recently closed.

  4. After writing 0 time series into the files, Close() is called on the fileset writer, which triggers the following block of code:

func (w *writer) writeIndexRelatedFiles() error {
	summariesApprox := float64(len(w.indexEntries)) * w.summariesPercent
	summaryEvery := 0
	if summariesApprox > 0 {
		summaryEvery = int(math.Floor(float64(len(w.indexEntries)) / summariesApprox))
	}

	// Write the index entries and calculate the bloom filter
	n, p := uint(w.currIdx), w.bloomFilterFalsePositivePercent
	m, k := bloom.EstimateFalsePositiveRate(n, p)
	bloomFilter := bloom.NewBloomFilter(m, k)

	err := w.writeIndexFileContents(bloomFilter, summaryEvery)
	if err != nil {
		return err
	}

Due to the implementation of EstimateFalsePositiveRate, passing a value of 0 for n will result in 9223372036854775808 being returned for the value of k (the number of hash functions the bloom filter will run for each value that is added to the bloom filter).

Normally this isn't a big deal because when the value of n is 0 there are no time series IDs to add to the bloom filter anyway, so bloomfilter.Add() never gets called.
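For reference, the sizing math an estimator like this typically uses looks roughly like the sketch below. This mirrors the standard bloom filter formulas, not necessarily the exact library code: with n == 0 the k computation divides by zero, and converting the resulting non-finite float64 to uint is implementation-dependent in Go, which is how a value like 9223372036854775808 (1 << 63) can come out of the call.

import "math"

// estimateSketch mirrors the standard bloom filter sizing formulas; it is an
// illustration only and the real EstimateFalsePositiveRate may differ.
func estimateSketch(n uint, p float64) (m uint, k uint) {
	// m bits and k hash functions for n elements at false positive rate p.
	mf := math.Ceil(-1 * float64(n) * math.Log(p) / math.Pow(math.Ln2, 2))
	// With n == 0 this divides by zero, so kf is not a finite number, and
	// the float64 -> uint conversion of a non-finite value is
	// implementation-dependent; on the affected nodes it came out as 1 << 63.
	kf := math.Ceil(mf * math.Ln2 / float64(n))
	return uint(mf), uint(kf)
}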

However, due to the aforementioned error writing out the previous set of files and the bug where indexEntries is not properly reset, the call to writeIndexFileContents will run the following function:

func (w *writer) writeIndexFileContents(
	bloomFilter *bloom.BloomFilter,
	summaryEvery int,
) error {
	// NB(r): Write the index file in order, in the future we could write
	// these in order to avoid this sort at the end however that does require
	// significant changes in the storage/databaseShard to store things in order
	// which would sacrifice O(1) insertion of new series we currently have.
	//
	// Probably do want to do this at the end still however so we don't stripe
	// writes to two different files during the write loop.
	sort.Sort(w.indexEntries)

	var (
		offset      int64
		prevID      []byte
		tagsIter    = ident.NewTagsIterator(ident.Tags{})
		tagsEncoder = w.tagEncoderPool.Get()
	)
	defer tagsEncoder.Finalize()
	for i := range w.indexEntries {
		id := w.indexEntries[i].id.Bytes()
		// Need to check if i > 0 or we can never write an empty string ID
		if i > 0 && bytes.Equal(id, prevID) {
			// Should never happen, Write() should only be called once per ID
			return fmt.Errorf("encountered duplicate ID: %s", id)
		}

		var encodedTags []byte
		if tags := w.indexEntries[i].tags; tags.Values() != nil {
			tagsIter.Reset(tags)
			tagsEncoder.Reset()
			if err := tagsEncoder.Encode(tagsIter); err != nil {
				return err
			}
			data, ok := tagsEncoder.Data()
			if !ok {
				return errWriterEncodeTagsDataNotAccessible
			}
			encodedTags = data.Bytes()
		}

		entry := schema.IndexEntry{
			Index:       w.indexEntries[i].index,
			ID:          id,
			Size:        int64(w.indexEntries[i].size),
			Offset:      w.indexEntries[i].dataFileOffset,
			Checksum:    int64(w.indexEntries[i].checksum),
			EncodedTags: encodedTags,
		}

		w.encoder.Reset()
		if err := w.encoder.EncodeIndexEntry(entry); err != nil {
			return err
		}

		data := w.encoder.Bytes()
		if _, err := w.indexFdWithDigest.Write(data); err != nil {
			return err
		}

		// Add to the bloom filter, note this must be zero alloc or else this will
		// cause heavy GC churn as we flush millions of series at end of each
		// time window
		bloomFilter.Add(id)

		if i%summaryEvery == 0 {
			// Capture the offset for when we write this summary back, only capture
			// for every summary we'll actually write to avoid a few memcopies
			w.indexEntries[i].indexFileOffset = offset
		}

		offset += int64(len(data))

		prevID = id
	}

	return nil
}

Since w.indexEntries was never properly reset, bloomfilter.Add() will be called and the goroutine will get stuck in a near infinite loop where it tries to run 9223372036854775808 hash functions.

This issue was extremely hard to debug because it manifested as the M3DB processes turning into "zombies" with one CPU core constantly pegged, but the nodes would not respond to any RPCs (or networking in general), so standard pprof tooling could not be used.

The reason this happened is that all of the function calls within bloomfilter.Add() are inlined, making the entire call non-preemptible by the GC until all 9223372036854775808 hash functions have completed.

So when a stop-the-world GC was started by the Go runtime, it stopped all the other active goroutines that could have served network requests and then hung forever waiting for the goroutine running bloomfilter.Add() to complete so it could begin garbage collection.
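As an illustration of the preemption problem, here is a minimal, hypothetical reproduction (not code from this PR). It assumes a Go runtime without asynchronous preemption, i.e. pre Go 1.14, which is presumably what these nodes were running at the time; the names spin and the constants are made up for the example.

package main

import (
	"runtime"
	"time"
)

// spin burns CPU in a loop with no function calls, so on a runtime without
// asynchronous preemption there is no safepoint at which the scheduler can
// stop this goroutine until the loop finishes.
func spin(n uint64) uint64 {
	var x uint64
	for i := uint64(0); i < n; i++ {
		x += i
	}
	return x
}

func main() {
	go spin(1 << 62)                   // effectively never finishes
	time.Sleep(100 * time.Millisecond) // let spin() get scheduled on another P
	runtime.GC()                       // stop-the-world cannot preempt spin(); hangs here on pre-1.14 runtimes
	println("GC completed")            // reached only on runtimes with async preemption
}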

This is clearly visible in the output from sudo perf top, which shows the Go runtime stuck trying to start a stop-the-world GC, and also shows that bloomfilter.Add is stuck in a very long loop based on how much time is being spent on the highlighted assembly instructions.

[perf top screenshots: debug1, debug2, debug3, debug4]

This PR likely requires several other followups:

  1. Prevent EstimateFalsePositiveRate from returning absurdly large values of k when the value of n is zero (see the sketch after this list).
  2. Figure out what combination of cold flushing and background repairs (and their interactions) causes fileset writes to fail immediately after topology changes.
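For followup 1, one possible shape of a guard is sketched below. This is only an illustration, not the planned change; estimateWithGuard is a made-up wrapper name, and it assumes the bloom.EstimateFalsePositiveRate signature shown in the snippet above.

// estimateWithGuard never feeds n == 0 into the estimate.
func estimateWithGuard(n uint, p float64) (m uint, k uint) {
	if n == 0 {
		// An empty fileset adds nothing to the bloom filter anyway, so any
		// small valid parameters will do; clamping n avoids the pathological k.
		n = 1
	}
	return bloom.EstimateFalsePositiveRate(n, p)
}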

However, we will get this PR merged ASAP to prevent the nodes from getting stuck in undebuggable states.
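The shape of the change itself is small. Below is a minimal sketch of the kind of reset Open() needs; the field names are taken from the snippets in this thread, but the actual diff may organize it differently (for example as a reset() helper, as suggested in the review below).

// reset clears the writer state that must not leak between filesets.
func (w *writer) reset() {
	w.currIdx = 0
	w.currOffset = 0
	w.err = nil
	// Crucially, also drop the index entries accumulated for the previous
	// fileset so a failed Close() cannot leak them into the next set of
	// files; the backing array is kept for reuse.
	w.indexEntries = w.indexEntries[:0]
}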

codecov bot commented on Dec 4, 2019

Codecov Report

Merging #2058 into master will decrease coverage by 15.8%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2058      +/-   ##
==========================================
- Coverage    73.9%    58.1%   -15.9%     
==========================================
  Files        1013     1008       -5     
  Lines      103993   134996   +31003     
==========================================
+ Hits        76954    78553    +1599     
- Misses      22198    49862   +27664     
- Partials     4841     6581    +1740
Flag         Coverage Δ
#aggregator  71%    <ø>    (-9.5%)  ⬇️
#cluster     55%    <ø>    (-30.3%) ⬇️
#collector   23.9%  <ø>    (-41%)   ⬇️
#dbnode      67.9%  <100%> (+17.4%) ⬆️
#m3em        21.1%  <ø>    (-39.2%) ⬇️
#m3ninx      65.1%  <ø>    (-1.7%)  ⬇️
#m3nsch      60.1%  <ø>    (-10.6%) ⬇️
#metrics     8.3%   <ø>    (-17.1%) ⬇️
#msg         68.9%  <ø>    (-5.2%)  ⬇️
#query       38.6%  <ø>    (-31.2%) ⬇️
#x           69.1%  <ø>    (-11.3%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

richardartoul force-pushed the ra/fix-writer-infinite-loop branch from 1a11433 to 5199fda on Dec 4, 2019, 18:59
richardartoul changed the title from "Fix (almost) infinite loop in Fileset writes when previous fileset encountered an error writing out index files" to "Fix (almost) infinite loop in Fileset writer when previous fileset encountered an error writing out index files" on Dec 4, 2019
@@ -161,6 +161,11 @@ func (w *writer) Open(opts DataWriterOpenOptions) error {
w.currIdx = 0
w.currOffset = 0
w.err = nil
// This happens after writing the previous set of files index files, however, do it
prateek (Collaborator) commented:
mind grouping the resetting code into a new func (w *writer) reset() fn?

vdarulis (Collaborator) commented:
tangentially - should indexEntries be bounded/randomly re-allocated if it exceeds certain size to reduce memory usage? not for this PR, just curious myself.

richardartoul (Contributor, Author) replied:
@prateek sure

richardartoul (Contributor, Author) replied:
@vdarulis Theoretically yeah, it probably should. In practice we only keep one of these around per node (for the most part), and the size of this slice will never exceed the number of series for a given block/shard combination, so in practice it's usually in the 10s of thousands tops, and each item in the slice is not that big, so I think it's really unlikely it would become an issue. There are definitely other things like this in the code-base though where we're more paranoid because they could become an issue if left unchecked.

prateek (Collaborator) left a review:

LGTM


richardartoul merged commit 42d1168 into master on Dec 5, 2019
richardartoul deleted the ra/fix-writer-infinite-loop branch on December 5, 2019, 14:44