Splitstore Enhanchements #6474

vyzo · 2021-06-14T17:54:35Z

Subsumes Noop coldstore for splitstore #6458 (see PubSub Booster network integration #6418)
Cherry-picks and subsumes Splitstore optimizations #5808

This PR enhances the logic in splitstore:

Dead experimental code related to full compaction with gc is removed.
We change the compaction range model to be right to left, so that multiple compactions that would happen after a lengthy sync are coalesced to one.
We add logic to implement a noop coldstore so that we can run with fixed hardware requirements (modulo chain headers).
We introduce our own state walking code to replace WalkSnapshot so as to improve performance, be scalable regardless of height, and also include all reachable state objects to account for potential misses.
We walk the entire range from current ts to compaction boundary during compaction, so as to correctly react to sync gaps; previously we only walked at the boundary, which could result in premature purging in the case of sync gaps.
We kill the tracking store. The base utility of the tracking store was to protect new writes during compaction, but tracking the writeEpoch. However, this breaks down (very) badly in reality as vm flushing (see vm.Copy) intelligently tries to avoid duplicate writes and occurs checks with Has.
We fix a race between purge and objects being considered live by higher layers that would lead to a fatal error with a noop coldstore and result in live state objects being prematurely moved to cold storage otherwise. The issue is that between marking and purge an object might be recreated and checked for existence in the blockstore by the vm with Has. If the object was not marked as rechable and the access happened before purge, then it would result in a miss. We fix this by making the compaction transactional, in that during compaction accesses to objects are recorded so as to not purge live objects.
We treat Has as an implicit (recursive) write to account for vm behaviour on Copy.
We keep all headers in the hotstore as it is not currently safe to discard them.

Follow up:

We need a way to mark network initiated blockstore accesses such as not to consider these objects live for compaction purposes and blow the cache. This will need a blockstore interface change to accept options.
We need a way for the vm to explicitly tell us whether to treat a Has request as an implicit write. We currently do it for every invocation (which is safe for now, given our usage patterns), but this is brittle; this will similarly require a blockstore interface change to accept options.
We need to identify how many headers we need to keep in the case of fixed hardware requirements (e.g. boosters), so as not to slowly grow the hotstore with chain headers.

vyzo · 2021-06-28T12:23:14Z

rebased on master.

this is necessary to avoid wearing clown shoes when the node stays offline for an extended period of time (more than 1 finality). Basically it gets quite slow if we do the full 2 finality walk, so we try to avoid it unless necessary. The conditions under which a full walk is necessary is if there is a sync gap (most likely because the node was offline) during which the tracking of writes is inaccurate because we have not yet delivered the HeadChange notification. In this case, it is possible to have actually hot blocks to be tracked before the boundary and fail to mark them accordingly. So when we detect a sync gap, we do the full walk; if there is no sync gap, we can just use the much faster boundary epoch walk.

for maximal safety.

vyzo · 2021-07-10T15:34:35Z

Follow up issue for testing: #6725

so that we preclude the following scenario: Start compaction. Start view. Finish compaction. Start compaction. which would not wait for the view to complete.

Because stebalien has allergies.

Stebalien

Please address my final final review in #6718 before merging. But I'm going to give this a 👍 anyways because the change shouldn't be too difficult.

blockstore/splitstore/splitstore.go

We can add after Wait is called, which is problematic with WaitGroups. This instead uses a mx/cond combo and waits while the count is > 0. The only downside is that we might needlessly wait for (a bunch) of views that started while the txn is active, but we can live with that.

vyzo · 2021-07-13T06:11:29Z

Addressed the final final review issue in 257423e, and finetuned in af39952

vyzo · 2021-07-13T06:14:47Z

Summoning @magik6k -- this is ready for you!

magik6k

Can't spot anything to be nitpicky about, I guess this means we can see if it floats.

blockstore/splitstore/splitstore.go

Stebalien · 2021-07-13T17:03:01Z

blockstore/splitstore/splitstore.go

+					quiet = true
+					log.Warnf("error checking markset: %s", err)
+				}
+				continue


This is still inconsistent with trackTxnRef.

in what way?
Are you concerned about an error not causing the transaction to abort?
Currently the only way the markset errors is if it has been closed, ie the transaction has been aborted.

In trackTxnRef, we track the ref even when we error. Here, we skip when we error.

ah, yes you are right; let me fix it.

fixed in 7785467

Stebalien · 2021-07-13T17:05:44Z

blockstore/splitstore/splitstore.go

 	}
+	s.txnViewsWaiting = false


So, we're waiting for all the views to end, but then we're not doing anything atomically. That can't be right.

Should we be waiting for all the non-tracked views to end? I.e., should we have a return at https://github.com/filecoin-project/lotus/pull/6474/files#diff-eac9e730a0594047de6e81aa421dcd33aac194e505a18945ca45e72db789687cR642?

wait, what? we have already started the transaction.
All the views are tracked now as they always increment the rlock.
The bariier just ensures that there is no view that started before the transaction that hasn't ended.

Ok. This should be fine.

Ideally, we'd track views started after the transaction starts separately, but that's not strictly necessary.

Yeah.

To summarize our sync conversation: a long running view is a bug and i'd very much rather hang the compaction (which is something we'll see in the logs) than do a potentially catastrophic delete.

vyzo requested a review from raulk June 14, 2021 17:54

This was referenced Jun 14, 2021

Noop coldstore for splitstore #6458

Closed

Splitstore optimizations #5808

Closed

jacobheun added the team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs label Jun 15, 2021

jacobheun added this to the v1.11.x milestone Jun 17, 2021

vyzo mentioned this pull request Jun 23, 2021

Splitstore: Online Garbage Collection for the Coldstore #6577

Open

vyzo force-pushed the feat/splitstore-redux branch from 923eaeb to 98c6530 Compare June 28, 2021 12:22

jennijuju modified the milestones: v1.11.0, 1.11. - Backlog for P2, P3 and P4 priority tasks Jun 28, 2021

vyzo requested a review from Stebalien July 2, 2021 19:37

vyzo and others added 18 commits July 4, 2021 18:38

noop blockstore

4d3c73f

hook noop blockstore for splitstore in DI

5cca29d

kill full splitstore compaction, simplify splitstore configuration

04f2e10

implement chain walking

e3cbeec

decrease CompactionThreshold to 3 finalities

d7ceef1

metrics: increment misses in View().

b2b7eb2

don't open bolt tracking store with NoSync, it might get corrupted

e9f531b

keep genesis-linked state hot

7cf75e6

more robust handling of sync gap walks

bdb97d6

first visit the cid, then short-circuit non dagcbor objects

d33a44e

fix test

fda291b

reduce SyncGapTime to 1 minute

fa64814

for maximal safety.

also walk parent message receipts when including messages in the walk

41573f1

refactor genesis state loading code into its own method

7c814cd

keep headers hot when running with a noop splitstore

997f2c0

don't try to visit genesis parent blocks

7b02673

don't attempt compaction while still syncing

3fe4261

vyzo added 8 commits July 9, 2021 19:12

remove unused lookback constructs

18161fe

rename noopstore to discardstore

c0a1cff

update wording around discard store

b9a5ea8

fix test

4129038

refactor debug log code to eliminate duplication

f5ae10e

handle id cids in internal versions of view/get

870a47f

address review comments

0c5e336

fix lint

df9670c

vyzo mentioned this pull request Jul 10, 2021

Splitstore Harness Test #6725

Open

This was referenced Jul 10, 2021

Splitstore: reification of cold objects #6726

Closed

Splitstore Garbage Collection #6728

Closed

jennijuju removed this from the 1.11. - Backlog for P2, P3 and P4 priority tasks milestone Jul 12, 2021

vyzo and others added 3 commits July 13, 2021 03:11

always return the waitgroup in protectView

759594d

so that we preclude the following scenario: Start compaction. Start view. Finish compaction. Start compaction. which would not wait for the view to complete.

put a mutex around HeadChange

60212c8

nit: remove useless goto

04abd19

Because stebalien has allergies.

Stebalien approved these changes Jul 13, 2021

View reviewed changes

blockstore/splitstore/splitstore.go Outdated Show resolved Hide resolved

vyzo added 2 commits July 13, 2021 09:01

finetune view waiting

af39952

vyzo requested a review from magik6k July 13, 2021 06:14

magik6k approved these changes Jul 13, 2021

View reviewed changes

magik6k merged commit c37401a into master Jul 13, 2021

magik6k deleted the feat/splitstore-redux branch July 13, 2021 10:43

Stebalien reviewed Jul 13, 2021

View reviewed changes

jacobheun added the epic/splitstore label Jul 14, 2021

jennijuju added this to the 1.11.1 milestone Jul 19, 2021

LesnyRumcajs mentioned this pull request Jul 22, 2022

Spike: garbage collection in Forest ChainSafe/forest#1708

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Splitstore Enhanchements #6474

Splitstore Enhanchements #6474

vyzo commented Jun 14, 2021 •

edited

Loading

vyzo commented Jun 28, 2021

vyzo commented Jul 10, 2021

Stebalien left a comment

vyzo commented Jul 13, 2021

vyzo commented Jul 13, 2021

magik6k left a comment

Stebalien Jul 13, 2021

vyzo Jul 13, 2021

Stebalien Jul 13, 2021

vyzo Jul 13, 2021

vyzo Jul 13, 2021

Stebalien Jul 13, 2021

vyzo Jul 13, 2021

Stebalien Jul 13, 2021

vyzo Jul 13, 2021

Splitstore Enhanchements #6474

Splitstore Enhanchements #6474

Conversation

vyzo commented Jun 14, 2021 • edited Loading

vyzo commented Jun 28, 2021

vyzo commented Jul 10, 2021

Stebalien left a comment

Choose a reason for hiding this comment

vyzo commented Jul 13, 2021

vyzo commented Jul 13, 2021

magik6k left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyzo commented Jun 14, 2021 •

edited

Loading