Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(porep): skip existing layers if they already exist #1282

Closed
wants to merge 13 commits into from

Conversation

dignifiedquire
Copy link
Contributor

This allows resuming sealing operations, as only layers are generated that don't yet exist on disk.

Closes #884

cryptonemo
cryptonemo previously approved these changes Sep 15, 2020
)
.is_ok();
if generated {
// successfull load
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'succesful'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

states
}

fn generate_labels_replicate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you come up with a more descriptive name and/or add a comment to explain exactly what this does? We're getting into a soup of self-referential names which is hard to get a handle on by just looking at one part. It would be good to reverse that trend where we can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some docs and improved the name a little, honestly can't come up with something better

@porcuquine
Copy link
Collaborator

@dignifiedquire Can we also get a test — maybe an addition to the seal_lifecycle tests, which exercises this? That will help us release with confidence.

@dignifiedquire
Copy link
Contributor Author

@dignifiedquire Can we also get a test — maybe an addition to the seal_lifecycle tests, which exercises this? That will help us release with confidence.

Added a check in replicate + extract that checks that it detects the already existing layers

@porcuquine
Copy link
Collaborator

This is looking good. There is one detail, I'm not sure is accounted for. I am not 100% sure of the DiskStore's behavior, though, so I will describe both options I see. In both cases, the issue is that a layer may have been only incompletely written to disk.

  1. DiskStore writes its contents incrementally, and the file grows until the whole layer has been written. In that case, we know that a file of the correct size represents a completed layer. We should check for the size.

  2. DiskStore first 'truncates' the file to the full size in order to ensure there is enough room on disk. (This is probably what it should do. If it doesn't, we should consider changing.) In this case, we need some other way of explicitly recording that the on disk representation of the completed layer is also complete.

Either way, it seems possible the file has been partially written then interrupted. In that case, even though the layer will have been completely created, we still need to discard the layer and begin from the last layer completely stored on disk.

Although this may be heavy, one thorough solution would be to build up an incremental digest of the layer throughout the sealing process. This is not wasteful, since replication will leave most cores idle. Then the digest can be written to a status file and checked when restarting. However, this may be overkill, since the primary issue we are trying to address is not disk corruption. In that case, the mere presence of the 'completion status' file would suffice to signal completion.

As above, it might also be that we can simply use file length as a check, but also as above, we should probably aim to fully allocate space for the layer files before writing them. This will allow for better performance (less fragmentation) and also better error handling when disk is full. Incidentally, if we don't do this, then a full disk will definitely create the situation of an only-partially-written DiskStore — a situation from which we can never recover.

Regarding recovery, since we cannot prevent the possibility of a partially-written file from existing, resumption should also ensure cleanup of the partial file. I think that will happen automatically, in the current design, though.

@porcuquine
Copy link
Collaborator

Added a check in replicate + extract that checks that it detects the already existing layers

@dignifiedquire Did you mean to push something that didn't make it? I don't see this.

)
.expect("label generation failed");
for state in label_states {
assert!(state.generated);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@porcuquine this is the place where I check that things are not generated twice

@dignifiedquire
Copy link
Contributor Author

@porcuquine Instead of using the DiskStore, which was a hack anyway, it is now using regular file writes, and doing them atomically, by writing to a tmp file, and renaming afterwards. This should eleviate the issues you described

porcuquine
porcuquine previously approved these changes Sep 22, 2020
Copy link
Collaborator

@porcuquine porcuquine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Would be great if someone can manually verify this too, but tests look reasonable.

dignifiedquire and others added 2 commits September 25, 2020 23:32
This allows resuming sealing operations, as only layers are generated that don't yet exist on disk.

Closes #884
@dignifiedquire dignifiedquire changed the base branch from master to opt/sdr-phase1 September 25, 2020 21:34
@dignifiedquire
Copy link
Contributor Author

closing in favor of #1292

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seal should be resumable
3 participants