Contents of validator defs. file is deleted #2159

Closed
maa-x opened this issue Jan 17, 2021 · 5 comments

@maa-x

maa-x commented Jan 17, 2021

Description

Having run out of disk space on my Geth Ethereum 1 node, I switched over to Infura, stopped the Geth service, and deleted the chaindata directory. The beacon node has since synced back up and is running without errors, but the validator client refuses to start and validate.

In the logs, all I can see is:

Jan 17 12:04:30 xxx lighthouse[1730]: Jan 17 12:04:30.412 CRIT Failed to start validator client        reason: Unable to open or create validator definitions: UnableToParseFile(EndOfStream)
Jan 17 12:04:30 xxx lighthouse[1730]: Jan 17 12:04:30.412 INFO Internal shutdown received              reason: Failed to start validator client

Version

1.0.6

Present Behaviour

Validator exits

Expected Behaviour

Validator starts and validates on the network

@michaelsproul
Member

As discussed on Discord, I think this is a bug: the validator definitions YAML file ended up empty.

I think a race condition like this is possible:

  1. Lighthouse truncates file upon opening it in write mode (need to verify where this happens)
  2. Disk is filled independently
  3. Lighthouse fails to write the YAML because the disk is now full
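
A minimal sketch of how that sequence could empty the file, using only the Rust standard library. This is not Lighthouse's actual code: the function names are hypothetical, and the disk-full condition is simulated by returning an error before any bytes are written.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: overwrites `path` in place.
/// `File::create` truncates an existing file to zero length as soon as it is opened.
fn overwrite_in_place(path: &Path, new_contents: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?; // step 1: the old contents are gone from here on
    file.write_all(new_contents)?; // step 3: a failure here leaves an empty or partial file
    file.sync_all()
}

/// Hypothetical helper: simulates the write failing after the file was opened
/// (for example because the disk filled up in between).
fn simulate_failed_write(path: &Path) -> io::Result<()> {
    let _file = File::create(path)?; // truncation has already happened
    Err(io::Error::new(
        io::ErrorKind::Other,
        "simulated: no space left on device",
    ))
}
```

After `simulate_failed_write` returns its error, the file at `path` still exists but is zero bytes long, which would match the `UnableToParseFile(EndOfStream)` error from the logs above.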

@maa-x
Author

maa-x commented Jan 17, 2021

Solution that worked for me:

Re-import the validator keys with something like the following, depending on your setup:

$ lighthouse --network mainnet account validator import --directory $HOME/eth2deposit-cli/validator_keys --datadir /var/lib/lighthouse

Note that you will then have to edit the generated validator_definitions.yml to add your password (see https://lighthouse-book.sigmaprime.io/validator-management.html).

@paulhauner
Member

I believe I've seen this before as well. I'm going to flag it as a bug.

@paulhauner paulhauner added the bug Something isn't working label Jan 17, 2021
@paulhauner paulhauner changed the title Validator refusing to start - Unable to open or create validator definitions Contents of validator defs. file is deleted Jan 17, 2021
@michaelsproul michaelsproul self-assigned this May 8, 2021
@michaelsproul
Member

This just happened to another user on Discord. I'll spend a few hours on Monday to get it fixed.

bors bot pushed a commit that referenced this issue May 12, 2021
## Issue Addressed

Closes #2159

## Proposed Changes

Rather than trying to write the validator definitions to disk directly, use a temporary file called `.validator_definitions.yml.tmp` and then atomically rename it to `validator_definitions.yml`. This avoids truncating the primary file, which can cause permanent damage when the disk is full.

The same treatment is also applied to the validator key cache, although the situation is less dire if it becomes corrupted because it can just be deleted without the user having to reimport keys or resupply passwords.
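
A minimal sketch of the write-then-rename pattern described above, using only the standard library. It is an illustration rather than the code in this change: `write_via_temp_file` and its parameters are hypothetical names, and the `sync_all` call is an extra precaution assumed here, not something the description states.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: writes `contents` to `tmp_path`, then renames it over `final_path`.
/// If the write fails (e.g. disk full), `final_path` is never opened, so it keeps its old contents.
fn write_via_temp_file(final_path: &Path, tmp_path: &Path, contents: &[u8]) -> io::Result<()> {
    {
        let mut tmp = File::create(tmp_path)?; // truncates only the temporary file
        tmp.write_all(contents)?;
        tmp.sync_all()?; // assumption: flush data to disk before the rename
    } // the temporary file handle is closed here
    // Replaces `final_path` in one step on UNIX; see the Windows caveat under Additional Info.
    fs::rename(tmp_path, final_path)
}
```

The temporary file should live in the same directory (or at least on the same filesystem) as the target, since a rename across filesystems fails rather than falling back to a copy.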

## Additional Info

* `File::create` truncates upon opening: https://doc.rust-lang.org/std/fs/struct.File.html#method.create
* `fs::rename` uses `rename` on UNIX and `MoveFileEx` on Windows: https://doc.rust-lang.org/std/fs/fn.rename.html
* UNIX `rename` call is atomic: https://unix.stackexchange.com/questions/322038/is-mv-atomic-on-my-fs
* Windows `MoveFileEx` is _not_ atomic in general, and Windows lacks any clear API for atomic file renames :(
   https://stackoverflow.com/questions/167414/is-an-atomic-file-rename-with-overwrite-possible-on-windows

## Further Work

* Consider whether we want to try a different Windows syscall as part of #2333. The `rust-atomicwrites` crate seems promising, but actually uses the same syscall under the hood presently: untitaker/rust-atomicwrites#27.
@michaelsproul
Member

Closed by #2338
