Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: Compromised artifact file integrity can lead to disputes #677

Open
mrcnski opened this issue Mar 28, 2023 · 16 comments
Open

PVF: Compromised artifact file integrity can lead to disputes #677

mrcnski opened this issue Mar 28, 2023 · 16 comments
Assignees
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be.

Comments

@mrcnski
Copy link
Contributor

mrcnski commented Mar 28, 2023

ISSUE

Overview

If we try to execute a corrupt PVF artifact it will likely lead to a dispute. See the comment below for an example of this.

A simple mitigation is to just put the hash of the file contents in the filename. Artifacts are not that big, on the order of several dozen MB (I think?). If we use something like https://github.com/BLAKE3-team/BLAKE3, which claims 6866 MiB/s throughput, then the time spent hashing would barely register.

Original comment

As far as I understand, the main concern about not clearing the artifact cache is:

  1. Unsafe functions are used to import the artifact
  2. Their safety guarantees rely on the fact the node has produces them with prepare() and they are not just some random files put there by someone.

Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will.

_Originally posted by @s0me0ne-unkn0wn in #685

@mrcnski
Copy link
Contributor Author

mrcnski commented Aug 8, 2023

For 30-40mb artifacts (and they can get much larger), 6866 MiB/s would take about 4-6 ms. This is negligible compared to preparation, which is on the order of seconds, but not compared to execution which is on the order of 10 ms. Considering that such disk corruptions are quite rare, this hardly seems worth checking before every execution.

So probably the most realistic solution here is to hash after preparation, but only check it on node restart and not before every execution.

@s0me0ne-unkn0wn suggested keeping prepared artifacts in RAM only, which would mitigate the issue of disk/FS corruptions. At first blush it doesn't seem feasible, especially with on-demand coming, but it could be worth considering...

@s0me0ne-unkn0wn
Copy link
Contributor

If you're up to some experiment you can try to compress them with sp_maybe_compressed_blob::compress() and see if it makes any good results 🙂

@mrcnski
Copy link
Contributor Author

mrcnski commented Aug 8, 2023

@s0me0ne-unkn0wn Hmm that would add decompression overhead to each execution, definitely worth considering though.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. T8-parachains_engineering and removed J7-mentor labels Aug 25, 2023
@s0me0ne-unkn0wn
Copy link
Contributor

Not sure I follow. You mean, hash a prepared artifact right after the preparation, include its hash into its name, and check the hash before execution? But how do you know the name of the artifact you're going to open for execution, then? Sounds like a chicken and egg situation.

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 10, 2023

Not sure I follow. You mean, hash a prepared artifact right after the preparation, include its hash into its name, and check the hash before execution? But how do you know the name of the artifact you're going to open for execution, then? Sounds like a chicken and egg situation.

I was thinking on node startup, we just list all the artifacts in the cache path, check consistency with the hash, and load them into the artifacts table? As mentioned above, hashing before each execution would be too much of a performance hit.

@s0me0ne-unkn0wn
Copy link
Contributor

Aha, thanks for the elaboration, that does make sense. But that would protect from accidental artifact corruption but not from malicious action. On the other hand, I don't see any easy way to protect against malicious artifacts while still keeping them on disk anyway. (Sign them with the validator key maybe? But that would require some effort)

@burdges
Copy link

burdges commented Oct 10, 2023

How big are artifacts? A few megabytes? blake2 claims 3.08 cycles per byte, so 2/1000th of a second? You;d need to load the PVF only once of course, but that'd avoid a race condition too.

It's otoh just one of many reasons validataors need to avoid their machines being compromised, so whatever.

@eagr
Copy link
Contributor

eagr commented Oct 11, 2023

Please help me understand. What would happen if someone deliberately swaps the cache after preparation? And what if they do that in a large scale? Is there any incentive of doing that?

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 11, 2023

How big are artifacts? A few megabytes? blake2 claims 3.08 cycles per byte, so 2/1000th of a second? You;d need to load the PVF only once of course, but that'd avoid a race condition too.

@burdges I did a quick analysis in #677 (comment). Using the published numbers of blake3, there would be a ~5ms performance hit which I'm not sure is acceptable in PVF execution.

What do you mean by a race condition? Loading all the artifacts once into memory is an idea we had, but probably not feasible with on-demand.

It's otoh just one of many reasons validataors need to avoid their machines being compromised, so whatever.

Yeah, though a malicious attack in this way seems really unlikely now that the PVF processes have filesystem sandboxing. They are only able to access the artifacts they need to. My main concern is avoiding unneeded disputes due to a hardware failure (corrupted disk) - but if that happens while the node is running, there's probably not much we can do. OTOH, checking the hashes on node startup is acceptable and may bring some small benefit for no real cost.

Please help me understand. What would happen if someone deliberately swaps the cache after preparation? And what if they do that in a large scale? Is there any incentive of doing that?

@eagr The validator would vote wrongly, so I don't think there's any incentive to doing that deliberately or a possible attack vector there. They would just get slashed.

@burdges
Copy link

burdges commented Oct 11, 2023

My main concern is avoiding unneeded disputes due to a hardware failure (corrupted disk)

These cause invalid disputes, which disables the validator. We need not punish those too heavily, unless correlated across validators, so maybe the solution is just to take this into consideration in the slashing level?

@mrcnski
Copy link
Contributor Author

mrcnski commented Oct 11, 2023

@burdges Related to that, another idea we had with @eskimor is to only hash the file if a candidate is found to be invalid, after the fact. Reason being that this is a rare case so this additional check shouldn't affect performance too much. If an artifact was corrupted it is very unlikely to be found valid, so we shouldn't need to check file integrity in that case.

@s0me0ne-unkn0wn
Copy link
Contributor

@mrcnski sounds like a really good idea!

@burdges
Copy link

burdges commented Oct 11, 2023

It's actually more important that the validator be hard to hack, but yeah sure.

@eagr
Copy link
Contributor

eagr commented Nov 28, 2023

I guess what's left is to check integrity on execution errors and act accordingly?

@mrcnski
Copy link
Contributor Author

mrcnski commented Nov 28, 2023

Yes, I filed a follow-up here: #2399 (comment) Please coordinate with @jpserrat as he commented there, I'm not sure if he started on this yet or not!

@burdges
Copy link

burdges commented Aug 28, 2024

Yes blake3 is fast, but it's speed beyond blake2 comes from merkle trees being more parallelizable, including both SIMD and multiple threads. The SIMD is a clear win, but polkadot should be using those other CPU cores, although blake3 improves cache locallity maybe.

Anyways, we might still prefer to do this check only when some validation fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be.
Projects
Status: In Progress
Status: In Progress
Development

No branches or pull requests

5 participants