This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

Make MRenclave measurement independent of file options appearing in the Graphene manifest #2208

Closed
prakashngit opened this issue Mar 4, 2021 · 31 comments

Comments

@prakashngit

prakashngit commented Mar 4, 2021

This is a feature request for supporting extended measurements via SGX's CONFIGID (instead of measuring user trusted/allowed files via MRENCLAVE).

Context

In applications such as federated learning, one would like multiple data owners to execute the same training algorithm but with different datasets. We use Graphene to run the training algorithm. In terms of the manifest, there is a base manifest that gets shared with all data owners. Each data owner then adds the list of data files that it will use for training to the base manifest before building the Graphene image.
Ideally, the data owner should list these data-files as trusted files on the manifest so that files used for training get pre-committed before the training task starts. However, in this case, all data owners will have different MRenclaves, and it becomes hard to use the MRenclave to verify that all data-owners are executing the same training algorithm.

Our current workaround is to mount a common folder as an allowed folder (and use wildcards as well) so that all manifests, and hence all MRENCLAVEs, are the same. The challenge with this workaround is that it becomes the application's responsibility to capture measurements of the files before Graphene deployment and to verify their integrity when the files are loaded within Graphene (since we can no longer make use of the trusted file option).

Feature Request

From @dimakuv I learned that it is possible to separate measurements into two fields: 1) a "basic" measurement captured via MRENCLAVE and 2) "extended" measurements (measuring the list of trusted/allowed/protected files, etc.) captured via CONFIGID.

It would be great for our use case (and I guess also a number of other data analytics use cases) if the above feature gets added to Graphene. If it is already part of the plan, could I kindly ask for a timeline of when this feature can be expected?

Thanks!
Prakash

@tonyreina

This would be an excellent feature. As Prakash mentioned, the primary use case is federated learning. In this case, there would be an enclave on each of several remote nodes. Each enclave would run the same software (DL model training), but the data on each node would be different. The participants have to register their dataset hash before the federation is even proposed, so we know the expected hash; the worry is that, while the process runs in the enclave, the data owner could change the data after its hash has been confirmed.

So the question boils down to: Is there an easy method to make sure that the files in a trusted directory (outside of the enclave) don't change while the enclave exists? Or do we need to explicitly check the data against the hash every time we load the files?

@mkow
Member

mkow commented Mar 4, 2021

Why not use protected files for this? See https://graphene.readthedocs.io/en/latest/manifest-syntax.html#protected-files. This is the intended way of shipping additional files at runtime.

@tonyreina

Do protected files get encrypted and loaded into the enclave? I guess what we're worried about is that this is typically the case where there is a directory of maybe a few thousand files with each file being maybe 100 MB in size. So I'm just wondering about performance issues for speed.

@prakashngit
Author

Why not use protected files for this? See https://graphene.readthedocs.io/en/latest/manifest-syntax.html#protected-files. This is the intended way of shipping additional files at runtime.

Thank you for the question. In addition to the performance hit that @tonyreina alluded to, I am afraid protected files fundamentally do not solve the problem at hand. Here is why:

The data at rest resides with the data owner. No one outside the data owner node has access to the data. So we are not "protecting" data from the data owner. It does not matter if the data at rest is encrypted or not. There is no privacy issue here. The attack we are worried about is the data owner itself providing incorrect data during training after committing to a dataset before training starts. So as far as I understand, there are only two ways to solve this: 1. either list the files as trusted files in the manifest, or 2. the application, after loading the data, computes the hash and checks whether it matches what the data owner committed to in the first place. Whether the data at rest is encrypted or not does not affect the situation, in my opinion.

Thanks
Prakash
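
For illustration, option 2 above (the application hashes the data it loads and compares the result with the value committed before training) could look roughly like the sketch below. This is not Graphene code; the function names and the way the per-file hashes are combined into one dataset digest are assumptions made for the example.

```python
import hashlib
import hmac
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash one file in fixed-size chunks so large files are never fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def dataset_digest(root):
    """Combine relative paths and per-file hashes into a single hex digest.

    Files are visited in sorted order of their relative path so that the
    result does not depend on directory listing order.
    """
    root = Path(root)
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    h = hashlib.sha256()
    for rel, path in files:
        h.update(rel.encode("utf-8") + b"\0")  # bind the file name to its content
        h.update(file_sha256(path))
    return h.hexdigest()


def verify_dataset(root, committed_digest):
    """Return True only if the data on disk still matches the earlier commitment."""
    return hmac.compare_digest(dataset_digest(root), committed_digest)
```

Note that a check like this only covers the moment the files are hashed; the application would have to hash the same bytes it actually trains on (or re-check on every load) to rule out changes in between.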

@dimakuv

dimakuv commented Mar 8, 2021

To put it another way, I think what @prakashngit and @tonyreina want boils down to this:

  1. Some files are listed as sgx.trusted_files during enclave-creation time (before shipping the enclavized-app package to the remote untrusted machine).
  2. Some files are listed as sgx.more_trusted_files during enclave-startup time (after shipping the enclavized-app package to the remote untrusted machine).

There is no need to encrypt or somehow additionally protect the files from the second set. Whatever we do for sgx.trusted_files currently is sufficient for these sgx.more_trusted_files too. The only difference is that this set is decoupled from the "main" manifest file but is still reflected in the SGX remote attestation.

Of course, one can write her own code plugged into Graphene to implement this in an ad-hoc way. But this requires non-trivial implementation effort, and it makes sense to incorporate such code in mainline Graphene.

Also, this is exactly the scenario envisioned for SGX's CONFIGID (see e.g. openenclave/openenclave#3054). So it makes sense to implement some default behavior for this in Graphene.

@mkow
Member

mkow commented Mar 8, 2021

Do protected files get encrypted and loaded into the enclave? I guess what we're worried about is that this is typically the case where there is a directory of maybe a few thousand files with each file being maybe 100 MB in size. So I'm just wondering about performance issues for speed.

Yes, they are encrypted at rest and decrypted + integrity-checked inside enclave memory. The decryption uses HW acceleration, so it should be quite fast (but still, a small overhead may be seen).

The data at rest resides with the data owner. No one outside the data owner node has access to the data. So we are not "protecting" data from the data owner. It does not matter if the data at rest is encrypted or not. There is no privacy issue here. The attack we are worried about is the data owner itself providing incorrect data during training after committing to a dataset before training starts. So as far as I understand, there are only two ways to solve this: 1. either list the files as trusted files in the manifest, or 2. the application, after loading the data, computes the hash and checks whether it matches what the data owner committed to in the first place. Whether the data at rest is encrypted or not does not affect the situation, in my opinion.

Thanks for the explanation, I think I now see what your case is exactly. I proposed protected files because they are a superset of trusted files: they also do the integrity check and allow shipping files at runtime. The encryption is there and doesn't hurt, but you also get integrity in the package.

Anyways, I'd suggest either using protected files or just listing everything as trusted in the manifest.

It seems to me that you treat signing/verifying Graphene enclave and signing/verifying the input for it as completely separate things which can be done by different entities. Unfortunately, from what I know this doesn't make sense, at least in case of Graphene - if you control the filesystem contents, then you also control Graphene code, so the first signature (the one certifying that you're indeed running a specific Graphene version) is meaningless, the filesystem changes may override it. Maybe it could be possible to separate both, but that will always be app-specific and very brittle from the security perspective.

So, my point is, you should think of input same as you think of code which runs inside SGX. You may just shift the signing of the enclave to the entity which provides input and you'll have the same security properties, but no need for two signatures.

@nsxfreddy

The reason not to use protected files is that the data files in question may have use in addition to the enclave usage, but placing them in protected files requires a copy of that (large) dataset in order to encrypt it. The envisioned use case only requires integrity protection, so using protected files creates a large wasted space (100's GB or TB).

@mkow
Member

mkow commented Mar 24, 2021

The reason not to use protected files is that the data files in question may have use in addition to the enclave usage, but placing them in protected files requires a copy of that (large) dataset in order to encrypt it. The envisioned use case only requires integrity protection, so using protected files creates a large wasted space (100's GB or TB).

Ok, that's a good argument against using protected files in this case.

Anyways, what about my last point? ("You may just shift the signing of the enclave to the entity which provides input and you'll have the same security properties, but no need for two signatures.")

@nsxfreddy

We want the attestation to represent the algorithm being performed on the input data (including algorithmic defenses against the platform host manipulating the data) to provide assurance to other parties. If the data owner (which in FL is likely the platform host) signs the enclave, then verification of the attestation does not prove anything of value to the verifier(s) since they would need the data to reason about what the attestation represents. If instead the attestation reflects the algorithm (code) and a guarantee that the algorithm faithfully prevents the data modification, a verifier can proceed to reason about the algorithm independently of the data. As a bonus all enclaves are the same at each endpoint.

@mkow
Member

mkow commented Mar 24, 2021

Ok, thanks for the explanation. This is exactly what I was afraid that you want to achieve with the solution initially suggested by @prakashngit. In short, this doesn't work this way and the guarantees you're expecting aren't actually provided by @prakashngit's design.

The main problem is that if you allow the data owner to define the contents of the filesystem, then they can easily take over control of the whole enclave and e.g. overwrite the algorithm code (at runtime, so it wouldn't be reflected in MRENCLAVE), rendering the first measurement meaningless. Simple example: the data owner plants a dynamic library (.so) in a place where it will take precedence over the intended one and this way executes their own code. And as you said, the "algorithm owner" doesn't have the input data, so they can't really verify if anything suspicious was added (quoting: "since they would need the data to reason about what the attestation represents").

If instead the attestation reflects the algorithm (code) and a guarantee that the algorithm faithfully prevents the data modification, a verifier can proceed to reason about the algorithm independently of the data.

So, because of what I said above, this is not possible, at least in general case.

We could try to rescue this idea by limiting what can be added by the data owner, but this is highly app-specific and will almost always result in insecure setups, if left to the users. Example: a user wants to create a TF-as-a-service and allows others to provide TF models and data. The TF engine would go into MRENCLAVE, and the input would go into that second measurement. But what they don't know, is that TensorFlow SavedModel format wasn't designed to handle untrusted data, and if you control SavedModel data you can execute arbitrary code, thus breaking the TF-engine-provider assumptions.

There are actually quite few models where this is secure, e.g. you could limit the data owner's powers to a specific directory with specific file name patterns, then teach users which formats are secure and which aren't, and ensure they load only the safe ones. But as said above, this makes the whole solution very app-specific and puts a lot of trust in its users, who most likely are not trained in security, especially in the tricky SGX threat model.
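
To make the "specific directory with specific file name patterns" restriction concrete, here is a minimal sketch of a check over the entries a data owner wants to add. The directory `/data/train` and the patterns are hypothetical, and nothing here is an existing Graphene feature. As noted above, a path check like this says nothing about whether the file format itself is safe to parse.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical policy: the only place and file types a data owner may add.
ALLOWED_DIR = PurePosixPath("/data/train")
ALLOWED_PATTERNS = ("*.csv", "*.npy")


def is_entry_allowed(path_str):
    """Accept only absolute paths directly inside ALLOWED_DIR with an allowed name."""
    path = PurePosixPath(path_str)
    # Reject relative paths and any ".." component instead of trying to resolve them.
    if not path.is_absolute() or ".." in path.parts:
        return False
    if path.parent != ALLOWED_DIR:
        return False
    return any(fnmatchcase(path.name, pattern) for pattern in ALLOWED_PATTERNS)


def rejected_entries(paths):
    """Return the entries that violate the policy, preserving input order."""
    return [p for p in paths if not is_entry_allowed(p)]
```

With this policy, an entry such as `/data/train/part-001.csv` passes, while a planted `/data/train/libfoo.so` or anything under `/lib` is rejected.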

@g2flyer

g2flyer commented Mar 25, 2021

My five cents: to me the main (actually, only) general requirement here is that you can reason as a verifier of an attestation compositionally/separately about (files of) separate components of your application and hence manage (e.g., distribute) roots of trust/measurements independently (and potentially differently) for various components. E.g., in Prakash's case it is about separating the application from the specific workload (so you can enforce that one party in the FL scenario consistently uses the same dataset).

Note that crucially, from an assurance perspective, you want exactly the same behavior from the Graphene runtime in terms of file integrity verification as if all the files were defined in the same manifest! Of course, as is already the case today, as an application developer and attestation verifier you have to very carefully determine which files have to be measured, which values can be trusted, and where you can trust the application to safely process untrusted files (i.e., what you have in sgx.allowed_files). In fact, you could see Prakash's case as one where you would have simply placed the datasets in sgx.allowed_files iff only a single invocation of graphene would have to be involved. But with multiple invocations, you need the "commitment" from sgx.trusted_files.

Picking up on what Dmitrii mentioned above, one easy way to achieve the above requirement would be to

  • define a second manifest file (type) which only includes 'sgx.trusted_files',
  • have the (untrusted) graphene app loader include (optionally, when passed such a secondary manifest) a hash of it in CONFIGID when creating enclave and pass the filename to the enclave
  • have the graphene runtime, after parsing the standard manifest but before doing anything else,
    • read the "configid-manifest"
    • hash it and verify that the hash matches CONFIGID
    • read the 'sgx.trusted_files' and merge it with the list from the manifest
  • then proceed as is, i.e., all file integrity tests are as before
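
The steps above can be sketched as follows. This is only an illustration under stated assumptions, not Graphene code: the secondary manifest format (one `path sha256hex` pair per line), the use of a zero-padded SHA-256 as the 64-byte CONFIGID value, and the rule that the secondary manifest may not override base entries are all choices made for this example.

```python
import hashlib
import hmac

CONFIGID_SIZE = 64  # the SGX KSS CONFIGID field is 64 bytes wide


def configid_for(secondary_manifest: bytes) -> bytes:
    """Untrusted loader side: derive the CONFIGID value from the secondary manifest.

    Assumption: SHA-256 over the raw manifest bytes, zero-padded to 64 bytes.
    """
    return hashlib.sha256(secondary_manifest).digest().ljust(CONFIGID_SIZE, b"\0")


def parse_trusted_files(secondary_manifest: bytes) -> dict:
    """Parse the hypothetical 'path sha256hex' line format into a dict."""
    entries = {}
    for line in secondary_manifest.decode("utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, digest = line.rsplit(" ", 1)
        entries[path] = digest
    return entries


def merge_trusted_files(base: dict, secondary_manifest: bytes, configid: bytes) -> dict:
    """Enclave side: verify the manifest against CONFIGID, then merge the lists."""
    if not hmac.compare_digest(configid_for(secondary_manifest), configid):
        raise ValueError("secondary manifest does not match CONFIGID")
    extra = parse_trusted_files(secondary_manifest)
    # Assumed policy: the secondary manifest may add entries but never replace
    # hashes that the base manifest already pins.
    clash = sorted(set(base) & set(extra))
    if clash:
        raise ValueError(f"secondary manifest overrides base entries: {clash}")
    return {**base, **extra}
```

After the merge, file integrity checks would proceed exactly as for a single manifest, which is the property the proposal relies on.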

I think that should be easy to do and enables compositional reasoning about attestation measurements. I'm also convinced that Prakash's scenario is unlikely to be the only use case with such a requirement. Another case might be just to separate the core graphene files (PAL, maybe libc) from the actual application files as this also could make devops and policy-management easier from a separation of concern perspective even in simple outsourcing use-cases.

@mkow
Member

mkow commented Mar 25, 2021

to me the main (actually, only) general requirement here is that you can reason as a verifier of an attestation compositionally/separately about (files of) separate components of your application and hence manage (e.g., distribute) roots of trust/measurements independently (and potentially differently) for various components.

How do you exactly define composition of the measurements? Because from what @nsxfreddy wrote (and from what @prakashngit explained to me in a call yesterday) the algorithm owner doesn't have the data which were added to the Graphene configuration, yet they want to be able to reason about the algorithm which is running there (so, they want to reason only from MRENCLAVE, and treat CONFIGID just as a black box commitment to something constant). And this is not possible, unless you restrict the added data very precisely (and usually in an app-specific way).

In fact, you could see Prakash's case as one where you would have simply placed the datasets in sgx.allowed_files iff only a single invocation of graphene would have to be involved. But with multiple invocations, you need the "commitment" from sgx.trusted_files.

The difference is that in the solution with allowed files the algorithm owner could verify exactly which files are allowed in the config (i.e. paths and names). If you allow the data owner to add trusted files entries, you have no control over what they add to the filesystem.

Another case might be just to separate the core graphene files (PAL, maybe libc) from the actual application files as this also could make devops and policy-management easier from a separation of concern perspective even in simple outsourcing use-cases.

For this you can just use hash of the .tar.gz of Graphene release which was included in the final image (which you need to have) and you'll have exactly the same security properties. You can't reason about what Graphene binaries are running inside the enclave without knowing the application files, because they could overwrite Graphene. You either know all the binaries upfront or you can't really reason what's running inside, as Graphene doesn't have an internal security sandbox.

Overall, I think it would be possible to add a very limited and restricted "measured mount" which would allow mounting a directory which would go into CONFIGID instead of MRENCLAVE, but that would require a lot of precautions and still would be very risky from security perspective. And this is already different from the design proposed in this issue, which assumed that data and code are separate and FS data can't take over code.

@g2flyer

g2flyer commented Mar 25, 2021

How do you exactly define composition of the measurements? Because from what @nsxfreddy wrote (and from what @prakashngit explained to me in a call yesterday) the algorithm owner doesn't have the data which were added to the Graphene configuration, yet they want to be able to reason about the algorithm which is running there (so, they want to reason only from MRENCLAVE, and treat CONFIGID just as a black box commitment to something constant). And this is not possible, unless you restrict the added data very precisely (and usually in an app-specific way)

You wouldn't treat CONFIGID as a black box, only the particular content of the files named in the CONFIGID manifest. Of course, for this to work you have to rely on the application to do proper input validation of these files. The semantics of the overall attestation of course have to be application-specific, and there are always some dependencies between the manifests of the components (as is true for any composition). The key part, though, is that it becomes easier to reason about and work with if you can decompose.

In fact, you could see Prakash's case as one where you would have simply placed the datasets in sgx.allowed_files iff only a single invocation of graphene would have to be involved. But with multiple invocations, you need the "commitment" from sgx.trusted_files.

The difference is that in the solution with allowed files the algorithm owner could verify exactly which files are allowed in the config (i.e. paths and names). If you allow the data owner to add trusted files entries, you have no control over what they add to the filesystem.

Well, a data owner (or, in fact, anybody running any enclave) can of course add arbitrary files with arbitrary paths, but the key part is that this would be visible in the attestation (via MRENCLAVE and CONFIGID), so a verifier can simply reject/ignore that enclave. The key part for that to work, of course, are the steps in my proposal where the enclave verifies that the (hash of the) configid-manifest matches CONFIGID and then enforces the combined policy from both manifests as if they had been in a single manifest.

Another case might be just to separate the core graphene files (PAL, maybe libc) from the actual application files as this also could make devops and policy-management easier from a separation of concern perspective even in simple outsourcing use-cases.

For this you can just use hash of the .tar.gz of Graphene release which was included in the final image (which you need to have) and you'll have exactly the same security properties.

Hmm, but as a verifier of an attestation there is no obvious relation of the hash of the tar.gz and the measurement? That said, you can of course achieve the same compositional logic I have outlined without CONFIGID as well. Assuming the manifest is loaded and measured last into the EPC, you could essentially compute the state of the hash function leading up to MRENCLAVE; from that value and the manifest, a verifier could compute MRENCLAVE and verify based not on MRENCLAVE as ground truth but on this partial hash state and constraints on manifest values. (In fact, https://arxiv.org/abs/2008.09501v1 does that for slightly different reasons). It's just that with CONFIGID this is much simpler.

You can't reason about what Graphene binaries are running inside the enclave without knowing the application files, because they could overwrite Graphene. You either know all the binaries upfront or you can't really reason what's running inside, as Graphene doesn't have an internal security sandbox.

I think there is a disconnect here: in my proposal the verifier knows exactly all the filenames and all their hashes, just for some she might not care about what the actual content is as she trusts the application to do proper input validation. So for prakash there is no need for a sandbox. Of course, there could be other applications, e.g., where your main application is an interpreter such as with faas or for block chain applications like in Private Data Objects, where you would want the application to be a sandbox. But again, it's requirements on your applications, it's not something graphene has to provide.

Overall, I think it would be possible to add a very limited and restricted "measured mount" which would allow mounting a directory which would go into CONFIGID instead of MRENCLAVE, but that would require a lot of precautions and still would be very risky from security perspective. And this is already different from the design proposed in this issue, which assumed that data and code are separate and FS data can't take over code.

I disagree, in my proposal there is nothing which distinguishes data from code from a graphene level (and practically speaking there is not much Graphene can even do, given that any data can essentially be (interpreted) code). Of course, applications need to be super careful in how they write their manifest, but this is true even right now. I don't think anything in my proposal has fundamentally changed that, as the assurance provided in my proposal is equivalent to having all trusted files in a single manifest. Note that Graphene really gives you (by necessity) only the assurance that trusted files are integrity-protected; it cannot say anything about the "goodness" of a file (e.g., non-buggy code, non-malicious data). The trust in the latter will always have to come from the application.

@mkow
Member

mkow commented Mar 27, 2021

I think there is a disconnect here: in my proposal the verifier knows exactly all the filenames and all their hashes

Yup, I think this was the confusion - I assumed that the algorithm owner doesn't know what was added to the manifest, only the resulting CONFIGID. I guess in the future I'll just ask for detailed information and verification flows for such designs; they will be harder to misinterpret than long discussions :)

I disagree, in my proposal there is nothing which distinguishes data from code from a graphene level [...]

I agree with your disagreement :) My comment (as noted above) assumed that the algorithm owner doesn't know the added entries.

So, if we assume that the algorithm owner can verify all the entries in the manifest (original + added by data owner) then I think it's fine from security / security-foolproofness perspective.

But... If both parties know all the entries, why bother with the split at all? (that's why I assumed that the algorithm owner doesn't know the added entries, and thus followed with my security concerns) Is it because you want to commit the files after the enclave was signed? But then, you could just either sign it on the data owner side or sign it after commitment - after all MRSIGNER is not really useful in this scenario if we already verify MRENCLAVE (MRENCLAVE is "stronger" than MRSIGNER).

For this you can just use hash of the .tar.gz of Graphene release which was included in the final image (which you need to have) and you'll have exactly the same security properties.

Hmm, but as a verifier of an attestation there is no obvious relation of the hash of the tar.gz and the measurement? [...]

I'd say it's just a tooling problem, which is quite easy to solve (much easier than implementing CONFIGID support). As noted, you need the final image of the enclave to reason about anything, and if you have it, it's quite easy to verify whether it was generated from given Graphene binaries - for LibOS you check the hash of libsysdb.so in trusted files and for PAL you check the data in the PAL address range in the image.

And maybe some explanation from my side why I'm pushing so hard to really understand what you're trying to achieve:

  • I want to protect the security-critical part of our codebase from any unneeded complexity.
  • I don't want to include features which are hard to use securely (i.e. as much security as possible should be guaranteed on Graphene level, and not pushed to the end users).
  • I want to understand how and for what Graphene is used and misused, to design things in the future accordingly.

@g2flyer

g2flyer commented Mar 27, 2021

I'd say it's just a tooling problem,

This i completely agree (and have mentioned to Mona et al in the past) ...

which is quite easy to solve (much easier than implementing CONFIGID support).

... although that part is somewhat debatable. I don't think CONFIGID support would be really complicated, but it is certainly also true that all the code for a non-CONFIGID solution already exists somewhere in Graphene. All you would have to do is refactor and re-use it so that (a) the build returns the pre-manifest-load-into-EPC hash state (in addition to, or instead of, MRENCLAVE) and (b) there is a library with a function which computes MRENCLAVE given this pre-manifest-load-into-EPC hash state and a manifest as input. It does have the advantage that it wouldn't affect anything in the trusted part of Graphene (but on the other hand it seems a bit less intuitive for folks who are used to MRENCLAVE and know KSS/CONFIGID).
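
The idea in (a) and (b) can be illustrated with a deliberately simplified toy. A real MRENCLAVE is a SHA-256 over SGX-specific ECREATE/EADD/EEXTEND records rather than over raw bytes, so the functions below (hypothetical names, not Graphene tooling) only show the "resume from an intermediate hash state" mechanism, assuming the manifest is measured last.

```python
import hashlib


def measure_base(base_chunks):
    """(a) Hash everything that precedes the manifest and return the hash object.

    The returned object stands in for the pre-manifest hash state.
    """
    h = hashlib.sha256()
    for chunk in base_chunks:
        h.update(chunk)
    return h


def finish_measurement(pre_manifest_state, manifest: bytes) -> str:
    """(b) Compute the final measurement from the saved state plus a manifest.

    The state is copied first, so the same base state can be reused for any
    number of candidate manifests.
    """
    h = pre_manifest_state.copy()
    h.update(manifest)
    return h.hexdigest()
```

A verifier holding the base state could then call `finish_measurement(state, manifest)` for each manifest it is willing to accept and compare the result with the attested value. One practical caveat: Python's `hashlib` objects cannot be serialized, so a tool that ships the intermediate state to another party would need a SHA-256 implementation that exposes it.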

@mkow
Member

mkow commented Mar 27, 2021

So, do we agree that this whole issue is only about tooling, and KSS/CONFIGID doesn't give any more properties than we can have without it? So, the actual problem underneath this issue is "how can end users create enclaves compositionally"? (I'm not sure yet how to define precisely what "compositionally" means in this context, so that's open for input)

If so, then I think we should analyze potential solutions in two aspects:

  1. Usability/interface for the end users (which solutions will be easier to use, more intuitive, harder to use incorrectly, etc.).
  2. Implementation viability - how hard it is to implement and how it influences the codebase.

For 1. I think both solutions can be identical for the end users - whether we use CONFIGID or not, I wouldn't expect end users to know anything about it - they'll just be calling some CLI/library wrappers for this functionality.
For 2. as you said, the current codebase would be much happier with the non-CONFIGID solution, because it would be just a simple Python script, without the need to modify trusted code.

@g2flyer

g2flyer commented Mar 27, 2021

So, the actual problem underneath this issue is "how can end users create enclaves compositionally"?

It is rather the verification than the creation which is the issue.

@mkow
Member

mkow commented Mar 29, 2021

Isn't this the same thing actually? Building an enclave is just simulating loading of binaries to calculate hashes, and I think for verification you usually want to do the same (get the claimed Graphene binaries, get the specific app version, build and compare hash). One disclaimer: this assumes that the enclave building is reproducible, but I think it is in Graphene?

Also, moving one step back, what about my question about the algorithm owner knowing the added manifest entries? (but not the data) If they know them, then the enclave can be easily built with just concatenation of the two entries lists (after verifying the contents of the latter). What's the problem with doing it this way?

@g2flyer

g2flyer commented Mar 29, 2021

Isn't this the same thing actually? Building an enclave is just simulating loading of binaries to calculate hashes, and I think for verification you usually want to do the same (get the claimed Graphene binaries, get the specific app version, build and compare hash). One disclaimer: this assumes that the enclave building is reproducible, but I think it is in Graphene?

Reproducible builds are a somewhat orthogonal argument to the composition in verification I'm talking about, i.e., they relate to how you get confidence in the various binary code artifacts (PAL, lib, executable) vs. whether the artifacts plus other values together make a meaningful manifest. It's the latter which I think would be good to make composable and which would enable some divide-and-conquer approaches. Yes, with current tooling you could, if you have all binary artifacts and manifest inputs, compute MRENCLAVE using the existing Graphene tooling to verify whether an MRENCLAVE is "good" even though you didn't know all "ingredients" a priori. However, if some values are relatively dynamic, that doesn't really work well: (a) it is really all-or-nothing in terms of the "ingredients" you need to create an MRENCLAVE candidate to verify against during validation, which is neither efficient nor very usable if you think about this verification from a programming perspective; and (b) more importantly, it doesn't work if the verification happens inside an enclave, as the tooling doesn't support that. I think my proposal above does address this, certainly for the two use cases mentioned, in a clean and simple way, e.g., hiding all the gory details, many of which rely on Graphene internals that might change, behind a simple API which can be used during verification (i.e., my point (b)).

Also, moving one step back, what about my question about the algorithm owner knowing the added manifest entries? (but not the data) If they know them, then the enclave can be easily built with just concatenation of the two entries lists (after verifying the contents of the latter). What's the problem with doing it this way?

See above (essentially, easy automation/programmability and support inside enclaves for the case the "ingredients" are not all more or less static and defined in "human time")

@dimakuv

dimakuv commented Mar 30, 2021

@prakashngit @g2flyer @mkow What about a completely different approach?

With Graphene, we've been thinking of having a central entity that would simplify SGX remote attestation and secret provisioning. Currently, we don't have a good attestation story when there is a cluster of SGX enclaves working towards one goal. E.g., the Federated Learning case has a bunch of loosely coupled SGX enclaves (on different machines, probably in different data centers), and the SGX remote attestation becomes a deployment/verification nightmare.

So for a cluster of SGX enclaves, it makes sense to have a single entity that manages remote attestation and secret/key provisioning. The end users do not perform SGX attestation of separate enclaves but only perform SGX attestation and verification of a single "attestation service". This "attestation service" is bootstrapped with a policy file that contains all the measurements/policies for each of the SGX enclaves. Several centralized attestation services already exist (or have been announced): Microsoft Azure Attestation, SCONE CAS, Fortanix Confidential Computing Manager, etc.

And it seems that there is already an open-source attestation service that may fit well: Marblerun Coordinator from Edgeless Systems. See https://www.youtube.com/watch?v=e_7q1uOpCqw and https://github.com/edgelesssys/marblerun. In particular, Marblerun Coordinator is:

  1. Verifiable by end users (at least in theory, the Coordinator must run inside an SGX enclave, be independently audited, etc.)
  2. Bootstrapped with a policy file (called "marblerun manifest") -- this file is known to and agreed upon by all end users; this file contains enclaves' measurements and possibly other, more sophisticated rules
  3. Serves as a root CA -- the Coordinator generates X.509 certificates for SGX enclaves (after the initial SGX attestation of these enclaves), so that the end users may communicate with the enclaves without the need for SGX attestation

Now applying this "centralized attestation service" to our use case of Federated Learning:

  1. All members of the Federated Learning consortium (data owners, devops, ...) agree on a single policy file
  2. This policy file contains the "base" measurement (what we called the first, base Graphene manifest in the above discussions)
  3. This policy file also contains the sgx.trusted_files hashes (what we called the second, extended Graphene manifest) which are released to SGX enclaves upon successful SGX attestation
  4. When each data owner starts her SGX enclave and goes through SGX remote attestation with the Coordinator, the Coordinator sends back the secrets, including the sgx.trusted_files hashes (this all happens in the pre-main phase)
  5. The data owner's SGX enclave appends these sgx.trusted_files hashes to the already-loaded manifest (in enclave memory, no need to do anything with actual manifest files)
  6. Now there is no need for the second manifest, additional verifications from other members, CONFIGID, etc.
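
Steps 4 and 5 on the enclave side could look roughly like the sketch below. Everything here is hypothetical and for illustration only: the in-memory manifest is modeled as a plain Python dict, `append_trusted_hashes` is an invented helper rather than a Graphene API, and the hash values are placeholders. The sketch also refuses to overwrite entries that the base manifest already defines, so provisioned entries can only extend it.

```python
# Hypothetical sketch: merge Coordinator-provisioned hashes into the
# already-loaded (in-memory) manifest. Not an actual Graphene interface.

def append_trusted_hashes(manifest, provisioned):
    """Return a copy of `manifest` extended with `provisioned` trusted-file hashes.

    Raises ValueError if a provisioned entry would replace an existing one.
    """
    sgx = dict(manifest.get("sgx", {}))
    trusted = dict(sgx.get("trusted_checksum", {}))
    for name, digest in provisioned.items():
        if name in trusted:
            raise ValueError(f"refusing to overwrite trusted file entry: {name}")
        trusted[name] = digest
    sgx["trusted_checksum"] = trusted
    merged = dict(manifest)
    merged["sgx"] = sgx
    return merged


# Placeholder values; real digests would come from the agreed policy file.
base_manifest = {"sgx": {"trusted_checksum": {"train_py": "aa" * 32}}}
from_coordinator = {"data": "bb" * 32}  # received after successful attestation

loaded = append_trusted_hashes(base_manifest, from_coordinator)
```

The base manifest (and therefore MRENCLAVE) stays identical for all data owners; only the in-memory copy differs per enclave.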

I believe this Coordinator ("centralized attestation service") approach also solves the other problem of @prakashngit -- #2243 ("Sign the Cert by an external Authority rather than Self Signed"). Since the Coordinator serves as a root CA (or maybe an intermediate CA, if this is needed), there is no need to send RA-TLS-enhanced certificates to end users; instead, these can be normal classic X.509 certificates.

What do you think about this approach?

P.S. One obvious drawback of this approach is centralization -- now there is a Single Point of Failure: the Coordinator. On the other hand, it is much-much easier for deployment and attestation/verification.

@monavij

monavij commented Mar 30, 2021

Thanks a lot, Dmitrii, for summarizing this discussion. I think this approach of closely integrating Graphene with Marblerun can serve a number of use cases, even outside of K8s/service mesh deployments of confidential compute applications with Graphene. I am tagging Felix and Moritz from Edgeless Systems here as well: @flxflx and @m1ghtym0

@mkow
Member

mkow commented Mar 30, 2021

Let me answer @g2flyer first, then I'll comment on Dmitrii's and Mona's proposal in a separate comment.

Reproducible builds are a somewhat orthogonal argument to the composition in verification I'm talking about, i.e., they relate to how you gain confidence in the various binary code artifacts (PAL, lib, executable)

Sorry, I wasn't clear, by "reproducible build" in this context I meant only the final step, the enclave "assembling" from binaries and configs.

vs. whether the artifacts plus other values together make a meaningful manifest. It's the latter that I think would be good to make composable, to enable some divide-and-conquer approaches.

So, I meant +/- this part :) Although I'm not sure what you mean by "making a meaningful manifest" - do you want Graphene tooling to safety-check the interactions between the assembled components? Like checking whether the added manifest entries can overwrite Graphene, or something similar?
And this is the step that we need to assume is reproducible. It probably is, but we have never verified this thoroughly.

Yes, with the current tooling you could, if you have all binary artifacts and manifest inputs, compute MRENCLAVE using the existing Graphene tooling to verify whether an MRENCLAVE is "good" even though you didn't know all "ingredients" a priori. However, if some values are relatively dynamic, that doesn't really work well:

(a) It is all-or-nothing in terms of the "ingredients" you need to create an MRENCLAVE candidate to verify against during validation, which is neither efficient nor very usable if you think about this verification from a programming perspective

Can you give me an example where something less than all makes sense? In the design discussed in this thread it seems that both sides know all the entries in the final manifest (one just doesn't know the file contents, but they aren't directly part of MRENCLAVE).

What current tooling is for sure missing is a way to stop it from calculating the hashes for those trusted files which have hashes added manually to the manifest (but that would be a trivial change).

and (b) more importantly, it doesn't work if the verification happens inside an enclave, as the tooling doesn't support that. I think my proposal above does address this, certainly for the two use cases mentioned, in a clean and simple way, e.g., by hiding all the gory details (many of which rely on Graphene internals that might change) behind a simple API that can be used during verification (i.e., my point (b))

Good point. I think the verification tooling should then be written in no-stdlib C, so that you can use it wherever you want. As for your proposal, see below.

See above (essentially, easy automation/programmability and support inside enclaves for the case where the "ingredients" are not all more or less static and defined in "human time")

Hmm, but as I asked above in this very comment, could you show an example of a model/flow in which this makes sense? It seems to me that in the initial problem presented in this issue, both parties know all the data that goes directly into MRENCLAVE, so there's no need for these partial MRENCLAVEs at all (the idea itself sounds fine, although as with CONFIGID, I don't see a workflow/use case for them that doesn't reduce to a simple recalculation of MRENCLAVE over all the binaries).

@mkow
Member

mkow commented Mar 30, 2021

My comments on Dmitrii's and Mona's proposal below.

Sounds good, but I think it requires the few points that @g2flyer and I already agreed on, which are also needed for @prakashngit's use case:

  • Graphene enclave assembling (from already built binaries) must be reproducible.
  • It needs to be moved into a library which could be used from inside an enclave (I think you silently assumed this in your approach, as it seems to me that Coordinator has to be able to compute MRENCLAVE from binaries).
  • It has to support manually inserted trusted hashes.

Now applying this "centralized attestation service" to our use case of Federated Learning:

  1. All members of the Federated Learning consortium (data owners, devops, ...) agree on a single policy file

  2. This policy file contains the "base" measurement (what we called the first, base Graphene manifest in the above discussions)

  3. This policy file also contains the sgx.trusted_files hashes (what we called the second, extended Graphene manifest) which are released to SGX enclaves upon successful SGX attestation

  4. When each data owner starts her SGX enclave and goes through SGX remote attestation with the Coordinator, the Coordinator sends back the secrets, including the sgx.trusted_files hashes (this all happens in the pre-main phase)

  5. The data owner's SGX enclave appends these sgx.trusted_files hashes to the already-loaded manifest (in enclave memory, no need to do anything with actual manifest files)

  6. Now there is no need for the second manifest, additional verifications from other members, CONFIGID, etc.

I think I got lost here. In 3. you mention "what we called the second, extended Graphene manifest" as something which is known to the Coordinator, but is a secret for the Data Owner. But originally this was known to both sides (and that's why I couldn't understand the split of the manifest into two parts; I still don't understand why it would ever be needed).

Overall this proposal makes sense, but I think it solves a different problem: how to easily manage secret bootstrapping into enclaves at scale. The problem which @prakashngit had was that the contents of some trusted files can be unknown to the Algorithm Owner and the Coordinator, but I believe this whole issue reduces to just making our attestation tools work with file hashes instead of file data (right now they forcibly calculate all trusted hashes themselves).
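
A minimal sketch of what "work with file hashes instead of file data" could mean for the tooling: a pre-supplied hash is taken as-is, and only entries without one are hashed from disk. The function name `resolve_trusted_hashes` and the input layout are assumptions made for this example, not the existing Graphene signer interface, and SHA-256 is assumed as the hash function.

```python
import hashlib


def resolve_trusted_hashes(entries):
    """Map each trusted file name to a hex digest.

    `entries` maps a name to {"hash": <hex digest>} or {"path": <file path>}.
    A manually supplied hash wins; the file is only read when none is given.
    """
    resolved = {}
    for name, entry in entries.items():
        if "hash" in entry:
            # Hash inserted by hand (e.g. by a Data Owner): do not touch the file.
            resolved[name] = entry["hash"].lower()
        else:
            # No hash given: compute it from the file contents in chunks.
            digest = hashlib.sha256()
            with open(entry["path"], "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            resolved[name] = digest.hexdigest()
    return resolved
```

With something along these lines, a party that only knows a data file's hash could still assemble the full list of trusted hashes without access to the data itself.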

@monavij

monavij commented Mar 31, 2021

@mkow Dmitrii used the term "secret" only in the sense that those hashes will be passed to the enclave over a secure channel during the phase after attestation that is typically called secret provisioning. We need to get those hashes inside the enclave (the old proposal suggested using CONFIGID). We are just proposing to send them over a secure channel, since those hashes do need to be registered with the Coordinator; both sides know them, and they are not secret to the enclave.

I also agree that our initial idea is just to move towards this overall approach to easily manage attestation and secret bootstrapping. Our general impression is that it will address both of the requests that Prakash has for the FL use case, via various policies exposed by the Coordinator. But we need to work through those scenarios in a bit more detail.

@dimakuv

dimakuv commented Mar 31, 2021

  • Graphene enclave assembling (from already built binaries) must be reproducible.

Yes. I think it is. By "verification" of this, do you mean just a thorough code review of our sgx_framework.c file?

  • It needs to be moved into a library which could be used from inside an enclave (I think you silently assumed this in your approach, as it seems to me that Coordinator has to be able to compute MRENCLAVE from binaries).

I don't see a need for this with the Coordinator approach:

  • All Graphene SGX enclaves are built normally. In the Federated Learning case, these enclaves consist of: Graphene libs, Federated Learning base code, training algo code. In other words, all enclaves have the same MRENCLAVE, known to all parties.
  • The Coordinator has its own "configuration file". This file is largely just [ExpectedMeasurementEnclaveA = <this common MRENCLAVE value>, ExpectedMeasurementEnclaveB = <this common MRENCLAVE value>, ...].
  • The Coordinator's configuration file also has something like: [EmitStringToEnclaveAIfSuccessfulAttestation = "sgx.trusted_checksum.data = <dataForEnclaveA hash>", EmitStringToEnclaveBIfSuccessfulAttestation = "sgx.trusted_checksum.data = <dataForEnclaveB hash>", ...].
  • This configuration file is known to all Federated Learning members. The members can verify the correctness of the Coordinator and the associated configuration file via SGX remote attestation.
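
As a rough illustration of such a configuration file and how the Coordinator might act on it, here is a sketch. The dict layout, the `release_entries` helper, and the placeholder measurement are invented for this example and do not reflect Marblerun's actual manifest format; in a real deployment the attested measurement would come from a verified SGX quote.

```python
# Hypothetical Coordinator configuration: one shared MRENCLAVE, per-enclave extras.
COMMON_MRENCLAVE = "11" * 32  # placeholder, not a real measurement

CONFIG = {
    "EnclaveA": {
        "expected_mrenclave": COMMON_MRENCLAVE,
        "emit": 'sgx.trusted_checksum.data = "<dataForEnclaveA hash>"',
    },
    "EnclaveB": {
        "expected_mrenclave": COMMON_MRENCLAVE,
        "emit": 'sgx.trusted_checksum.data = "<dataForEnclaveB hash>"',
    },
}


def release_entries(config, enclave_name, attested_mrenclave):
    """Return the manifest line for `enclave_name`, or None if attestation fails.

    `attested_mrenclave` is assumed to have been extracted from an already
    verified SGX quote; quote verification itself is out of scope here.
    """
    policy = config.get(enclave_name)
    if policy is None or attested_mrenclave != policy["expected_mrenclave"]:
        return None
    return policy["emit"]
```

For example, `release_entries(CONFIG, "EnclaveA", COMMON_MRENCLAVE)` returns the line for Enclave A, while an unknown enclave name or a mismatching measurement yields `None`.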

...But in general yes, it would be nice to have a standalone nostdlib C library that calculates MRENCLAVE and other measurements in Graphene. Currently we don't have this in our sgx_framework.c file -- we only have it in our python/graphenelibos scripts.

  • It has to support manually inserted trusted hashes.

Again, I propose a slightly different change to Graphene -- Graphene must be able to augment the in-enclave TOML representation of the manifest with additional entries (like sgx.trusted_checksum) using a new interface, e.g. appending to the /dev/attestation/more_manifest_entries pseudo-file.

...But in general yes, this is a good feature to have and also trivial to add.

TLDR: With the Coordinator approach, we don't have a "second extra manifest" but instead have a "Coordinator configuration file" which contains all these extra details and emits them to Graphene instances upon startup.

In 3. you mention "what we called the second, extended Graphene manifest" as something which is known to the Coordinator, but is a secret for the Data Owner.

My wording was bad probably. I shouldn't have called it a "secret", it's just an extra piece of information for the base manifest. There is no secret in sgx.trusted_checksum, it is known to everyone -- all Data Owners as well as the Coordinator.

@flxflx

flxflx commented Apr 1, 2021

The approach you sketched makes sense to me @dimakuv. I think it is a good use case for Marblerun. In general, we're happy to help with integrating Graphene and Marblerun.

P.S. One obvious drawback of this approach is centralization -- now there is a Single Point of Failure: the Coordinator. On the other hand, it is much-much easier for deployment and attestation/verification.

True. To reduce the risk, we plan to replicate the Coordinator using Raft (via etcd) in the future.

@monavij

monavij commented Apr 1, 2021

@flxflx Great to hear that you are going to look at Raft support for the Coordinator. That is definitely an area where we can work together.

One other idea I have is for Marblerun to explore integrating with Azure MAA and Azure Key Vault, and to use the attestation/key provisioning services provided by Azure instead of Marblerun's own. This way, anyone integrating with Marblerun will also automatically get integration with Azure. What are your thoughts on that?

@flxflx

flxflx commented Apr 1, 2021

That's an interesting thought. We have MAA support on the roadmap but haven't given much thought to AKV. The primary reason is that AKV seems to rely on HSMs without RA. Thus, establishing a secure channel (in the CC sense) between AKV and enclaves doesn't seem possible. But I may be mistaken here.

@mkow
Member

mkow commented Apr 5, 2021

@dimakuv:

  • Graphene enclave assembling (from already built binaries) must be reproducible.

Yes. I think it is. By "verification" of this, do you mean just a thorough code review of our sgx_framework.c file?

Yes. But the fact that the C loader and the Python signer give the same measurements is a bit reassuring here.

  • It needs to be moved into a library which could be used from inside an enclave (I think you silently assumed this in your approach, as it seems to me that Coordinator has to be able to compute MRENCLAVE from binaries).

I don't see a need for this with the Coordinator approach:

  • All Graphene SGX enclaves are built normally. In the Federated Learning case, these enclaves consist of: Graphene libs, Federated Learning base code, training algo code. In other words, all enclaves have the same MRENCLAVE, known to all parties.
  • The Coordinator has its own "configuration file". This file is largely just [ExpectedMeasurementEnclaveA = <this common MRENCLAVE value>, ExpectedMeasurementEnclaveB = <this common MRENCLAVE value>, ...].
  • The Coordinator's configuration file also has something like: [EmitStringToEnclaveAIfSuccessfulAttestation = "sgx.trusted_checksum.data = <dataForEnclaveA hash>", EmitStringToEnclaveBIfSuccessfulAttestation = "sgx.trusted_checksum.data = <dataForEnclaveB hash>", ...].
  • This configuration file is known to all Federated Learning members. The members can verify the correctness of the Coordinator and the associated configuration file via SGX remote attestation.

...But in general yes, it would be nice to have a standalone nostdlib C library that calculates MRENCLAVE and other measurements in Graphene. Currently we don't have this in our sgx_framework.c file -- we only have it in our python/graphenelibos scripts.

OK, now that I've re-read everything, I see that it's not directly needed in this approach, as the Coordinator doesn't need to calculate anything at runtime. But on the other hand (and a bit off-topic to this whole discussion), this approach assumes that "all enclaves have the same MRENCLAVE, known to all parties" - in practice, all the parties will need a tool to actually verify this MRENCLAVE (i.e., take the binary blobs of the claimed software and verify that they build into the same enclave hash).

In 3. you mention "what we called the second, extended Graphene manifest" as something which is known to the Coordinator, but is a secret for the Data Owner.

My wording was bad probably. I shouldn't have called it a "secret", it's just an extra piece of information for the base manifest. There is no secret in sgx.trusted_checksum, it is known to everyone -- all Data Owners as well as the Coordinator.

Yup, now I get it, this "secret" misled me. Overall, this approach seems to me like circumventing the inability to easily calculate the updated MRENCLAVE - we just push some of the variables to runtime, to not have to calculate the updated MRENCLAVE, right?

If that's the case, wouldn't better tooling on our side solve this problem at the core? (as I proposed above) It wouldn't require any changes to Graphene internals (especially no new APIs) and I think it would solve all the problems at once. And it's still compatible with Marblerun, the difference would be that all the parties would calculate the MRENCLAVE with the appended manifest entries and use this as the expected MRENCLAVE.

@dimakuv

dimakuv commented Apr 12, 2021

Overall, this approach seems to me like circumventing the inability to easily calculate the updated MRENCLAVE - we just push some of the variables to runtime, to not have to calculate the updated MRENCLAVE, right?

Yes, this is correct.

If that's the case, wouldn't better tooling on our side solve this problem at the core? (as I proposed above) It wouldn't require any changes to Graphene internals (especially no new APIs) and I think it would solve all the problems at once.

Yes, your proposed approach also works.

@mkow
Member

mkow commented Sep 10, 2021

The valid part of this issue will be resolved when we implement gramineproject/gramine#11. I'm closing this one to keep the discussion in one place.

mkow closed this as completed on Sep 10, 2021