Encrypted blob store repository #50846
Conversation
FYI I have made the following changes to the main code, besides the changes suggested above in the review comments:
Thanks @albertzaharovits, I found a few more things (still not a full review pass I'm afraid, but I'll get to it shortly :)). The build/license situation is a problem though, I think, and it's probably good to address/refactor it asap as this may be a little tricky to get right.
for (int i = 0; i < 32; i++) {
    names.add("test-repo-" + i);
}
repositoryNames = Collections.synchronizedList(names);
Does this list have to be synchronized? No, right?
It does, because the test methods in the class can run in parallel and there should not be two tests that use the same repository.
Nah I don't think we can have two test methods from the same class run concurrently in the same JVM. How would that work with the single static reference to the internal node/cluster and its cleanup after every test?
Makes sense, I don't know what made me believe otherwise.
@@ -54,6 +54,8 @@ dependencies {
compile 'com.google.apis:google-api-services-storage:v1-rev20190426-1.28.0'

testCompile project(':test:fixtures:gcs-fixture')
// required by the test for the encrypted GCS repository
testCompile project(path: ':x-pack:plugin:repository-encrypted', configuration: 'testArtifacts')
I'm not the right person to judge this, but this pattern in the build seems troubling to me. We are now using non-Apache-licensed code in the tests of Apache-licensed code. Is that OK? Maybe it's better to run these tests in a separate module downstream?
You are definitely using this dependency in Apache-licensed files; this seems wrong.
We discussed this synchronously, and the tests requiring this might not be relevant anymore, but I'm presenting my counter-argument for posterity.
I see your point, and I hadn't looked at it like this. I had minor qualms about having tests for non-Apache-licensed code be Apache licensed. The fact that we pull non-Apache test artifacts is a slightly different point.
But I tend to dismiss the Apache purists who get offended if the tests require an artifact dependency which is not Apache licensed. It feels squarely like a mean-spirited exaggeration; if we don't publish the tests jar, they should not even notice.
When I went down this path, I worried that the reverse alternative (having the encrypted repo depend in tests on the cloud repository) would entail more project setup boilerplate. I'm not so sure about that right now, and I think it is beneficial to keep these tests inside the encrypted-repo module, for subjective reasons of code organization.
// blobs are larger after encryption
Map<String, BlobMetaData> blobsWithSizeAfterEncryption = new HashMap<>();
blobs.forEach((name, meta) -> {
    blobsWithSizeAfterEncryption.put(name, new BlobMetaData() {
NIT: Probably shorter to just do:
new PlainBlobMetaData(meta.name(), EncryptedRepository.getEncryptedBlobByteLength(meta.length()));
here :)
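For context, a hedged sketch of how the suggested one-liner could slot into the loop from the diff above; `PlainBlobMetaData` and `EncryptedRepository.getEncryptedBlobByteLength` are the names referenced in the comment, so their exact signatures are assumptions here.

```java
// Sketch only: replaces the anonymous BlobMetaData with the suggested PlainBlobMetaData,
// assuming getEncryptedBlobByteLength(long) returns the blob size after encryption.
Map<String, BlobMetaData> blobsWithSizeAfterEncryption = new HashMap<>();
blobs.forEach((name, meta) -> blobsWithSizeAfterEncryption.put(
    name,
    new PlainBlobMetaData(meta.name(), EncryptedRepository.getEncryptedBlobByteLength(meta.length()))));
```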
Alright finally had some quiet time to go over this in-depth :)
I think the main remaining point for me in production code is the delete efficiency. We shouldn't be adding new LIST operations to bulk deletes unless we absolutely can't avoid them. I put some more details on this in-line. But what it boils down to from my perspective is this:
We have blobs that can be overwritten and for those we need the meta-id at the beginning of the blob which makes deletes and reads a little tricky since we don't know the meta path straight from the blob itself.
The vast majority of blobs though (everything but `meta-` and `snap-`) are never overwritten, and for those we don't need the meta-id magic. They can be deleted as efficiently as any other blob (or at least half as efficiently) because we could technically make the naming of the metadata fixed for them.
I know we already have the `index.latest` special case, but right now special-casing `snap-` and `meta-` blobs looks like the only way of achieving happiness here. We can then work on removing the possibility of overwriting these blobs from the core blob-store implementation (`meta-` is already done pretty much and `snap-` is easy) and get rid of the complication with the meta id eventually.
final InputStream encryptedDataInputStream = delegatedBlobContainer.readBlob(blobName);
try {
    // read the metadata identifier (fixed length) which is prepended to the encrypted blob
    final byte[] metaId = encryptedDataInputStream.readNBytes(MetadataIdentifier.byteLength());
NIT: maybe just pre-allocate the array and read into it since this API can't be backported to 7.x
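A minimal sketch of the pre-allocate-and-read variant mentioned in the NIT, using only the plain `InputStream.read(byte[], int, int)` API; the surrounding names come from the diff above, everything else is an assumption.

```java
// Sketch: read exactly MetadataIdentifier.byteLength() bytes into a pre-allocated array,
// failing if the stream ends early (readNBytes(int) is a Java 11 API).
final byte[] metaId = new byte[MetadataIdentifier.byteLength()];
int offset = 0;
while (offset < metaId.length) {
    final int read = encryptedDataInputStream.read(metaId, offset, metaId.length - offset);
    if (read == -1) {
        throw new IOException("Premature end of blob while reading the encrypted blob metadata identifier");
    }
    offset += read;
}
```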
// read the metadata identifier (fixed length) which is prepended to the encrypted blob
final byte[] metaId = encryptedDataInputStream.readNBytes(MetadataIdentifier.byteLength());
if (metaId.length != MetadataIdentifier.byteLength()) {
    throw new IOException("Failure to read encrypted blob metadata identifier");
Maybe make this `<` and make it clear in the exception message that the blob is corrupted/too short?
(Maybe we could technically get here when mounting an unencrypted repository? Should we mention that possibility in the exception message?)
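A hedged sketch of what the suggested check and message could look like, reusing the names from the diff above; the exact wording is illustrative only.

```java
// Sketch: treat a short read as a corrupted/truncated blob (or a blob that was never
// written by the encrypted repository, e.g. when mounting an unencrypted repository).
if (metaId.length < MetadataIdentifier.byteLength()) {
    throw new IOException("Blob [" + blobName + "] is too short to contain the encryption metadata identifier; "
            + "it is either corrupted or was not written by an encrypted repository");
}
```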
// find all the blob names that must be deleted
Set<String> blobNamesSet = new HashSet<>(blobNames);
Set<String> blobNamesToDelete = new HashSet<>();
for (String existingBlobName : delegatedBlobContainer.listBlobs().keySet()) {
This we shouldn't do. We should just optimistically delete the blobs and assume they're all there. It's really costly to run this listing, and it may not be accurate on S3, which would cause you to not even try to delete a certain blob if it doesn't show up in the listing, I think?
> We should just optimistically delete the blobs and assume they're all there.

There is a problem with this approach:
For every existing data blob there must be a metadata blob already visible, because metadata blobs are written first. For every blob name, you must delete both the meta and the data blobs. But because the meta blobs are written first, it's possible that a `deleteBlob` operation deletes just the meta and fails to delete the data, because the data is not visible yet and might appear later.
This approach makes sure that you only delete the meta of existing data blobs, so that no corruption is possible. Am I right?

> it may not be accurate on S3 and would cause you to not try and delete a certain blob if it doesn't show up in the listing

I believe that the limitation of deleting only the blobs you can list is OK.
Let's discuss, maybe we can come up with a more performant alternative.
// Deleting metadata when there are multiple metadata blobs for the same blob is unsafe, so don't try it now.
// It is unsafe because metadata "appears" before the data and there could be an overwrite in progress for which only
// the metadata, but not the encrypted data, shows up.
List<String> metadataBlobNamesToDelete = new ArrayList<>(blobNamesToMetadataNamesToDelete.size());
This doesn't matter IMO. On e.g. S3 we have no ordering guarantees at all anyway so we should just delete everything here. There is no point in trying to protect ourselves against concurrency issues at this layer. If the caller wants a certain blob name gone, the fact that someone may be writing this very same blob name right now is never an issue due to the repository consistency measures we have in place for blob naming.
The code here is mostly to protect against the write-after-delete cases.
I understand this is not a concern because of the code higher up, but I would still keep this protection.
I don't think discounting that case buys us anything with respect to efficiency of cloud API calls.
Map<String, List<String>> blobNamesToMetadataNamesToDelete = new HashMap<>(blobNamesToDelete.size());
Set<String> allMetadataBlobNames = new HashSet<>();
try {
    allMetadataBlobNames = encryptionMetadataBlobContainer.listBlobs().keySet();
This could become an issue as it makes deleting very expensive for large paths.
I think we must find a cleaner way of dealing with this. One solution I see would be to not use the UUID-based metadata naming for all blobs. We could just do it for `snap-` and `meta-` blobs (these are the only ones that can technically get overwritten). I know it's another special case, but this would allow us to simply delete all other blobs' metadata without listing the metadata path. For the `snap-` and `meta-` blobs we can then do a listing including the prefix in the meta directory, and it's effectively always a small listing.
WDYT?
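A hypothetical sketch of the special-casing discussed above: non-overwritable blobs get a fixed, derivable metadata name (so their metadata can be deleted without any listing), while `snap-` and `meta-` blobs keep a per-write unique suffix. The naming format itself is an assumption for illustration, not the PR's code.

```java
// Sketch only: derive the metadata blob name from the data blob name.
static String metadataBlobName(String blobName, String metaId) {
    if (blobName.startsWith("snap-") || blobName.startsWith("meta-")) {
        // overwritable blobs: the per-write id must be resolved from the data blob itself,
        // but deletes only need a small prefix listing in the metadata container
        return blobName + "." + metaId;
    }
    // never-overwritten blobs: fixed mapping, so deletes require no listing at all
    return blobName;
}
```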
Given some of the concerns enunciated above by Armin (i.e. the performance problem of deleting metadata), as well as other issues related to overwriting blobs, I've questioned again the mapping relationship of metadata to encrypted data. I think I've found a simpler and more efficient way to do it, and after discussing it with Armin, I've got his endorsement as well!

The idea is to reuse the data encryption key for as many blobs as possible, not only a single one as it stands right now. In this case, on every node, the repository lazily generates a random DEK upon a blob write. It then reuses the key, with monotonically increasing IVs, for every subsequent blob write.

The only drawback of this approach is tracing the inverse relationship from metadata to blobs. But it's expected that there will be few metadata blobs (DEKs) for the entire lifetime of the repository.

@tvernum Please let me know what you think about this approach. I'm very excited to start implementing it ASAP.
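A minimal, hypothetical sketch of the DEK-reuse idea described above: one randomly generated AES-256 key per node, reused across blob writes with a monotonically increasing IV counter so that no (key, IV) pair is ever repeated. Class and method names are illustrative, not the PR's implementation.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

final class ReusedDek {
    private final SecretKey dek;
    private final AtomicLong ivCounter = new AtomicLong();

    ReusedDek() throws Exception {
        // one randomly generated AES-256 data encryption key, reused for many blob writes
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        this.dek = keyGen.generateKey();
    }

    // each blob write gets a fresh cipher whose 96-bit nonce encodes the next counter value,
    // so the same (key, IV) pair is never reused as long as the counter does not wrap around
    Cipher newEncryptionCipher() throws Exception {
        long counter = ivCounter.getAndIncrement();
        byte[] iv = ByteBuffer.allocate(12).putLong(4, counter).array();
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dek, new GCMParameterSpec(128, iv));
        return cipher;
    }
}
```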
The relationship between the encrypted data and the encryption key is obviously critically important to the design of the encrypted repository. There are many constraints on the design, which cannot be discussed at length in a GH comment, but I believe briefly summarizing them is necessary to frame the thinking around this topic. In no particular order:
@tvernum and I brainstormed on the improvement options over #50846 (comment) for a way to track the blob containers that a key was used for (given the many blobs that can be encrypted with the same key, tracking blob containers is a better trade-off, especially because cleanup also works at the "indices" blob container level). But we've agreed that an imperfect cleanup is something we can improve on later, possibly even acknowledging it in the docs.

One breakthrough observation Tim made is that the whole indirection story with the data encryption key (not using the repository password to generate the encryption key directly) exists to give a better key rotation experience, where you don't have to decrypt and re-encrypt the whole data. But the key rotation facilitated by the DEK indirection is itself an imperfect solution. That's because if the password was compromised before it was rotated, the attacker can decrypt the DEKs and will still have access to the data even after the rotation. Even though new data (new snapshots) will not be compromised, and the compromised data might already be on the attacker's server, the user might still be surprised that the attacker can still decrypt, after the rotation, the old data from before the key rotation. So decrypting and re-encrypting the whole data once in a while, at large intervals of time (months to years), by recreating the repository with the new password, might still be desirable, at which point accumulated encryption metadata garbage might cease to be a problem.
After the above discussion, I got the idea of storing the encryption keys several times, creating redundant copies of the same key, once for every blob container (containing blobs that are encrypted by said key). If the key copy is removed every time a blob container is removed, then all the key copies will be automatically removed when all the containers using the key are removed. This redundancy might prove reasonable with respect to space and blob count overhead, but might not work for key rotation. I need to think through it a bit more.
Another consequence of this (that we should track somewhere) is that password rotation should also trigger new DEKs, so that we can guarantee that new snapshots will not be accessible by an attacker who previously exfiltrated the password and used it to retrieve the existing DEKs.
I have a rough plan for the strategy to reuse DEKs.

As discussed, the DEKs will be reused locally by the node. A new DEK is generated upon repository creation, on the local node, and used for all the blobs encrypted under the repository. The DEK continues to be used for the lifetime of the repository. It can however be discontinued, and a new one generated, in the following cases: the IV space is exhausted, the usage log exceeds an internal size bound, or the key has been used for more than a maximum allowed lifetime, measured in repository generations; the last two conditions are discussed later on.

A snapshot references files across multiple nodes. Therefore, to facilitate the restore operation, the encrypted blob indicates which DEK must be used to decrypt it. This is achieved by prepending the name of the DEK (a UUID) to the encrypted content of the blob. Under normal operating circumstances there will be a small number of active DEKs, one for each data and master node, which are reused for the lifetime of the nodes (the maximum number of encryptions with the same key is 2^32, or even 2^64).

The inverse association, i.e. given a key, trace back the blobs encrypted with it, is only required during the cleanup operation, so that the keys used for blobs no longer referenced by the existing snapshots are removed as well. The following explains how this association is maintained and which trade-offs have been assumed. Given the potentially large number of blobs, the very small size of a key, and the existing resolution of the cleanup procedure (which only removes unreferenced root blobs and unreferenced indices blob containers), the association would actually record the index ids for every DEK. The cleanup uses this information to remove the DEKs which have only touched indices which are not referenced anymore. More precisely, a given key is eligible for cleanup if its usage log does not contain "live" indices.

But there is an important corner case. The view of the usage log might not be complete during a cleanup, because the node terminated abruptly (or the connection to the cloud service failed) and because of the dreaded "eventual consistency" issue. To circumvent this, the DEK has a maximum lifetime in terms of repository generations, and the usage log is considered complete only after the DEK lifetime plus some extra interval has been exceeded.
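A small, hedged sketch of the blob layout described in the plan above: the DEK's name (a UUID) is prepended to the encrypted content so that restore can look up the right key. The helper class and its framing are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class DekTaggedBlob {
    // Sketch only: prepend the fixed-length DEK UUID (36 ASCII chars) to the encrypted bytes;
    // a reader strips the first 36 bytes to learn which DEK decrypts the rest.
    static byte[] withDekId(String dekUuid, byte[] encryptedContent) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(36 + encryptedContent.length);
        out.write(dekUuid.getBytes(StandardCharsets.US_ASCII));
        out.write(encryptedContent);
        return out.toByteArray();
    }

    static String dekIdOf(byte[] blobBytes) {
        return new String(blobBytes, 0, 36, StandardCharsets.US_ASCII);
    }
}
```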
@albertzaharovits the plan above sounds fine to me. I wonder though, is there any strong argument at this point against simply never deleting any metadata/keys? The storage cost seems trivial, and I still don't see any other argument for not just keeping these around forever; the above plan thus seems to me like it would create non-trivial complexity in the implementation for no tangible benefit?
I think it would be reasonable to accept that we don't delete DEKs at this point in time. However, much of the plan above is needed even in that case, so to make sure we're all on the same page when considering "work that could be avoided" if we never delete DEKs, I believe the rough outline of what it would mean is:

We obviously need to refresh DEKs under some circumstances, due to the IV space. I think for good security measures we also ought to rotate them after some length of time (which can be measured by repo generations rather than clock time). And we want them to be in-memory only and tied to a single repo, so they also need to be regenerated on repository creation and node startup.

We would therefore avoid writing out the usage log whenever it is added to, since there would be no usage log. We would clearly also skip the cleanup steps, since we would have no usage log to compare against.

Of the above, the cleanup steps seem like the larger amount/complexity of work and the usage log seems relatively small/simple. I would propose that we implement the usage log if we have time, but don't implement the cleanup steps. That feels like it saves most of the complexity of the implementation, but still leaves open the opportunity to implement it in the future if we find that our analysis of the overhead of orphaned DEKs was incorrect.
@tvernum @original-brownbear I appreciate your input and thinking. Thank you!

I will be raising a different PR, asap, analogous to this one (this one has too much complexity fluff associated with the one-to-one key to data mapping, and it could be useful for reference as it stands).

This conversation made me think about another way of reusing DEKs. Please spend some thought on it as well; it has interesting trade-offs compared to the universal DEK reuse strategy we've been discussing.

In a similar way, the DEKs are generated randomly and managed locally on the node (the same strategy: generate them when the repository starts, keep them in memory only, store them on the cloud metadata container encrypted with the repo password, and discontinue them when the IV space is exhausted or the node shuts down). The difference is that the node does not reuse DEKs across indices. More precisely, when DEKs are bound to be used only for blobs of a specific index id (which can be part of many snapshots), the name of the DEK blob can be made to contain that index's id.

In this case, the cleanup is very simple and efficient. This is the main advantage of this design. The cleanup already computes the set of referenced "live" index ids, and the encrypted repo's cleanup, in addition, does a listing of the DEKs container and deletes the DEK blobs that, given their name, are not used for live indices. The only other blobs which are not part of an index are stored in the root container. The DEKs used to encrypt those have a different naming scheme and are cleaned up the hard way (read each live blob, identify the DEK) or not at all (they are few, and only "one" DEK is used at a time to encrypt them, because only the master writes to the root container).

This alternative design is comparable in implementation effort to the usage log option, because it is easy to differentiate which blobs are written under an index id (this writing is done through a separate blob container reference). The main drawback is that this generates more DEKs: one for every new index (worst case), even the smallest one. I'm not too worried about it though; the DEK generation burden is shared among the cluster nodes, and DEKs can also be generated and pooled in a background thread. The larger number of DEKs also increases the cost of password rotation, although the comparison against the universally reused DEKs option is difficult to make.

To me, this sounds like a better alternative to both universally reusing DEKs and reusing DEKs with a usage log. But it might just be tickling the engineer in me who doesn't like "leaking" resources, rather than adding practical benefits. Please let me know your thoughts!
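To make the per-index DEK alternative above concrete, here is a hypothetical sketch of the cleanup step it enables; the `<indexId>.<dekUuid>` naming and the functional parameters are assumptions for illustration, not actual repository code.

```java
import java.util.Set;
import java.util.function.Consumer;

final class IndexDekCleanup {
    // Sketch only: delete DEK blobs whose name points at an index id that is no longer live.
    static void cleanUp(Set<String> dekBlobNames, Set<String> liveIndexIds, Consumer<String> deleteBlob) {
        for (String dekBlobName : dekBlobNames) {
            int dot = dekBlobName.lastIndexOf('.');
            if (dot <= 0) {
                continue; // root-container DEKs use a different naming scheme and are not handled here
            }
            String indexId = dekBlobName.substring(0, dot);
            if (liveIndexIds.contains(indexId) == false) {
                deleteBlob.accept(dekBlobName); // this DEK only encrypted blobs of indices that no longer exist
            }
        }
    }
}
```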
We've migrated away from this approach to the one from #53352, hence closing.
The client-side encrypted repository is a new type of snapshot repository that internally delegates to the regular variants of snapshot repositories (of types Azure, S3, GCS, FS, and maybe others, but not yet tested). After the encrypted repository is set up, it is transparent to the snapshot and restore APIs (i.e. all snapshots stored in the encrypted repository are encrypted, no other parameters required). The encrypted repository is protected by a password stored in every node's keystore (which must be the same across the nodes). The password is used to generate a key encryption key (KEK), using the PBKDF2 function, which is used to encrypt (using the AES Wrap algorithm) other symmetric keys (referred to as DEKs, data encryption keys), which themselves are generated randomly and which are ultimately used to encrypt the snapshot blobs.

For example, here is how to set up an encrypted FS repository:

1) Make sure that the cluster runs under at least a "platinum" license (the simplest test configuration is to put `xpack.license.self_generated.type: "trial"` in the elasticsearch.yml file).
2) Identical to the un-encrypted FS repository, specify the mount point of the shared FS in the elasticsearch.yml conf file (on all the cluster nodes), e.g. `path.repo: ["/tmp/repo"]`.
3) Store the repository password inside the elasticsearch.keystore, *on every cluster node*. In order to support changing the password of an existing repository (implemented in a follow-up), the password itself must be named, e.g. for the "test_enc_pass" repository password name: `./bin/elasticsearch-keystore add repository.encrypted.test_enc_pass.password` (type in the password).
4) Start up the cluster and create the new encrypted FS repository, named "test_enc", by calling: `curl -X PUT "localhost:9200/_snapshot/test_enc?pretty" -H 'Content-Type: application/json' -d' { "type": "encrypted", "settings": { "location": "/tmp/repo/enc", "delegate_type": "fs", "password_name": "test_enc_pass" } }'`
5) The snapshot and restore APIs work unmodified when they refer to this new repository, e.g. `curl -X PUT "localhost:9200/_snapshot/test_enc/snapshot_1?wait_for_completion=true"`.

Related: #49896 #41910 #50846 #48221 #65768
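A hedged sketch of the password-to-key scheme summarized above (a PBKDF2-derived KEK wrapping randomly generated DEKs with AES Wrap). The salt handling, iteration count, and hash function are illustrative assumptions, not the actual parameters used by the repository.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

final class KekDekSketch {
    // Sketch only: derive a KEK from the repository password and wrap a fresh DEK with it.
    static byte[] generateAndWrapDek(char[] repositoryPassword, byte[] salt) throws Exception {
        // derive the key encryption key (KEK) from the repository password via PBKDF2
        SecretKeyFactory pbkdf2 = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512");
        byte[] kekBytes = pbkdf2.generateSecret(new PBEKeySpec(repositoryPassword, salt, 10_000, 256)).getEncoded();
        SecretKey kek = new SecretKeySpec(kekBytes, "AES");
        // generate a random data encryption key (DEK) and wrap (encrypt) it with the KEK
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256, new SecureRandom());
        SecretKey dek = keyGen.generateKey();
        Cipher aesWrap = Cipher.getInstance("AESWrap");
        aesWrap.init(Cipher.WRAP_MODE, kek);
        return aesWrap.wrap(dek); // the wrapped DEK is what gets stored in the repository
    }
}
```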
This builds upon the data encryption streams from #49896 to create an encrypted snapshot repository. The repository encryption works with the following existing repository types: FS, Azure, S3, GCS. The encrypted repository is password protected. The `platinum` license is required to snapshot to the encrypted repository, but no license is required to list or restore already encrypted snapshots.

Example how to use the encrypted FS repository: configure `path.repo: ["/tmp/repo"]` and create the `test_enc` repository.

Overview of how it works

Every data blob is encrypted (AES/GCM) independently with a randomly generated AES256 secret key. The key is stored in another metadata blob, which is itself encrypted (AES/GCM) with a key derived from the repository password. The metadata blob tree structure mimics the data blob tree structure, but it is rooted by the fixed blob container `encryption-metadata`.

I will detail more how each piece works by commenting in the source code.
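A tiny hypothetical illustration of the metadata layout described in the overview above: the metadata tree mirrors the data tree under the fixed `encryption-metadata` container. The exact path shapes are assumptions for illustration, not the PR's exact layout.

```java
// Sketch only (illustrative path shapes): the metadata blob for a data blob lives at the
// same relative path, but rooted under the fixed "encryption-metadata" container.
static String metadataPathFor(String dataBlobPath) {
    // e.g. "indices/ab12/0/__some-blob" -> "encryption-metadata/indices/ab12/0/__some-blob"
    return "encryption-metadata/" + dataBlobPath;
}
```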
Relates #49896
Relates #41910
Obsoletes #48221