-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encryption for chunks #182
base: master
Are you sure you want to change the base?
Conversation
Open questions:
|
With respect to naming convention, my own preference would be to support having a key identifier in the name for encrypted chunks. Whether that's a few bytes out of a hash of the key or a user-specified opaque value is something that would require careful design consideration (the former is an information leak, the latter depends on users to do the right thing, so neither is ideal; this is one of those places where cryptosystems that generate handles for keys alongside those keys -- or use assymetric encryption and can take the handle from a hash of the public side of the key -- are making lives easier). |
I'd be tempted to consider GCM-CTR with the IV deriving from the chunk's identifier. |
It now calculates a key handle using the 4 leading bytes of SHA256(key). That should be acceptable and unique enough. It's also adding file extensions, together with the algorithm which should ensure that compressed, uncompressed, encrypted chunks can live next to each other in the same store. As for using the chunkID as IV, I'm still trying to think of ways that could be abused. Especially in the HTTP server where the chunk ID is passed from the outside. There are flags to disable validation on that part for performance, though it should still get checked before it ever gets written to a store. A small mistake in this area could be fatal for encryption. Are you mostly concerned with the possibility of an IV collision or the extra prefix that is stored? If it's mostly about collisions, I could pick CBC mode as default which is less impacted by a collision, though also slower. |
The goal was to eliminate potential for collisions while getting the advantage of GCM-CTR's integrity protection -- and the collision protection inherent given the nature of chunk identifiers. It would indeed be good to get feedback from someone with a stronger background there. |
After reading a lot more about the different modes and IV generation, my suggestion is this:
Both are supported in TLS1.3 and are AEAD, though that isn't strictly necessary in this use-case it won't hurt. Not sure yet what to do with the CTR mode, could leave it in as 3rd option perhaps, with the same restrictions as GCM. |
Removed CTR mode which leaves the more modern AEAD algorithms |
Cool! Some questions/notes: Why was key=SHA256(password) chosen instead of a dedicated key derivation function such as scrypt? The chunk file names are not encrypted and if I understand correctly, this mean that the hashes of plaintext content are visible to whoever can observe these file names (fs access or e.g., web servers). Depending on the threat model, this could be catastrophic. The same weaknesses that apply to convergent encryption would apply to this. For instance, see https://www.mail-archive.com/[email protected]/msg08949.html . I also recommend taking a look at how restic does encryption, as it has much in common with casync. |
A couple of reasons for choosing SHA256. The goal is to have one password that is used to encrypt/decrypt consistently. With bcrypt/scrypt one needs a salt. So I'd have to either use a fixed salt, or derive the salt from the password. Neither is a good choice. Also, scrypt is mainly about protecting the original password when it's derived key value is leaked. In this case though, if the key leaked, it's sufficient to decrypt. No need to break the password. Yes, the content hashes are not encrypted. I don't consider that an issue though, they really mean nothing to an observer. They'e also listed in the index files which aren't encrypted. |
Thank you for taking the time to answer! What kind of threats is the encryption intended to protect against? I was thinking of something like the Restic threat model. For example, that the chunks are stored on a system that is not fully trusted (like cloud storage or a shared system) where outsiders may have access to the files or file names. Then encryption could be used so that these outsiders can not find out anything about the content stored. Please take the below with a grain of salt, as I'm not a cryptographer :-) If an attacker can get hold of a chunk and the key is simply SHA256(password), then this password can be efficiently bruteforced as both SHA2 and AES are designed to be fast, and have hardware implementation on standard CPUs and can also be done on GPUs. Scrypt on the other hand is a KDF and is designed to be slow and hard to bruteforce. Even with a static or derived salt, it should be better than just SHA256. Could a salt be saved within the chunk store or locally in the config? I think restic uses a master encryption key for a repository, and then encrypts this master key with one or several keys derived using scrypt+password, and stores these encrypted keys and scrypt salts within the repository. This has the added benefit of being able to change the password and having several passwords. If an observer can see the chunk filenames/IDs (hashes of the content), then this observer can tell whether a plaintext blob known to the observer is stored in or fetched from the chunk store (list file names or view fetched urls in web server logs). In some cases it this may be harmless as the observer already needs to have the plain text content to know what the hash is, but there are numerous scenarios where this can be disastrous. For example, if an internal Amazon memo is leaked, then Amazon can quickly determine which users have stored the leaked memo in encrypted chunk stores in S3 by scanning for file names with the known hashes. Or, an attacker can precompute chunk hashes of vulnerable software and libraries, and mount directed attacks at IP addresses that fetch these encrypted chunks. And, even if the observer doesn't know the plaintext, he may be able to brute force the plaintext if he knows the hashes! There may be file formats or templates where most of the content is the same, but a small portion may vary. Then, the attacker can test by putting different content in the variable part, hash the result, and determine whether such an chunk exists in the encrypted chunk store. This is the kind of attack that was explained in the email thread I linked. But I may be misunderstanding the way the encryption is supposed to be used. For instance you said that the chunk hashes are in the index files as well. This is true, but I was thinking that these would be stored locally somewhere else and not visible to the operator of the chunk store. |
Regarding the privacy aspect you're correct. This implementation protects data that is unknown to an attacker. It does not protect against an attacker that has full access to logs and the chunk store from checking if a certain file may be contained in it. And if the attacker has access to the index file and a full access log then that could be used to confirm if that index file was used, but not what's in it. If this is a concern, then this isn't the right tool. It really only protects the data in chunks from being exposed if leaked or accessed without the password. It'll not obfuscate access. With regards to SHA256 vs scrypt, that's a different story. It makes no difference here at all. Consider you're saying it'd be easy to brute force the password. Well, you wouldn't have to, right? Since brute forcing this would mean guessing the password for a given key (which you can't actually get, see below). But, since you have the key in this case, why would you need the password? You have the key, you can decrypt all of it, no need to know the password as well. The SHA256 here is way to generate a key of the correct length from a variable-length password. The alternative would be to ask the user to provide a key of the right size, in binary. It would be equally secure if chosen properly. Now, to get the key you'd have to brute-force AES and that's just not feasible, even if you knew the plain text. If this was easy, then AES wouldn't be a thing at all. Basically, you won't be able to get the key. Encrypting the chunk filenames to hide the content-hashes as well may be possible, not sure it's the right thing to do for this tool though, but will consider it. The would prevent attackers from checking if a file is in the store with trial-and-error, but it won't close all the holes, since file-size also plays a role in this, it's not as straightforward |
I am assuming that the attacker has an encrypted chunk file, but not the password or the AES key. I also assume that using a password implies that the effective key space may be smaller than the AES key space. If you used a randomly generated password with 256 bits of entropy (like a totally random 43 character case sensitive alphanumeric password), then yes, SHA256 vs scrypt wouldn't matter. But as the key space for user chosen passwords may be significantly smaller, then password cracking using brute force + dictionary attacks can be effective. To mount a successful password cracking attack given the assumptions above, you don't need to know the AES key. When you test a password, you derive the key, use it to decrypt the first block in the chunk and check for the Zstandard magic number. Or since the last four bytes of the sha256 sum of the derived key are stored last in the filename, this can also be used as a check for correct password. Or do both. These operations are really quick (something like millions of tests per second on a single laptop core and billions on a consumer GPU). |
Ok, so the main issue is really short/insecure passwords. That is indeed a problem which the tool doesn't currently protect against. scrypt would make the keygen operation more expensive and perhaps allow for shorter passwords, but is that really the solution? Especially with a fixed salt it's not great either. One alternative would be to expect the whole key to be configured, all 256 bits of it. It's easy enough to generated with something like |
Adds chunk encryption as option to chunk stores. Supports aes-256-ctr at this point (though this could be expanded to other key sizes or algorithms). The key is generated by hashing the password with SHA256. For now the only way to configure encryption is by using the config file.
TODO:
Any (early) feedback is welcome.