Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encryption for chunks #182

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Encryption for chunks #182

wants to merge 9 commits into from

Conversation

folbricht
Copy link
Owner

@folbricht folbricht commented Dec 27, 2020

Adds chunk encryption as option to chunk stores. Supports aes-256-ctr at this point (though this could be expanded to other key sizes or algorithms). The key is generated by hashing the password with SHA256. For now the only way to configure encryption is by using the config file.

TODO:

  • Document how to use it
  • Add a way to provide encryption for HTTP chunk servers, since the store config doesn't apply to server-side HTTP. Probably a new set of options.
  • Perhaps a couple more tests, ideally a high-level one that uses a config.json

Any (early) feedback is welcome.

@folbricht
Copy link
Owner Author

Open questions:

  • Is CTR mode the right choice? Since a 16-byte random IV is used it appears to be sufficiently safe from ever repeating. Alternatively could use CBC with random IV
  • Should encrypted chunks get a dedicated file extension? Right now we have compressed chunks with .cacnk, uncompressed with none. Could consider appending something like .aes-256-cbc in order to avoid conflicts. Though there would still be conflicts if a different password is used for the same store.

@charles-dyfis-net
Copy link
Collaborator

With respect to naming convention, my own preference would be to support having a key identifier in the name for encrypted chunks. Whether that's a few bytes out of a hash of the key or a user-specified opaque value is something that would require careful design consideration (the former is an information leak, the latter depends on users to do the right thing, so neither is ideal; this is one of those places where cryptosystems that generate handles for keys alongside those keys -- or use assymetric encryption and can take the handle from a hash of the public side of the key -- are making lives easier).

@charles-dyfis-net
Copy link
Collaborator

I'd be tempted to consider GCM-CTR with the IV deriving from the chunk's identifier.

@folbricht
Copy link
Owner Author

It now calculates a key handle using the 4 leading bytes of SHA256(key). That should be acceptable and unique enough. It's also adding file extensions, together with the algorithm which should ensure that compressed, uncompressed, encrypted chunks can live next to each other in the same store.

As for using the chunkID as IV, I'm still trying to think of ways that could be abused. Especially in the HTTP server where the chunk ID is passed from the outside. There are flags to disable validation on that part for performance, though it should still get checked before it ever gets written to a store. A small mistake in this area could be fatal for encryption.

Are you mostly concerned with the possibility of an IV collision or the extra prefix that is stored? If it's mostly about collisions, I could pick CBC mode as default which is less impacted by a collision, though also slower.

@charles-dyfis-net
Copy link
Collaborator

The goal was to eliminate potential for collisions while getting the advantage of GCM-CTR's integrity protection -- and the collision protection inherent given the nature of chunk identifiers.

It would indeed be good to get feedback from someone with a stronger background there.

@folbricht
Copy link
Owner Author

After reading a lot more about the different modes and IV generation, my suggestion is this:

Both are supported in TLS1.3 and are AEAD, though that isn't strictly necessary in this use-case it won't hurt. Not sure yet what to do with the CTR mode, could leave it in as 3rd option perhaps, with the same restrictions as GCM.

@folbricht folbricht changed the title AES encryption for chunks Encryption for chunks Dec 31, 2020
@folbricht
Copy link
Owner Author

Removed CTR mode which leaves the more modern AEAD algorithms xchacha20-poly1305 and aes-256-gcm.

@folbricht folbricht marked this pull request as ready for review January 9, 2021 16:23
@antong
Copy link

antong commented Jan 14, 2021

Cool! Some questions/notes: Why was key=SHA256(password) chosen instead of a dedicated key derivation function such as scrypt?

The chunk file names are not encrypted and if I understand correctly, this mean that the hashes of plaintext content are visible to whoever can observe these file names (fs access or e.g., web servers). Depending on the threat model, this could be catastrophic. The same weaknesses that apply to convergent encryption would apply to this. For instance, see https://www.mail-archive.com/[email protected]/msg08949.html .

I also recommend taking a look at how restic does encryption, as it has much in common with casync.

@folbricht
Copy link
Owner Author

A couple of reasons for choosing SHA256. The goal is to have one password that is used to encrypt/decrypt consistently. With bcrypt/scrypt one needs a salt. So I'd have to either use a fixed salt, or derive the salt from the password. Neither is a good choice. Also, scrypt is mainly about protecting the original password when it's derived key value is leaked. In this case though, if the key leaked, it's sufficient to decrypt. No need to break the password.

Yes, the content hashes are not encrypted. I don't consider that an issue though, they really mean nothing to an observer. They'e also listed in the index files which aren't encrypted.

@antong
Copy link

antong commented Jan 16, 2021

Thank you for taking the time to answer!

What kind of threats is the encryption intended to protect against? I was thinking of something like the Restic threat model. For example, that the chunks are stored on a system that is not fully trusted (like cloud storage or a shared system) where outsiders may have access to the files or file names. Then encryption could be used so that these outsiders can not find out anything about the content stored.

Please take the below with a grain of salt, as I'm not a cryptographer :-)

If an attacker can get hold of a chunk and the key is simply SHA256(password), then this password can be efficiently bruteforced as both SHA2 and AES are designed to be fast, and have hardware implementation on standard CPUs and can also be done on GPUs. Scrypt on the other hand is a KDF and is designed to be slow and hard to bruteforce. Even with a static or derived salt, it should be better than just SHA256. Could a salt be saved within the chunk store or locally in the config? I think restic uses a master encryption key for a repository, and then encrypts this master key with one or several keys derived using scrypt+password, and stores these encrypted keys and scrypt salts within the repository. This has the added benefit of being able to change the password and having several passwords.

If an observer can see the chunk filenames/IDs (hashes of the content), then this observer can tell whether a plaintext blob known to the observer is stored in or fetched from the chunk store (list file names or view fetched urls in web server logs). In some cases it this may be harmless as the observer already needs to have the plain text content to know what the hash is, but there are numerous scenarios where this can be disastrous. For example, if an internal Amazon memo is leaked, then Amazon can quickly determine which users have stored the leaked memo in encrypted chunk stores in S3 by scanning for file names with the known hashes. Or, an attacker can precompute chunk hashes of vulnerable software and libraries, and mount directed attacks at IP addresses that fetch these encrypted chunks.

And, even if the observer doesn't know the plaintext, he may be able to brute force the plaintext if he knows the hashes! There may be file formats or templates where most of the content is the same, but a small portion may vary. Then, the attacker can test by putting different content in the variable part, hash the result, and determine whether such an chunk exists in the encrypted chunk store. This is the kind of attack that was explained in the email thread I linked.

But I may be misunderstanding the way the encryption is supposed to be used. For instance you said that the chunk hashes are in the index files as well. This is true, but I was thinking that these would be stored locally somewhere else and not visible to the operator of the chunk store.

@folbricht
Copy link
Owner Author

Regarding the privacy aspect you're correct. This implementation protects data that is unknown to an attacker. It does not protect against an attacker that has full access to logs and the chunk store from checking if a certain file may be contained in it. And if the attacker has access to the index file and a full access log then that could be used to confirm if that index file was used, but not what's in it. If this is a concern, then this isn't the right tool. It really only protects the data in chunks from being exposed if leaked or accessed without the password. It'll not obfuscate access.

With regards to SHA256 vs scrypt, that's a different story. It makes no difference here at all. Consider you're saying it'd be easy to brute force the password. Well, you wouldn't have to, right? Since brute forcing this would mean guessing the password for a given key (which you can't actually get, see below). But, since you have the key in this case, why would you need the password? You have the key, you can decrypt all of it, no need to know the password as well. The SHA256 here is way to generate a key of the correct length from a variable-length password. The alternative would be to ask the user to provide a key of the right size, in binary. It would be equally secure if chosen properly.

Now, to get the key you'd have to brute-force AES and that's just not feasible, even if you knew the plain text. If this was easy, then AES wouldn't be a thing at all. Basically, you won't be able to get the key.

Encrypting the chunk filenames to hide the content-hashes as well may be possible, not sure it's the right thing to do for this tool though, but will consider it. The would prevent attackers from checking if a file is in the store with trial-and-error, but it won't close all the holes, since file-size also plays a role in this, it's not as straightforward

@antong
Copy link

antong commented Jan 18, 2021

I am assuming that the attacker has an encrypted chunk file, but not the password or the AES key. I also assume that using a password implies that the effective key space may be smaller than the AES key space. If you used a randomly generated password with 256 bits of entropy (like a totally random 43 character case sensitive alphanumeric password), then yes, SHA256 vs scrypt wouldn't matter. But as the key space for user chosen passwords may be significantly smaller, then password cracking using brute force + dictionary attacks can be effective.

To mount a successful password cracking attack given the assumptions above, you don't need to know the AES key. When you test a password, you derive the key, use it to decrypt the first block in the chunk and check for the Zstandard magic number. Or since the last four bytes of the sha256 sum of the derived key are stored last in the filename, this can also be used as a check for correct password. Or do both. These operations are really quick (something like millions of tests per second on a single laptop core and billions on a consumer GPU).

@folbricht
Copy link
Owner Author

Ok, so the main issue is really short/insecure passwords. That is indeed a problem which the tool doesn't currently protect against. scrypt would make the keygen operation more expensive and perhaps allow for shorter passwords, but is that really the solution? Especially with a fixed salt it's not great either. One alternative would be to expect the whole key to be configured, all 256 bits of it. It's easy enough to generated with something like openssl rand -hex 32 and that could be documented. Then there's no chance to pick a bad password.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants