diff --git a/content/posts/ctf/hkcert22/2022-11-14-hkcert-2022-base64-encryption.md b/content/posts/ctf/hkcert22/2022-11-14-hkcert-2022-base64-encryption.md
index 2b26c52bc..ef9abb341 100644
--- a/content/posts/ctf/hkcert22/2022-11-14-hkcert-2022-base64-encryption.md
+++ b/content/posts/ctf/hkcert22/2022-11-14-hkcert-2022-base64-encryption.md
@@ -43,9 +43,9 @@ We’re provided with an encryption script `chall.py` (written in Python), along
 
 So how do we go about cracking this? Brute-force will be undoubtedly inefficient as we have $64! \approx 1.27 \times 10^{89}$ mapping combinations to try. It would take *years* before we have any progress! Also we’d need to look at results to determine if the English looks right (or automate it by checking a word list)—this would take even more time! Regardless, we need to find some other way.
 
-## Let’s Get Cracking
+## First Steps: Elimination by ASCII Range
 
-Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-127). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 mappings for the letter `A`. After blacklisting, we may be left with, say, 16 mappings. This drastically reduces the search space.[^extended-ascii]
+Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-127). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 possible mappings for the letter `A`. After blacklisting, we may be left with, say, 16 possible mappings. This drastically reduces the search space.[^extended-ascii]
 
 Since Base64 simply maps 8-bits to 6-bits, so 3 characters of ASCII would be translated to 4 characters of Base64.
 
@@ -62,23 +62,31 @@ def get_chars_with_mask(m):
     """Get Base64 chars which are masked with m."""
     return {c for i, c in enumerate(charset) if (i & m) == m}
 
+# List the 4 Base64 positions. We'll cycle through these positions (i.e. i % 4).
 msbs = [0b100000, 0b001000, 0b000010, 0b000000]
+
+# Get impossible characters for each position.
 subchars = [get_chars_with_mask(m) for m in msbs]
 
+# Create a blacklist for each Base64 char.
+# e.g. blacklist['A'] returns the set of chars which 'A' can NOT map to.
 blacklist = {c: set() for c in charset}
 
+# Loop through each char in the shuffled Base64 text.
 for i, c in enumerate(txt):
-    # Ignore char mappings which have 1 in corresponding msb.
+    # Ignore char mappings which have '1' in corresponding msb.
     # These can't map to a printable ASCII char.
     blacklist[c] |= subchars[i % 4]
 
+# Invert the blacklist to get a dictionary of possible mappings.
+# e.g. whitelist['A'] returns the set of chars which 'A' CAN map to.
 whitelist = {k: set(charset) - v for k, v in blacklist.items()}
 ```
 
 We can check the mappings we’ve eliminated:
 
 ```python
-print(''.join(sorted(blacklist['J']))
+print(''.join(sorted(blacklist['J'])))
 # '+/0123456789CDGHKLOPSTWXabefghijklmnopqrstuvwxyz'
 ```
 
@@ -97,9 +105,12 @@ We can do a similar thing on the low end. Again, since the smallest printable AS
 def get_inverted_chars_with_mask(m):
     return {c for i, c in enumerate(charset) if ((2**6 - 1 - i) & m) == m}
 
-subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii] # chars that don't have bits set in ascii.
+# chars that don't have bits set in ascii.
+subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii]
 ```
 
+## Frequency Analysis with Known Text
+
 Another idea comes to mind. Remember the plaintext is in English? Well, with English text, some letters appear more frequently than others. The same applies to words and sequences.
 
 {% image "assets/base64-letter-frequencies.jpg", "w-65", "Frequency of English letters. But we need to be careful with letter cases." %}
@@ -120,7 +131,7 @@ V2UncmUgbm8gc3RyYW5nZXJzIHRvIGxvdmUKWW91IGtub3cgdGhlIHJ1bGVzIGFuZCBzbyBkbyBJIChk
 {% image "assets/b64-crypt-1gram.jpg", "", "dcode.fr frequency analysis for encrypted Base64." %}
 {% endimages %}
 
-Frequency analysis of plain vs. encrypted Base64.
+Frequency analysis of plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.
 {.caption}
 
 From this, we can deduce that 'w' was mapped from 'G' in the original encoding (due to the gap in frequency).
@@ -132,18 +143,22 @@ One useful option is the **bigrams/n-grams** option. We can tell dcode to analys
 {% image "assets/b64-crypt-4gram.jpg", "", "dcode.fr 4-gram for encrypted Base64." %}
 {% endimages %}
 
-Frequency analysis of 4-grams in plain vs. encrypted Base64.
+Frequency analysis of 4-grams in plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.
 {.caption}
 
 Observe how "YoJP0H" occurs (relatively) frequently. This corresponds to "IHRoZS", which happens to be the Base64 encoding for " the".
 
+## More Heuristics
+
 Frequency analysis is useful to group letters into buckets. But using frequency analysis alone is painful. Some guesswork is needed. Here's the complete process I went through:
 
 - Frequency Analysis: use dcode.fr to associate frequent characters.
   - We can make use of our earlier constraints to eliminate wrong guesses.[^byebye-constraints]
   ```python
-  guesses = { # Dictionary of guessed mappings.
+  # Dictionary of guessed mappings.
+  # key: shuffled Base64; value: plain Base64
+  guesses = {
       'w': 'G',
       'Y': 'I',
       'o': 'H',
       'c': 'B',
@@ -204,3 +219,7 @@ hkcert22{b4s3_s1x7y_f0ur_1s_4n_3nc0d1n9_n07_4n_encryp710n}
 
 [^newline]: But what about newline (`\n`, ASCII 10) and carriage return (`\r`, ASCII 13)? These are also possible to have in plaintext messages. We shouldn’t entirely discount these, but as they’re relatively rare, we won’t consider them for now.
 [^byebye-constraints]: Later on, we removed the second/third-MSB constraint since it got in the way of decoding `\n`.
+
+## Solve Script
+
+After a request, I've uploaded my uncleaned, guessy, janky script [*here*](https://gist.github.com/TrebledJ/291b25df2bfc7105e08a9b9a5c30256d). Do with it what you will.
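
A small sketch for readers who want to see where the `msbs` masks in the elimination step come from. It is not taken from `chall.py` or the linked gist: `charset` is assumed here to be the standard Base64 alphabet, and `msb_violations` is a hypothetical helper named only for illustration. Every 3 plaintext bytes become 4 Base64 characters, so the top bit of each byte always lands on the same bit of the 1st, 2nd, and 3rd 6-bit index, while the 4th index carries no top bit at all.

```python
import base64
import string

# Assumption: the standard Base64 alphabet, in index order.
charset = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# For each of the 4 positions in a Base64 group, the bit of the 6-bit index
# that holds the most significant bit of a plaintext byte (0 = none).
msbs = [0b100000, 0b001000, 0b000010, 0b000000]


def msb_violations(plaintext: bytes) -> int:
    """Count Base64 chars whose index has a plaintext MSB bit set."""
    encoded = base64.b64encode(plaintext).decode().rstrip("=")
    count = 0
    for i, c in enumerate(encoded):
        mask = msbs[i % 4]
        # A zero mask means this position carries no MSB, so skip it.
        if mask and charset.index(c) & mask:
            count += 1
    return count


print(msb_violations(b"We're no strangers to love"))  # 0
print(msb_violations(bytes([0x41, 0x80, 0x41])))      # 1
```

Plain ASCII text reports zero violations, which is the property the blacklist relies on: any candidate mapping that would set one of these bits can be ruled out for that position.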