content: update base64 post for clarity, and include janky solve script
TrebledJ committed Jun 9, 2024
1 parent 40301fa commit 4ae0b05
Showing 1 changed file with 27 additions and 8 deletions.
@@ -43,9 +43,9 @@ We’re provided with an encryption script `chall.py` (written in Python), along

So how do we go about cracking this? Brute force is hopeless: there are $64! \approx 1.27 \times 10^{89}$ possible mappings to try, far too many to enumerate in any realistic amount of time. We’d also need to check each result to see whether the English looks right (or automate this by checking against a word list), which would take even more time. We need to find some other way.
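
To get a feel for that number, here is a rough back-of-the-envelope sketch. The guessing rate of one billion mappings per second is an assumption picked purely for illustration, not a measurement.

```python
import math

# Number of ways to permute the 64-character Base64 alphabet.
keyspace = math.factorial(64)
print(f"{keyspace:.3e} possible mappings")

# Hypothetical attacker speed (an assumption for illustration only).
guesses_per_second = 10**9
seconds_per_year = 60 * 60 * 24 * 365

years = keyspace / (guesses_per_second * seconds_per_year)
print(f"{years:.1e} years to try them all")
```

Even at that generous rate, exhausting the keyspace takes vastly longer than the age of the universe, so pruning the search space is the only practical route.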

## Let’s Get Cracking
## First Steps: Elimination by ASCII Range

Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-127). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 mappings for the letter `A`. After blacklisting, we may be left with, say, 16 mappings. This drastically reduces the search space.[^extended-ascii]
Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-127). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 possible mappings for the letter `A`. After blacklisting, we may be left with, say, 16 possible mappings. This drastically reduces the search space.[^extended-ascii]

Base64 regroups 8-bit bytes into 6-bit units, so every 3 characters of ASCII are translated to 4 characters of Base64.
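
As a small sketch of where the MSBs end up, the snippet below computes, for each of the 4 Base64 positions, which of its 6 bits holds the MSB of an input byte. The helper name `msb_masks` is made up for this illustration, and `charset` is assumed to be the standard Base64 alphabet.

```python
import base64

# Standard Base64 alphabet (assumed to match the post's `charset`).
charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def msb_masks():
    """For each of the 4 Base64 positions, return a 6-bit mask of the input-byte MSBs it carries."""
    masks = [0, 0, 0, 0]
    for byte_index in range(3):
        bit = byte_index * 8                  # offset of this byte's MSB within the 24-bit group
        char_index, offset = divmod(bit, 6)   # which Base64 char, and which bit inside it
        masks[char_index] |= 1 << (5 - offset)
    return masks

print([f"{m:06b}" for m in msb_masks()])
# ['100000', '001000', '000010', '000000']

# Sanity check: three bytes with only the MSB set encode to exactly these sextets.
encoded = base64.b64encode(b"\x80\x80\x80").decode()
assert [charset.index(c) for c in encoded] == msb_masks()
```

These are the same values as the `msbs` list in the script: the fourth Base64 character never carries an MSB, so it yields no constraint.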

@@ -62,23 +62,31 @@ def get_chars_with_mask(m):
    """Get Base64 chars which are masked with m."""
    return {c for i, c in enumerate(charset) if (i & m) == m}

# List the 4 Base64 positions. We'll cycle through these positions (i.e. i % 4).
msbs = [0b100000, 0b001000, 0b000010, 0b000000]

# Get impossible characters for each position.
subchars = [get_chars_with_mask(m) for m in msbs]

# Create a blacklist for each Base64 char.
# e.g. blacklist['A'] returns the set of chars which 'A' can NOT map to.
blacklist = {c: set() for c in charset}

# Loop through each char in the shuffled Base64 text.
for i, c in enumerate(txt):
    # Ignore char mappings which have 1 in corresponding msb.
    # Ignore char mappings which have '1' in corresponding msb.
    # These can't map to a printable ASCII char.
    blacklist[c] |= subchars[i % 4]

# Invert the blacklist to get a dictionary of possible mappings.
# e.g. whitelist['A'] returns the set of chars which 'A' CAN map to.
whitelist = {k: set(charset) - v for k, v in blacklist.items()}
```

We can check the mappings we’ve eliminated:

```python
print(''.join(sorted(blacklist['J']))
print(''.join(sorted(blacklist['J'])))
# '+/0123456789CDGHKLOPSTWXabefghijklmnopqrstuvwxyz'
```

@@ -97,9 +105,12 @@ We can do a similar thing on the low end. Again, since the smallest printable AS
def get_inverted_chars_with_mask(m):
    return {c for i, c in enumerate(charset) if ((2**6 - 1 - i) & m) == m}

subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii] # chars that don't have bits set in ascii.
# chars that don't have bits set in ascii.
subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii]
```

## Frequency Analysis with Known Text

Another idea comes to mind. Remember the plaintext is in English? Well, with English text, some letters appear more frequently than others. The same applies to words and sequences.

{% image "assets/base64-letter-frequencies.jpg", "w-65", "Frequency of English letters. But we need to be careful with letter cases." %}
@@ -120,7 +131,7 @@ V2UncmUgbm8gc3RyYW5nZXJzIHRvIGxvdmUKWW91IGtub3cgdGhlIHJ1bGVzIGFuZCBzbyBkbyBJIChk
{% image "assets/b64-crypt-1gram.jpg", "", "dcode.fr frequency analysis for encrypted Base64." %}
{% endimages %}

<sup>Frequency analysis of plain vs. encrypted Base64.</sup>
<sup>Frequency analysis of plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.</sup>
{.caption}

From this, we can deduce that 'w' was mapped from 'G' in the original encoding (due to the gap in frequency).
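
The same kind of comparison can be sketched locally instead of through dcode.fr. Everything in this example is a toy stand-in: the sample text, the seeded shuffle, and the helper name `rank_by_frequency` are assumptions for illustration, not part of the challenge.

```python
import base64
import random
from collections import Counter

charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def rank_by_frequency(b64_text):
    """Return Base64 characters ordered from most to least frequent (padding ignored)."""
    counts = Counter(c for c in b64_text if c in charset)
    return [c for c, _ in counts.most_common()]

# Toy stand-ins (assumptions): a short English sample and a seeded random alphabet shuffle.
sample = b"We're no strangers to love. You know the rules and so do I. " * 20
plain_b64 = base64.b64encode(sample).decode()
shuffled = "".join(random.Random(0).sample(charset, len(charset)))
cipher_b64 = plain_b64.translate(str.maketrans(charset, shuffled))

# Pair characters of equal rank: each pair is a candidate (shuffled -> plain) mapping.
candidates = dict(zip(rank_by_frequency(cipher_b64), rank_by_frequency(plain_b64)))
print(list(candidates.items())[:5])
```

In this toy setup both sides come from the same plaintext, so the ranks line up perfectly. In the real challenge the reference corpus differs from the hidden article, so rank pairs are only guesses, and they are most trustworthy where there is a clear gap in frequency.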
@@ -132,18 +143,22 @@ One useful option is the **bigrams/n-grams** option. We can tell dcode to analys
{% image "assets/b64-crypt-4gram.jpg", "", "dcode.fr 4-gram for encrypted Base64." %}
{% endimages %}

<sup>Frequency analysis of 4-grams in plain vs. encrypted Base64.</sup>
<sup>Frequency analysis of 4-grams in plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.</sup>
{.caption}

Observe how "YoJP0H" occurs (relatively) frequently. This corresponds to "IHRoZS", which is how " the" followed by a space begins when encoded in Base64 (assuming it starts on a 3-byte boundary).
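
A quick way to check this correspondence and turn it into mappings is sketched below. Note that the sixth character depends on the byte after " the": with a trailing space it is `S`, whereas " the" on its own encodes to `IHRoZQ==`.

```python
import base64

# " the " aligned on a 3-byte boundary; the trailing space determines the sixth character.
plain = base64.b64encode(b" the ").decode()
print(plain)  # IHRoZSA=

# Pair the frequent shuffled sequence with the plain Base64 prefix, character by character.
observed = "YoJP0H"
mapping = dict(zip(observed, plain[:len(observed)]))
print(mapping)
# {'Y': 'I', 'o': 'H', 'J': 'R', 'P': 'o', '0': 'Z', 'H': 'S'}
```

Each pair reads as shuffled character to plain Base64 character, which is the same orientation as the `guesses` dictionary.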

## More Heuristics

Frequency analysis is useful to group letters into buckets. But using frequency analysis alone is painful. Some guesswork is needed. Here's the complete process I went through:

- Frequency Analysis: use dcode.fr to associate frequent characters.
- We can make use of our earlier constraints to eliminate wrong guesses.[^byebye-constraints]

```python
guesses = { # Dictionary of guessed mappings.
# Dictionary of guessed mappings.
# key: shuffled Base64; value: plain Base64
guesses = {
    'w': 'G', 'Y': 'I',
    'o': 'H', 'c': 'B',

@@ -204,3 +219,7 @@ hkcert22{b4s3_s1x7y_f0ur_1s_4n_3nc0d1n9_n07_4n_encryp710n}
[^newline]: But what about newline (`\n`, ASCII 10) and carriage return (`\r`, ASCII 13)? These are also possible to have in plaintext messages. We shouldn’t entirely discount these, but as they’re relatively rare, we won’t consider them for now.
[^byebye-constraints]: Later on, we removed the second/third-MSB constraint since it got in the way of decoding `\n`.
## Solve Script

After a request, I've uploaded my uncleaned, guessy, janky script [*here*](https://gist.github.com/TrebledJ/291b25df2bfc7105e08a9b9a5c30256d). Do with it what you will.
