Thoughts on the library

Cameron Lonsdale edited this page Apr 24, 2017 · 1 revision

When to format user data on their behalf

To what degree should we preprocess the user's data on their behalf? For instance, when cracking a ciphertext, we strip whitespace and punctuation where needed so that our scoring is more effective, but we let the user leave the punctuation in, because those details may be needed when the decryption comes back.
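A minimal sketch of that idea, not the library's actual code: candidates are scored on a stripped copy, while the winning decryption is returned untouched. The names `normalize_for_scoring` and `best_decryption`, and the `score` callable, are hypothetical and exist only for illustration.

```python
import string


def normalize_for_scoring(text):
    """Return only the ASCII letters of text, uppercased (hypothetical helper)."""
    return "".join(ch for ch in text.upper() if ch in string.ascii_uppercase)


def best_decryption(candidates, score):
    """Pick the candidate whose normalized form scores highest.

    The chosen candidate is returned unmodified, so any whitespace and
    punctuation the user left in the ciphertext survives in the result.
    """
    return max(candidates, key=lambda text: score(normalize_for_scoring(text)))
```

Called as `best_decryption(candidates, score=my_scorer)`, the scorer only ever sees letters, yet the caller still gets the punctuated text back.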

Since we abstract away the scoring, it's important that we handle this data preparation on the user's behalf.

Another example is the chi_squared function, which takes two frequency distributions and compares them. For consistency it takes a source frequency and a target frequency, and then turns the target frequency into a probability distribution. What happens when the two frequencies don't contain the same characters? We get key errors, because the missing characters aren't in the dictionary. This burdens the user with having to know that the English unigrams are uppercase and contain only the letters A to Z.
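One way to avoid the key errors is sketched below. It assumes both arguments are plain dicts mapping a symbol to a count, and it makes a design choice the notes above do not settle: symbols that appear only in the source (stray punctuation, lowercase letters) are ignored rather than raising. This is an illustration, not the library's implementation.

```python
def chi_squared(source_frequency, target_frequency):
    """Sketch: chi-squared statistic of source counts against a target distribution.

    Assumption: both arguments are dicts of symbol -> count. Symbols missing
    from the source count as zero; symbols missing from the target are ignored.
    """
    target_total = sum(target_frequency.values())
    # Only count source symbols the target knows about.
    source_total = sum(source_frequency.get(symbol, 0) for symbol in target_frequency)
    if source_total == 0 or target_total == 0:
        return 0.0  # nothing comparable to measure

    statistic = 0.0
    for symbol, target_count in target_frequency.items():
        if target_count == 0:
            continue  # an expected count of zero would divide by zero
        # Turn the target count into a probability, then scale to the source size.
        expected = source_total * target_count / target_total
        observed = source_frequency.get(symbol, 0)  # .get avoids the KeyError
        statistic += (observed - expected) ** 2 / expected
    return statistic
```

Silently ignoring unknown symbols is convenient, but it can hide a case mismatch (for example, lowercase source text against uppercase unigrams), so normalizing the source before comparing is still worthwhile.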

What we could do is abstract this stripping away in frequency_analyze. We can add an optional flag to strip whitespace and punctuation, which is set by default. If a user wants to frequency analyze the whole text, they can turn it off.
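A rough sketch of what that flag could look like. The parameter name `strip` is hypothetical, and uppercasing the text when stripping is an added assumption, made so the resulting keys line up with uppercase A to Z unigrams.

```python
import string
from collections import Counter


def frequency_analyze(text, strip=True):
    """Sketch: count how often each symbol occurs in text.

    strip (hypothetical flag, on by default): keep only ASCII letters and
    uppercase them. Pass strip=False to analyze the text exactly as given.
    """
    if strip:
        text = "".join(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return dict(Counter(text))
```

With the default, `frequency_analyze("Hi, hi!")` counts only `H` and `I`; with `strip=False`, the comma, space, and exclamation mark are counted too, and case is preserved.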

I want to strike a balance where the library knows what the user most likely wants to do and has all the defaults ready for them to do it easily. But if a user wants to analyze extra things, that should be easy too.
