-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve string randomness #173
Conversation
I would like to ask @jouae for reviewing. |
@wuyihung May I ask you how did you compute the arithmetic mean? I know the formula is But I want to know the set Second, is the theoretical value for entropy calculated by hand or computer? If your result comes from a computer, there might be slight deviations due to rounding errors or truncation errors. The former error depends on the precision of the data, while the latter depends on how the entropy function is discretized. Both of these can provide a mathematical analysis to determine the interval where the error lies. |
The PR contains a comprehensive description, analysis, and data, but these valuable insights are not included in the commit message. I believe including them in the commit message would be beneficial for future reference and understanding of the change history. |
Hi, @wuyihung, I use your python test program and plot the bar of dictionary. Not only 'y' 'z' but also 'q' 'r' less frequency than other characters For one run of your test, with each string size is between
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use your test program and compare before and after modifying by following setting:
In one run of your test, with each string size is between MIN_RANDSTR_LEN
and MAX_RANDSTR_LEN
and ih RAND 30
insert head of queue 30 strings. Do these operation for
Before modifying, the frequency of each characters from 'a' to 'z' is
[41080, 41161, 40861, 41197, 41150, 40723, 40969, 41191, 41006, 41099, 41278, 41312, 41146, 40890, 41113, 40846, 36915, 36894, 41141, 41231, 41035, 41269, 40993, 40905, 36696, 36887]
After modifying, the frequency of each characters from 'a' to 'z' is
[40629, 40651, 40560, 40750, 40237, 40244, 39793, 40428, 40261, 40372, 40540, 40571, 40310, 40795, 40222, 39952, 40452, 40058, 40700, 40048, 40713, 40398, 40332, 40431, 40047, 40328]
Great job!
@visitorckw To include the background to the commit message, I reverted and committed again with the same file change but with the detail background. @jouae It is interesting that the chart indicates 'q' and 'r' appear less frequently. I assumed that 'w' and 'x' instead to appear less frequently. Other than that, this also triggered me to correct my original description to state that there are four instead of three characters appearing less frequently. |
@wuyihung I use - buf[n] = charset[buf[n] % (sizeof(charset) - 1)];
+ buf[n] = charset[buf[n] % 26U]; more detail see My notes, written in Traditional Chinese. |
I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use |
7337f95
to
05f6f45
Compare
To avoid risk of force pushing, I thought it is better not to amend or rebase commits which have been pushed to remote repositories. But since the pull request is still open, the risk might be very low. Therefore, I rebased to organize the commit history. |
Force push is a common practice for updating PR content on GitHub. |
I found that there are missing sentences in the commit message. Plan to correct it tonight. |
05f6f45
to
99d53dd
Compare
Done. |
Current implementation of function fill_rand_string() uses randombytes() to get random bytes and then gets modulus of 26. However, since the number of variations is 256, which is not exact division of 26, this causes the last four characters 'w', 'x', 'y', and 'z' appearing with less frequency than other characters. By testing, the entropy 4.699504 and arithmetic mean 109.3771 slightly deviates from the theoretical values log2(26)=4.700440 and 109.5, respectively. Regarding the samples and function to calculate the arithmetic mean, 150,000 samples were generated via the command "ih RAND" and these samples are used as argument to the "ent" command to calculate entropy and arithmetic mean. Here we expand buffer to 64-bit unsigned integer before getting random bytes. Calculating modulus on 64-bit unsigned integer gives more random result. After implementation, the entropy 4.700423 and arithmetic mean 109.5105 are improved to be closer to theoretical values.
99d53dd
to
d48c56d
Compare
Thank @wuyihung for contributing! |
Problem
Current implementation of function fill_rand_string uses randombytes() to get random bytes and then get modulus of 26. However, since the number of variations is 256, which is not exact division of 26, and this causes the last four characters 'w', 'x', 'y', and 'z' appearing with less frequency than other characters. By testing with the script https://github.com/wuyihung/lab0-c/blob/master/scripts/test_prng.py, it is found that the entropy and arithmetic mean slightly deviates from the theoretical values:
Solution
We expand buffer to 64-bit unsigned integer before getting random bytes. Calculating modulus on 64-bit unsigned integer gives more random result.
Verification
After implementation, the entropy and arithmetic mean improves to be closer to theoretical values: