Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve string randomness #173

Merged
merged 1 commit into from
Mar 20, 2024
Merged

Conversation

wuyihung
Copy link
Contributor

@wuyihung wuyihung commented Mar 17, 2024

Problem

Current implementation of function fill_rand_string uses randombytes() to get random bytes and then get modulus of 26. However, since the number of variations is 256, which is not exact division of 26, and this causes the last four characters 'w', 'x', 'y', and 'z' appearing with less frequency than other characters. By testing with the script https://github.com/wuyihung/lab0-c/blob/master/scripts/test_prng.py, it is found that the entropy and arithmetic mean slightly deviates from the theoretical values:

PRNG Theoretical Current
Entropy (bits per byte) 4.700440 4.699504
Arithmetic mean 109.5 109.3771

Solution

We expand buffer to 64-bit unsigned integer before getting random bytes. Calculating modulus on 64-bit unsigned integer gives more random result.

Verification

After implementation, the entropy and arithmetic mean improves to be closer to theoretical values:

PRNG Theoretical After implementation
Entropy (bits per byte) 4.700440 4.700423
Arithmetic mean 109.5 109.5105

@jserv
Copy link
Contributor

jserv commented Mar 17, 2024

I would like to ask @jouae for reviewing.

@jserv jserv changed the title Improve fill_rand_string Improve string randomness Mar 17, 2024
@jouae
Copy link

jouae commented Mar 17, 2024

@wuyihung May I ask you how did you compute the arithmetic mean?

I know the formula is
$$\bar{x} =\dfrac{1}{n} \sum_{i=1}^{n} x_i$$

But I want to know the set $\lbrace x_i \rbrace_{i=1}^n$ and function that you choose for calculate the mean.

Second, is the theoretical value for entropy calculated by hand or computer?

If your result comes from a computer, there might be slight deviations due to rounding errors or truncation errors.

The former error depends on the precision of the data, while the latter depends on how the entropy function is discretized.

Both of these can provide a mathematical analysis to determine the interval where the error lies.

@wuyihung
Copy link
Contributor Author

  1. Regarding the samples and function to calculate the arithmetic mean, 150,000 samples were generated via the command "ih RAND" and these samples are used as argument to the "ent" command to calculate the arithmetic mean. For example, the recent test is as follows:
    image
  2. The theoretical value of entropy refers to the formula stated in the following course note. For English lowercase characters, the maximal entropy is $S=log_2(26)=4.700439718141092$.
    image

@visitorckw
Copy link
Contributor

The PR contains a comprehensive description, analysis, and data, but these valuable insights are not included in the commit message. I believe including them in the commit message would be beneficial for future reference and understanding of the change history.

@jouae
Copy link

jouae commented Mar 17, 2024

Hi, @wuyihung,

I use your python test program and plot the bar of dictionary. Not only 'y' 'z' but also 'q' 'r' less frequency than other characters

For one run of your test, with each string size is between MIN_RANDSTR_LEN and MAX_RANDSTR_LEN and ih RAND 30 insert head of queue 30 strings. Do these operation for 150,000/30 times and finally get the count of each characters from 'a' to 'z' as following list:

[41080, 41161, 40861, 41197, 41150, 40723, 40969, 41191, 41006, 41099, 41278, 41312, 41146, 40890, 41113, 40846, 36915, 36894, 41141, 41231, 41035, 41269, 40993, 40905, 36696, 36887]

before
The data of the figure is before modifying fill_rand_string. The x-axis is the the characters and the y-axis is the frequency of the characters.

Copy link

@jouae jouae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use your test program and compare before and after modifying by following setting:

In one run of your test, with each string size is between MIN_RANDSTR_LEN and MAX_RANDSTR_LEN and ih RAND 30 insert head of queue 30 strings. Do these operation for $150,000/30$ times and finally get the count of each characters from 'a' to 'z' as following list:

Before modifying, the frequency of each characters from 'a' to 'z' is
[41080, 41161, 40861, 41197, 41150, 40723, 40969, 41191, 41006, 41099, 41278, 41312, 41146, 40890, 41113, 40846, 36915, 36894, 41141, 41231, 41035, 41269, 40993, 40905, 36696, 36887]
before

After modifying, the frequency of each characters from 'a' to 'z' is
[40629, 40651, 40560, 40750, 40237, 40244, 39793, 40428, 40261, 40372, 40540, 40571, 40310, 40795, 40222, 39952, 40452, 40058, 40700, 40048, 40713, 40398, 40332, 40431, 40047, 40328]
after

Great job!

@wuyihung
Copy link
Contributor Author

@visitorckw To include the background to the commit message, I reverted and committed again with the same file change but with the detail background.

@jouae It is interesting that the chart indicates 'q' and 'r' appear less frequently. I assumed that 'w' and 'x' instead to appear less frequently. Other than that, this also triggered me to correct my original description to state that there are four instead of three characters appearing less frequently.

@jouae
Copy link

jouae commented Mar 17, 2024

@wuyihung I use 26U replace sizeof(charset) - 1 and get the result that the frequency of 'w' 'x' 'y' 'z' is less than other characters.

-        buf[n] = charset[buf[n] % (sizeof(charset) - 1)];
+        buf[n] = charset[buf[n] % 26U];

more detail see My notes, written in Traditional Chinese.

@visitorckw
Copy link
Contributor

@visitorckw To include the background to the commit message, I reverted and committed again with the same file change but with the detail background.

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

@jserv jserv requested a review from jouae March 18, 2024 12:02
@wuyihung
Copy link
Contributor Author

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

To avoid risk of force pushing, I thought it is better not to amend or rebase commits which have been pushed to remote repositories. But since the pull request is still open, the risk might be very low. Therefore, I rebased to organize the commit history.

@visitorckw
Copy link
Contributor

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

To avoid risk of force pushing, I thought it is better not to amend or rebase commits which have been pushed to remote repositories. But since the pull request is still open, the risk might be very low. Therefore, I rebased to organize the commit history.

Force push is a common practice for updating PR content on GitHub.
Thanks for the nice work and the updates!

@wuyihung
Copy link
Contributor Author

I found that there are missing sentences in the commit message. Plan to correct it tonight.

@wuyihung
Copy link
Contributor Author

I found that there are missing sentences in the commit message. Plan to correct it tonight.

Done.

qtest.c Outdated Show resolved Hide resolved
Current implementation of function fill_rand_string() uses randombytes()
to get random bytes and then gets modulus of 26. However, since the
number of variations is 256, which is not exact division of 26, this
causes the last four characters 'w', 'x', 'y', and 'z' appearing with
less frequency than other characters. By testing, the entropy 4.699504
and arithmetic mean 109.3771 slightly deviates from the theoretical
values log2(26)=4.700440 and 109.5, respectively. Regarding the samples
and function to calculate the arithmetic mean, 150,000 samples were
generated via the command "ih RAND" and these samples are used as
argument to the "ent" command to calculate entropy and arithmetic mean.

Here we expand buffer to 64-bit unsigned integer before getting random
bytes. Calculating modulus on 64-bit unsigned integer gives more random
result.

After implementation, the entropy 4.700423 and arithmetic mean 109.5105
are improved to be closer to theoretical values.
@jserv jserv merged commit d43e072 into sysprog21:master Mar 20, 2024
1 of 2 checks passed
@jserv
Copy link
Contributor

jserv commented Mar 20, 2024

Thank @wuyihung for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants