Improve string randomness #173

wuyihung · 2024-03-17T06:23:57Z

Problem

Current implementation of function fill_rand_string uses randombytes() to get random bytes and then get modulus of 26. However, since the number of variations is 256, which is not exact division of 26, and this causes the last four characters 'w', 'x', 'y', and 'z' appearing with less frequency than other characters. By testing with the script https://github.com/wuyihung/lab0-c/blob/master/scripts/test_prng.py, it is found that the entropy and arithmetic mean slightly deviates from the theoretical values:

PRNG	Theoretical	Current
Entropy (bits per byte)	4.700440	4.699504
Arithmetic mean	109.5	109.3771

Solution

We expand buffer to 64-bit unsigned integer before getting random bytes. Calculating modulus on 64-bit unsigned integer gives more random result.

Verification

After implementation, the entropy and arithmetic mean improves to be closer to theoretical values:

PRNG	Theoretical	After implementation
Entropy (bits per byte)	4.700440	4.700423
Arithmetic mean	109.5	109.5105

jserv · 2024-03-17T07:19:37Z

I would like to ask @jouae for reviewing.

jouae · 2024-03-17T09:29:57Z

@wuyihung May I ask you how did you compute the arithmetic mean?

I know the formula is
$$\bar{x} =\dfrac{1}{n} \sum_{i=1}^{n} x_i$$

But I want to know the set $\lbrace x_i \rbrace_{i=1}^n$ and function that you choose for calculate the mean.

Second, is the theoretical value for entropy calculated by hand or computer?

If your result comes from a computer, there might be slight deviations due to rounding errors or truncation errors.

The former error depends on the precision of the data, while the latter depends on how the entropy function is discretized.

Both of these can provide a mathematical analysis to determine the interval where the error lies.

wuyihung · 2024-03-17T10:18:14Z

Regarding the samples and function to calculate the arithmetic mean, 150,000 samples were generated via the command "ih RAND" and these samples are used as argument to the "ent" command to calculate the arithmetic mean. For example, the recent test is as follows:
The theoretical value of entropy refers to the formula stated in the following course note. For English lowercase characters, the maximal entropy is $S=log_2(26)=4.700439718141092$.

visitorckw · 2024-03-17T11:57:55Z

The PR contains a comprehensive description, analysis, and data, but these valuable insights are not included in the commit message. I believe including them in the commit message would be beneficial for future reference and understanding of the change history.

jouae · 2024-03-17T12:43:34Z

Hi, @wuyihung,

I use your python test program and plot the bar of dictionary. Not only 'y' 'z' but also 'q' 'r' less frequency than other characters

For one run of your test, with each string size is between MIN_RANDSTR_LEN and MAX_RANDSTR_LEN and ih RAND 30 insert head of queue 30 strings. Do these operation for 150,000/30 times and finally get the count of each characters from 'a' to 'z' as following list:

[41080, 41161, 40861, 41197, 41150, 40723, 40969, 41191, 41006, 41099, 41278, 41312, 41146, 40890, 41113, 40846, 36915, 36894, 41141, 41231, 41035, 41269, 40993, 40905, 36696, 36887]

The data of the figure is before modifying fill_rand_string. The x-axis is the the characters and the y-axis is the frequency of the characters.

jouae

I use your test program and compare before and after modifying by following setting:

In one run of your test, with each string size is between MIN_RANDSTR_LEN and MAX_RANDSTR_LEN and ih RAND 30 insert head of queue 30 strings. Do these operation for $150,000/30$ times and finally get the count of each characters from 'a' to 'z' as following list:

Before modifying, the frequency of each characters from 'a' to 'z' is
[41080, 41161, 40861, 41197, 41150, 40723, 40969, 41191, 41006, 41099, 41278, 41312, 41146, 40890, 41113, 40846, 36915, 36894, 41141, 41231, 41035, 41269, 40993, 40905, 36696, 36887]

After modifying, the frequency of each characters from 'a' to 'z' is
[40629, 40651, 40560, 40750, 40237, 40244, 39793, 40428, 40261, 40372, 40540, 40571, 40310, 40795, 40222, 39952, 40452, 40058, 40700, 40048, 40713, 40398, 40332, 40431, 40047, 40328]

Great job!

wuyihung · 2024-03-17T16:20:37Z

@visitorckw To include the background to the commit message, I reverted and committed again with the same file change but with the detail background.

@jouae It is interesting that the chart indicates 'q' and 'r' appear less frequently. I assumed that 'w' and 'x' instead to appear less frequently. Other than that, this also triggered me to correct my original description to state that there are four instead of three characters appearing less frequently.

jouae · 2024-03-17T17:02:44Z

@wuyihung I use 26U replace sizeof(charset) - 1 and get the result that the frequency of 'w' 'x' 'y' 'z' is less than other characters.

-        buf[n] = charset[buf[n] % (sizeof(charset) - 1)];
+        buf[n] = charset[buf[n] % 26U];

more detail see My notes, written in Traditional Chinese.

visitorckw · 2024-03-18T04:52:00Z

@visitorckw To include the background to the commit message, I reverted and committed again with the same file change but with the detail background.

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

wuyihung · 2024-03-19T03:50:37Z

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

To avoid risk of force pushing, I thought it is better not to amend or rebase commits which have been pushed to remote repositories. But since the pull request is still open, the risk might be very low. Therefore, I rebased to organize the commit history.

visitorckw · 2024-03-19T04:45:17Z

I think we wouldn't want to see records of commits followed immediately by reverts in the upstream git history. We can use git rebase -i to organize the commit history, keeping only the last one.

To avoid risk of force pushing, I thought it is better not to amend or rebase commits which have been pushed to remote repositories. But since the pull request is still open, the risk might be very low. Therefore, I rebased to organize the commit history.

Force push is a common practice for updating PR content on GitHub.
Thanks for the nice work and the updates!

wuyihung · 2024-03-19T04:56:31Z

I found that there are missing sentences in the commit message. Plan to correct it tonight.

wuyihung · 2024-03-19T15:36:21Z

I found that there are missing sentences in the commit message. Plan to correct it tonight.

Done.

qtest.c

Current implementation of function fill_rand_string() uses randombytes() to get random bytes and then gets modulus of 26. However, since the number of variations is 256, which is not exact division of 26, this causes the last four characters 'w', 'x', 'y', and 'z' appearing with less frequency than other characters. By testing, the entropy 4.699504 and arithmetic mean 109.3771 slightly deviates from the theoretical values log2(26)=4.700440 and 109.5, respectively. Regarding the samples and function to calculate the arithmetic mean, 150,000 samples were generated via the command "ih RAND" and these samples are used as argument to the "ent" command to calculate entropy and arithmetic mean. Here we expand buffer to 64-bit unsigned integer before getting random bytes. Calculating modulus on 64-bit unsigned integer gives more random result. After implementation, the entropy 4.700423 and arithmetic mean 109.5105 are improved to be closer to theoretical values.

jserv · 2024-03-20T22:04:51Z

Thank @wuyihung for contributing!

jserv changed the title ~~Improve fill_rand_string~~ Improve string randomness Mar 17, 2024

jouae approved these changes Mar 17, 2024

View reviewed changes

wuyihung force-pushed the improve_fill_rand_string branch from 7337f95 to 05f6f45 Compare March 18, 2024 08:28

jserv requested a review from jouae March 18, 2024 12:02

wuyihung force-pushed the improve_fill_rand_string branch from 05f6f45 to 99d53dd Compare March 19, 2024 15:34

jserv reviewed Mar 19, 2024

View reviewed changes

qtest.c Outdated Show resolved Hide resolved

wuyihung force-pushed the improve_fill_rand_string branch from 99d53dd to d48c56d Compare March 20, 2024 14:59

jserv merged commit d43e072 into sysprog21:master Mar 20, 2024
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve string randomness #173

Improve string randomness #173

wuyihung commented Mar 17, 2024 •

edited

Loading

jserv commented Mar 17, 2024

jouae commented Mar 17, 2024

wuyihung commented Mar 17, 2024

visitorckw commented Mar 17, 2024

jouae commented Mar 17, 2024 •

edited

Loading

jouae left a comment

wuyihung commented Mar 17, 2024

jouae commented Mar 17, 2024

visitorckw commented Mar 18, 2024

wuyihung commented Mar 19, 2024

visitorckw commented Mar 19, 2024

wuyihung commented Mar 19, 2024

wuyihung commented Mar 19, 2024

jserv commented Mar 20, 2024

Improve string randomness #173

Improve string randomness #173

Conversation

wuyihung commented Mar 17, 2024 • edited Loading

Problem

Solution

Verification

jserv commented Mar 17, 2024

jouae commented Mar 17, 2024

wuyihung commented Mar 17, 2024

visitorckw commented Mar 17, 2024

jouae commented Mar 17, 2024 • edited Loading

jouae left a comment

Choose a reason for hiding this comment

wuyihung commented Mar 17, 2024

jouae commented Mar 17, 2024

visitorckw commented Mar 18, 2024

wuyihung commented Mar 19, 2024

visitorckw commented Mar 19, 2024

wuyihung commented Mar 19, 2024

wuyihung commented Mar 19, 2024

jserv commented Mar 20, 2024

wuyihung commented Mar 17, 2024 •

edited

Loading

jouae commented Mar 17, 2024 •

edited

Loading