Histograms: Fix rounding and improve format #4

ramongonze · 2024-04-16T02:07:32Z

Problem: The histogram provided by the package (for all attacks) is rounding probabilities with <0.5% to 0%, but for any attack the probability can never be 0. The probabilities should not be rounded.

Improvement: Each probability is being rounding before generating the histogram. The histogram could be given without any rounding in the following way:

Set a histogram with 100 bins: (0,1], (1,2], ..., (99,100]
Each probability could be placed in the corresponding bin. For instance, 0.4375 would be placed in the 43th bin, (43,44].

As the histogram is already a dictionary, it could be given with the interval of bins as the labels. For example:

{
"(0,1]": probability_1,
"(1,2]": probability_2,
...
"(99,100]": probability_100,
}

The text was updated successfully, but these errors were encountered:

luigidcsoares · 2024-04-16T03:37:49Z

Regardless of what we use as the labels, we need to compute which bin each group belongs to. This is currently done by rounding, which causes the problem. Replacing round with math.ceil solves the problem partially. It helps with the rounding error, but there still might be floating-point errors. For example, say the actual value is 0.05, meaning that it should be placed in the bin (4, 5], but due to floating-point errors we get 0.0500000000001, Then, it is going to be rounded up to 0.06, and thus it is going to be placed in the bin (5, 6]. We can solve this, e.g., by using the Decimal type instead, but (probably) at the cost of performance. Nevertheless, perhaps this is not a real concern, as it should not occur very often, and the histogram is just a visual representation to try and understand how the data behaves.

Regarding the labels, @nunesgh I think you can use tuples for the intervals, and (this may depend on the Pandas version, but I'm using an old one, 1.5.3) you don't need to convert the dictionary to string:

>>> df = pd.DataFrame({"histogram": [ { (1, 2): 0.8, (2, 3): 0.2 }, { (1, 2): 0.3, (2, 3): 0.7 }  ]})
>>> df.columns
Index(['histogram'], dtype='object')
>>> df
                    histogram
0  {(1, 2): 0.8, (2, 3): 0.2}
1  {(1, 2): 0.3, (2, 3): 0.7}
>>> df.loc[0, "histogram"]
{(1, 2): 0.8, (2, 3): 0.2}
>>> type(df.loc[0, "histogram"])
<class 'dict'>

ramongonze · 2024-04-16T15:31:07Z

I still don't understand why it's necessary to round (or ceiling) the probabilities instead of only using the original value itself and place it in the proper bin.

luigidcsoares · 2024-04-17T01:11:39Z

place it in the proper bin.

And how would we do that without computing the index of the bin? We could traverse all the bins every time and compare the probability against the intervals, but it's worst in terms of performance and it doesn't solve the floating-point error.

ramongonze · 2024-04-17T14:26:14Z

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability $p$ the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities $p < 0.01$ in the first bin.

luigidcsoares · 2024-04-17T20:46:41Z

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability p the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities p<0.01 in the first bin.

You're still rounding, but down instead (or, more precisely, truncating). And, it doesn't work the way we want. Say $p = 1.5\%$. Then, max(1, int(1.5/100 * 100)) = max(1, int(1.5)) = max(1, 1) = 1, which in your approach corresponds to bin (0, 1], but 1.5 should be placed in the bin 2 (1, 2].

ramongonze · 2024-04-18T00:37:36Z

I'm sorry, you're right. It seems that using ceiling is more appropriate. So the bin index would be ceil(p*100), where $p\in[0,1]$, right?

luigidcsoares · 2024-04-18T00:57:10Z

So the bin index would be ceil(p*100), where p∈[0,1], right?

That's right! Take a look at the changes in #5.

luigidcsoares added the bug Something isn't working label Apr 16, 2024

nunesgh self-assigned this Apr 16, 2024

luigidcsoares linked a pull request Apr 17, 2024 that will close this issue

fix: build histogram x-axis as intervals (a, b] (#4) #5

Open

luigidcsoares assigned luigidcsoares and unassigned nunesgh Apr 17, 2024

This comment was marked as spam.

Sign in to view

Repository owner locked as spam and limited conversation to collaborators Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Histograms: Fix rounding and improve format #4

Histograms: Fix rounding and improve format #4

ramongonze commented Apr 16, 2024

luigidcsoares commented Apr 16, 2024

ramongonze commented Apr 16, 2024

luigidcsoares commented Apr 17, 2024

ramongonze commented Apr 17, 2024 •

edited

Loading

luigidcsoares commented Apr 17, 2024

ramongonze commented Apr 18, 2024

luigidcsoares commented Apr 18, 2024

This comment was marked as spam.

Histograms: Fix rounding and improve format #4

Histograms: Fix rounding and improve format #4

Comments

ramongonze commented Apr 16, 2024

luigidcsoares commented Apr 16, 2024

ramongonze commented Apr 16, 2024

luigidcsoares commented Apr 17, 2024

ramongonze commented Apr 17, 2024 • edited Loading

luigidcsoares commented Apr 17, 2024

ramongonze commented Apr 18, 2024

luigidcsoares commented Apr 18, 2024

This comment was marked as spam.

ramongonze commented Apr 17, 2024 •

edited

Loading