Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histograms: Fix rounding and improve format #4

Open
ramongonze opened this issue Apr 16, 2024 · 10 comments · May be fixed by #5
Open

Histograms: Fix rounding and improve format #4

ramongonze opened this issue Apr 16, 2024 · 10 comments · May be fixed by #5
Assignees
Labels
bug Something isn't working

Comments

@ramongonze
Copy link

Problem: The histogram provided by the package (for all attacks) is rounding probabilities with <0.5% to 0%, but for any attack the probability can never be 0. The probabilities should not be rounded.

Improvement: Each probability is being rounding before generating the histogram. The histogram could be given without any rounding in the following way:

  1. Set a histogram with 100 bins: (0,1], (1,2], ..., (99,100]
  2. Each probability could be placed in the corresponding bin. For instance, 0.4375 would be placed in the 43th bin, (43,44].

As the histogram is already a dictionary, it could be given with the interval of bins as the labels. For example:

{
"(0,1]": probability_1,
"(1,2]": probability_2,
...
"(99,100]": probability_100,
}
@luigidcsoares luigidcsoares added the bug Something isn't working label Apr 16, 2024
@luigidcsoares
Copy link
Collaborator

Regardless of what we use as the labels, we need to compute which bin each group belongs to. This is currently done by rounding, which causes the problem. Replacing round with math.ceil solves the problem partially. It helps with the rounding error, but there still might be floating-point errors. For example, say the actual value is 0.05, meaning that it should be placed in the bin (4, 5], but due to floating-point errors we get 0.0500000000001, Then, it is going to be rounded up to 0.06, and thus it is going to be placed in the bin (5, 6]. We can solve this, e.g., by using the Decimal type instead, but (probably) at the cost of performance. Nevertheless, perhaps this is not a real concern, as it should not occur very often, and the histogram is just a visual representation to try and understand how the data behaves.

Regarding the labels, @nunesgh I think you can use tuples for the intervals, and (this may depend on the Pandas version, but I'm using an old one, 1.5.3) you don't need to convert the dictionary to string:

>>> df = pd.DataFrame({"histogram": [ { (1, 2): 0.8, (2, 3): 0.2 }, { (1, 2): 0.3, (2, 3): 0.7 }  ]})
>>> df.columns
Index(['histogram'], dtype='object')
>>> df
                    histogram
0  {(1, 2): 0.8, (2, 3): 0.2}
1  {(1, 2): 0.3, (2, 3): 0.7}
>>> df.loc[0, "histogram"]
{(1, 2): 0.8, (2, 3): 0.2}
>>> type(df.loc[0, "histogram"])
<class 'dict'>

@nunesgh nunesgh self-assigned this Apr 16, 2024
@ramongonze
Copy link
Author

I still don't understand why it's necessary to round (or ceiling) the probabilities instead of only using the original value itself and place it in the proper bin.

@luigidcsoares
Copy link
Collaborator

place it in the proper bin.

And how would we do that without computing the index of the bin? We could traverse all the bins every time and compare the probability against the intervals, but it's worst in terms of performance and it doesn't solve the floating-point error.

@ramongonze
Copy link
Author

ramongonze commented Apr 17, 2024

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability $p$ the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities $p &lt; 0.01$ in the first bin.

@luigidcsoares
Copy link
Collaborator

If there are 100 bins, where bin 1 is (0,1], bin 2 (1,2], and so on, for a given probability p the index will be int(p*100). More precisely, max(1,int(p*100)), to put probabilities p<0.01 in the first bin.

You're still rounding, but down instead (or, more precisely, truncating). And, it doesn't work the way we want. Say $p = 1.5\%$. Then, max(1, int(1.5/100 * 100)) = max(1, int(1.5)) = max(1, 1) = 1, which in your approach corresponds to bin (0, 1], but 1.5 should be placed in the bin 2 (1, 2].

@ramongonze
Copy link
Author

I'm sorry, you're right. It seems that using ceiling is more appropriate. So the bin index would be ceil(p*100), where $p\in[0,1]$, right?

@luigidcsoares
Copy link
Collaborator

So the bin index would be ceil(p*100), where p∈[0,1], right?

That's right! Take a look at the changes in #5.

@utycampus

This comment was marked as spam.

Repository owner locked as spam and limited conversation to collaborators Aug 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
@nunesgh @luigidcsoares @ramongonze @utycampus and others