Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

histogram: make timeseries histogram visualization compatible with v3 data format #5392

Merged
merged 6 commits into from
Nov 4, 2021

Conversation

yatbear
Copy link
Member

@yatbear yatbear commented Oct 29, 2021

Change the timeseries histogram visualization logics, before the count of a 0-width bin is split into two halves, now each zero-width bin count will be full allocated to one bin, with the restriction that all redistributed bins (or ranges) are closed-open except the last one.

  • Current visualization implementation with v3 histogram data input:
    image

  • Current visualization implementation with v2 histogram data input:
    image

#histogram #tpu

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on! Code LGTM, just some comments on the code comments and tests. I do think this makes the Time Series visualization more correct, even if it's hard to actually generate a case where it is visually different.

FWIW - I expected to be able to get this to visually look different by changing the demo's single_value example to be ten 0.0 values. This should have triggered the "wrong" treatment (two peaks of count 5, rather than one peak of count 10), because this code would generate 30 intervals from -1.0 to 1.0, and right at the middle, the 0.0 values would fall on the middle boundary and get split 0.5/0.5.

However, that doesn't happen, which I eventually figured out is due to floating point error 🤦‍♂️ The bins generated by the logic in that case actually look like this:

0: {x: -1, dx: 0.06666666666666667, y: 0}
1: {x: -0.9333333333333333, dx: 0.06666666666666667, y: 0}
...
14: {x: -0.06666666666666665, dx: 0.06666666666666667, y: 10}
15: {x: 0, dx: 0.06666666666666667, y: 0}
...
29: ​{x: 0.9333333333333333, dx: 0.06666666666666667, y: 0}

Critically, although the x for bin 15 is exactly 0, for bin 14, x = 0.0666...5 while dx = 0.0666...7, so they don't quite add up to 0.0:

>>> -0.06666666666666665 + 0.06666666666666667
1.3877787807814457e-17

As a result, because it sees that 0.0 < 1.38777...e-17, doesn't split them evenly after all (it does technically put them in the "wrong" bin, but visually that's less important).

Anyway, maybe at some point I'll try to fix this by ensuring we actually use the same exact values for left/right edges of adjacent bins, since at least in theory having two different calculations could mean we double-count or incorrectly omit counts.

@@ -149,8 +149,14 @@ function rebuildBins(bins: Bin[], range: Range, binCount: number): Bin[] {

/**
* Computes how much of the input bin's 'y' counts should be allocated to a new
* range. For 0 width input bins, the allocation may be split in half across 2
* bins.
* range. For 0 width input bins, the allocation will be distributed to the last
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this description isn't quite right - zero width input bins shouldn't always get distributed to the last "result" bin (the one that is closed-closed), it depends on the alignment of the bins.

For example, we can imagine an input which has the following bins, where the final bin in each step has count 10:

Step 1: [0.0, 0.0), [0.0, 0.0]
Step 2: [1.0, 1.0), [1.0, 1.0]

In this case, if we imagine the output also has 2 bins, then it will span the range [0.0, 1.0] (the min and max across all steps), and so the output bins will be [0.0, 0.5) and [0.5, 1.0]. In that case, the count of 10 from step 1 should get distributed to the first output bin, not to the final output bin.

What I would recommend here is actually not to talk about the overall distribution of bins and just to focus on the specific arguments to the function. E.g. something like:

Computes how much of the input bin's 'y' counts should be allocated to this output bin.

Where both bins have non-zero width, this computed by multiplying the input y value by the ratio of the width-wise overlap in the bins to the total width of the output bin. (This can be thought of redistributing the overlapping "area" of the bar in the input histogram across the full width of the output bin.)

When the input bin has zero width (the output bin cannot have zero width by construction), we instead have to consider several cases depending on the open/closed-ness of the underlying intervals. If the zero width input bin has y value 0, the contribution is always 0. Otherwise, if zero width input bin has y value greater than 0, it must represent the closed interval [x, x]. In this case, it contributes the full value of y if and only if the output bin's interval contains x. This interval is the closed-open interval [resultLeft, resultRight), except if resultHasRightNeighbor is false, in which case it's the closed interval [resultLeft, resultRight].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the description, thanks!

@@ -276,8 +276,8 @@ describe('histogram util', () => {
)
).toEqual([
[
{x: 0, dx: 5, y: 100},
{x: 5, dx: 5, y: 100},
{x: 0, dx: 5, y: 0},
Copy link
Contributor

@nfelt nfelt Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, these input bins aren't a great test case at this point, since they're definitely not valid histogram v3 format (they're non-contiguous because the bins have gaps between them, and zero-width bins should only have a non-zero count if they're the final bin). Ideally with the histogram v3 data_compat.py changes, we should be validating that any histogram data reaching the plugin is valid v3 format, so it shouldn't be possible for data like this to actually reach the frontend.

If you don't mind updating it, bins like this would be more realistic (EDIT: corrected 3rd bin's dx value):

binsToHistogram([
  {x: 0, dx: 1, y: 10},
  {x: 1, dx: 1, y: 10},
  {x: 2, dx: 0, y: 10},
]),

With the expected 2-bin output being [{x: 0, dx: 1, y:10}, {x:1, dx:1, y:20}]. You could also just combine this with the other test case below IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the example to be zero-width bin input in two steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That makes sense for the last test case here about multiple steps, but the input bins here could still be changed? Sorry, the GH UI won't let me comment directly, but what I'm referring to is lines 268-272 in this file:

                binsToHistogram([
                  {x: 0, dx: 1, y: 0},
                  {x: 5, dx: 0, y: 200},
                  {x: 9, dx: 1, y: 0},
                ]),

As described below, binsToHistogram() is supposed to simulate a single step's worth of data, but in the valid v3 format, we should never get a single step with bins like this (where there is a zero-width bin that is not the final bin, but is not empty either). Hence the suggestion to modify this input to something more realistic.

Note that I made a mistake in my original comment in one of the dx values in my example, which I've edited to fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the test case.

{x: 0, dx: 5, y: 100},
{x: 5, dx: 5, y: 100},
{x: 0, dx: 5, y: 0},
{x: 5, dx: 5, y: 200},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind also changing the name of this test case? "redistributes 0 width bin evenly over edges of result bins" is no longer quite accurate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this test case since it's basically duplicate with the newly edited test suggested above (produces result bins from multiple 0 width bins in different steps).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I actually commented in the wrong place - I meant to comment on the test name redistributes 0 width bin evenly over edges of result bins (line 263) but GH won't let me comment there. It's the "evenly" that isn't accurate any more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Renamed.

@yatbear yatbear requested a review from nfelt November 2, 2021 02:12
Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long review latency on this, and thank you for the documentation updates! Just added a few more comments on the tests, sorry my original comments weren't all that clear / had some mistakes.

* bins.
* Computes how much of the input bin's 'y' counts should be allocated to this output bin.
*
* Where both bins have non-zero width, this computed by multiplying the input y value by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing word, should be "this is computed" (my bad leaving that out originally)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for noticing!

* [resultLeft, resultRight), except if resultHasRightNeighbor is false, in which case it's
* the closed interval [resultLeft, resultRight].
*
* For example, assuming the middle bin has width 0 and count 20:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably omit this example, especially since the input bins shown aren't technically a valid set of bins and counts - since there is a zero-width bin that isn't the final bin, so it must be a closed-open interval, but it has a non-empty count, which shouldn't be possible.

If we do keep an example, I'd suggest phasing it in terms of the specific arguments to this function, like having a table in which there are columns for the function arguments and for the function return value, and each row represents a specific case, to help illustrate the logic in the paragraph of description. But that might also be better just implemented as a set of test cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

{x: 0, dx: 5, y: 100},
{x: 5, dx: 5, y: 100},
{x: 0, dx: 5, y: 0},
{x: 5, dx: 5, y: 200},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I actually commented in the wrong place - I meant to comment on the test name redistributes 0 width bin evenly over edges of result bins (line 263) but GH won't let me comment there. It's the "evenly" that isn't accurate any more.

'produces result bins with full and partial contributions from ' +
'multiple 0 width bins',
'produces result bins from multiple 0 width bins in different ' +
'steps',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simulate different steps, it would be best if for the input here we call binsToHistogram() more than once, e.g. something like this:

          expect(
            histogramsToBins(
              buildNormalizedHistograms(
                [
                  binsToHistogram([{x: 0, dx: 0, y: 200}]),
                  binsToHistogram([{x: 1.0, dx: 0, y: 100}]),
                ],
                2
              )
            )
          ).toEqual([...]);

That's because each call of binsToHistogram() is meant to represent a single step of data, and in a single step of the valid v3 format, there shouldn't be more than one zero-width bin that has data in it.

The resulting output should be the same; this is just a more accurate representation of how the buildNormalizedHistograms() API is intended to be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using two binsToHistogram above the result becomes:

[
  {x: 0, dx: 0.5, y: 200},
  {x: 0.5, dx: 0.5, y: 0},
],
[
  {x: 0, dx: 0.5, y: 0},
  {x: 0.5, dx: 0.5, y: 100},
]

hence moving this case under the multiple histograms category below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, ok, that makes sense!

@@ -276,8 +276,8 @@ describe('histogram util', () => {
)
).toEqual([
[
{x: 0, dx: 5, y: 100},
{x: 5, dx: 5, y: 100},
{x: 0, dx: 5, y: 0},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That makes sense for the last test case here about multiple steps, but the input bins here could still be changed? Sorry, the GH UI won't let me comment directly, but what I'm referring to is lines 268-272 in this file:

                binsToHistogram([
                  {x: 0, dx: 1, y: 0},
                  {x: 5, dx: 0, y: 200},
                  {x: 9, dx: 1, y: 0},
                ]),

As described below, binsToHistogram() is supposed to simulate a single step's worth of data, but in the valid v3 format, we should never get a single step with bins like this (where there is a zero-width bin that is not the final bin, but is not empty either). Hence the suggestion to modify this input to something more realistic.

Note that I made a mistake in my original comment in one of the dx values in my example, which I've edited to fix.

@yatbear yatbear merged commit 7859719 into tensorflow:master Nov 4, 2021
@yatbear yatbear deleted the histo_util branch November 4, 2021 22:59
yatbear added a commit to yatbear/tensorboard that referenced this pull request Mar 27, 2023
… data format (tensorflow#5392)

* make timeseries histogram visualization compatible with v3 data format

* improve getBinContribution description (thanks nfelt@!)

* update test for edge case zero-width bin scenarios

* remove duplicate test case

* update doc

* improve test
dna2github pushed a commit to dna2fork/tensorboard that referenced this pull request May 1, 2023
… data format (tensorflow#5392)

* make timeseries histogram visualization compatible with v3 data format

* improve getBinContribution description (thanks nfelt@!)

* update test for edge case zero-width bin scenarios

* remove duplicate test case

* update doc

* improve test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants