-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Round decimal record difference percent to three decimals after zeros #4065
Round decimal record difference percent to three decimals after zeros #4065
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @mattfergoda! I'm pretty sure there is a simpler way to accomplish this, as long as we don't get too caught up with always wanting three digits on very small decimals. For our context, I'd say that level of precision is not worth the code required to achieve it, when using format-spec will give us all the information we could possibly actually need and use.
def _round_sig_figs(num, n): | ||
""" | ||
Round number to the nearest n digits. | ||
If num is an integer, don't change it. | ||
If num has a decimal, round to 3 significant figures (after any zeros), | ||
rounding up if the next digit is 5 or greater. | ||
""" | ||
|
||
if num % 1 == 0: | ||
return int(num) | ||
|
||
# Round decimal part to 3 sig figs. Leave int part untouched. | ||
[int_part, dec_part] = str(num).split(".") | ||
|
||
# Have to awkwardly prepend the decimal part with '0.' to get correct | ||
# rounding behavior, then slice '0.' off before concatenating | ||
# the final number. | ||
return float(int_part + "." + "{:.{n}g}".format(float("0." + dec_part), n=n)[2:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is over complicating things a bit, I think for the sake of always having 3 significant digits. Using some fancy string format spec, the whole thing can be simplified, so long as we don't care that 0.00005646%
will get truncated to just 0.00006%
... and frankly, at such small percentages, I really don't think it matters to have accuracy past 1/100, it just isn't interesting or useful information to have such abstract and small percentages.
I've left a suggestion on the line above for format spec to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarayourfriend I was talking about this with @mattfergoda in person - many times for the data refresh, we do have changes that are quite small by percentage but still register as a +/- half a million record difference. Things like 0.0654%
or even 0.00481%
. For me, and perhaps this is just my own perspective, truncating those down completely does lose some of the specificity we have. Maybe it doesn't matter, but whatever change we make I'd like to at least be sure we have a few digits available a couple of decimal points out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "truncating them down completely"? As I wrote:
0.00005646% will get truncated to just 0.00006%
In what way is this a meaningful difference in the information, when the actual count is also present? And in particular balanced against the astronomical difference in complexity of the two implementations? One is literally a single line, and the other is an entirely new, rather obscure function, used in precisely one place, to give digits of significance in percentage that are beyond the ability for humans to even understand. For percentages this small, the count is far more useful, I would think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are all fair points! We can go with the much simpler implementation with that in mind 🙂
@@ -30,9 +30,29 @@ def report_record_difference(before: str, after: str, media_type: str, dag_id: s | |||
Data refresh for {media_type} complete! :tada: | |||
_Note: All values are row estimates and are not (but nearly) exact_ | |||
*Record count difference for `{media_type}`*: {before:,} → {after:,} | |||
*Change*: {count_diff:+,} ({percent_diff:+}% Δ) | |||
*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned below, the _round_sig_figs
can be removed if we simplify things using just format spec significant digits instructions.
If we use g
, the general format, this will use scientific notation in some cases:
>>> "{n:+3g}".format(n=0.0000512346888129371512345)
'+5.12347e-05'
So, I think it would be better to remove the * 100
on line 24, and use the %
spec instead:
>>> "{n:+3%}".format(n=0.000000512346888129371512345)
'+0.000051%'
*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ) | |
*Change*: {count_diff:+,} (percent_diff:+3%} Δ) |
If you apply that suggestion, please update line 24 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the 3
here does not seem to have any effect 🤔
[nav] In [3]: "{n:+3%}".format(n=0.001000512346888129371512345)
Out[3]: '+0.100051%'
[nav] In [4]: "{n:+3%}".format(n=0.001000512340888129371512345)
Out[4]: '+0.100051%'
[nav] In [5]: "{n:+3%}".format(n=0.0000512340888129371512345)
Out[5]: '+0.005123%'
[nav] In [6]: "{n:+2%}".format(n=0.0000512340888129371512345)
Out[6]: '+0.005123%'
[nav] In [7]: "{n:+1%}".format(n=0.0000512340888129371512345)
Out[7]: '+0.005123%'
[nav] In [8]: "{n:+%}".format(n=0.0000512340888129371512345)
Out[8]: '+0.005123%'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "{n:+3%}".format(n=...)
is the agreed way forward, I'm happy to make this change and update the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number is apparently ignored for %
, TIL. It does have an effect on general formatting, and I converted to %
from that, not realising it wouldn't make a difference.
>>> "{n:+3g}".format(n=52.212331223456234512)
'+52.2123'
>>> "{n:+3g}".format(n=0.001000512346888129371512345)
'+0.00100051'
>>> "{n:+3g}".format(n=0.6888129371512345)
'+0.688813'
>>> "{n:+3g}".format(n=0.000000000000000000000000888129371512345)
'+8.88129e-25'
The format spec just needs to be n:+%
then, but only if this question pans out in favour of the simpler format spec approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sara's right, I think this approach will be way more straightforward than anything I'm suggesting. If you're comfortable with making those changes Matt, please proceed!
Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 7 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)2. @mattfergoda, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
@AetherUnbound @sarayourfriend I simplified the rounding logic based on the above conversation and updated the tests accordingly. I added a parametrized test case to the existing ones where the decimal value is truncated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattfergoda, thanks for updating the PR. Left one comment, but it's unclear to me whether a change is needed until CI runs. Can you rebase the PR for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @mattfergoda, and for your patience while iterating on this issue! It looks like GitHub is having an outage at the moment which is preventing CI from running, but this LGTM.
Ah, it looks like this is conflicting with #4067 - it seems easiest to squash the commits and rebase, in the interest of time I'll go ahead and take care of that. |
2cb49e7
to
21ea1f4
Compare
@AetherUnbound @sarayourfriend thanks for letting me contribute and for the mentorship and feedback! |
…eros. Co-authored-by: sarayourfriend <[email protected]> Co-authored-by: Matt Fergoda <[email protected]>
21ea1f4
to
387c383
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGMT! Thanks again @mattfergoda 👍
Fixes
Fixes #1581 by @AetherUnbound
Description
Round the data refresh percent change report to 3 digits after the decimal, ignoring zeros. E.g. 100.012365 -> 100.0124.
Testing Instructions
just catalog/dags/test
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin