Round decimal record difference percent to three decimals after zeros #4065

mattfergoda · 2024-04-08T19:48:48Z

Fixes

Description

Round the data refresh percent change report to 3 digits after the decimal, ignoring zeros. E.g. 100.012365 -> 100.0124.

Testing Instructions

just catalog/dags/test

Checklist

My pull request has a descriptive title (not a vague title like `Update index.md`).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

sarayourfriend

Thanks for the PR @mattfergoda! I'm pretty sure there is a simpler way to accomplish this, as long as we don't get too caught up with always wanting three digits on very small decimals. For our context, I'd say that level of precision is not worth the code required to achieve it, when using format-spec will give us all the information we could possibly actually need and use.

sarayourfriend · 2024-04-09T02:25:55Z

catalog/dags/data_refresh/reporting.py

+def _round_sig_figs(num, n):
+    """
+    Round number to the nearest n digits.
+    If num is an integer, don't change it.
+    If num has a decimal, round to 3 significant figures (after any zeros),
+    rounding up if the next digit is 5 or greater.
+    """
+
+    if num % 1 == 0:
+        return int(num)
+
+    # Round decimal part to 3 sig figs. Leave int part untouched.
+    [int_part, dec_part] = str(num).split(".")
+
+    # Have to awkwardly prepend the decimal part with '0.' to get correct
+    # rounding behavior, then slice '0.' off before concatenating
+    # the final number.
+    return float(int_part + "." + "{:.{n}g}".format(float("0." + dec_part), n=n)[2:])


This is overcomplicating things a bit, I think, for the sake of always having 3 significant digits. Using some fancy string format spec, the whole thing can be simplified, so long as we don't care that 0.00005646% will get truncated to just 0.00006%... and frankly, at such small percentages, I really don't think it matters to have accuracy past 1/100; such abstract and small percentages just aren't interesting or useful information.

I've left a suggestion on the line above for format spec to use.
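To make the trade-off concrete, here is a small sketch that copies the helper from the diff above and runs it on a few inputs. The sample values are illustrative assumptions, not taken from a real data refresh. The last two are inputs where the string-splitting approach appears to return an unexpected value: one where the decimal part rounds up to a whole number, and one where `str()` renders the float in scientific notation.

```python
def _round_sig_figs(num, n):
    """Copy of the helper from the diff above, reproduced for experimentation."""
    if num % 1 == 0:
        return int(num)
    int_part, dec_part = str(num).split(".")
    return float(int_part + "." + "{:.{n}g}".format(float("0." + dec_part), n=n)[2:])


# Illustrative inputs (assumptions, not real report values).
typical = _round_sig_figs(100.012365, 3)  # example from the PR description -> 100.0124
carry = _round_sig_figs(1.9996, 3)        # "0.9996" formats as "1", so the result is 1.0
tiny = _round_sig_figs(5.646e-05, 3)      # str() gives '5.646e-05', result is 5.46e-06

print(typical, carry, tiny)
```

The second and third results differ from what rounding to three significant figures would give (2.0 and roughly 5.65e-05), which is one more argument for leaning on the built-in format spec.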

AetherUnbound · 2024-04-11

@sarayourfriend I was talking about this with @mattfergoda in person - many times for the data refresh, we do have changes that are quite small by percentage but still register as a +/- half a million record difference. Things like 0.0654% or even 0.00481%. For me, and perhaps this is just my own perspective, truncating those down completely does lose some of the specificity we have. Maybe it doesn't matter, but whatever change we make I'd like to at least be sure we have a few digits available a couple of decimal points out.

sarayourfriend · 2024-04-16

What do you mean by "truncating them down completely"? As I wrote:

> 0.00005646% will get truncated to just 0.00006%

In what way is this a meaningful difference in the information, when the actual count is also present? And in particular balanced against the astronomical difference in complexity of the two implementations? One is literally a single line, and the other is an entirely new, rather obscure function, used in precisely one place, to give digits of significance in percentage that are beyond the ability for humans to even understand. For percentages this small, the count is far more useful, I would think.

AetherUnbound · 2024-04-16

Those are all fair points! We can go with the much simpler implementation with that in mind 🙂

sarayourfriend · 2024-04-09T02:27:44Z

catalog/dags/data_refresh/reporting.py

@@ -30,9 +30,29 @@ def report_record_difference(before: str, after: str, media_type: str, dag_id: s
 Data refresh for {media_type} complete! :tada:
 _Note: All values are row estimates and are not (but nearly) exact_
 *Record count difference for `{media_type}`*: {before:,} → {after:,}
-*Change*: {count_diff:+,} ({percent_diff:+}% Δ)
+*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ)


As mentioned below, the _round_sig_figs can be removed if we simplify things using just format spec significant digits instructions.

If we use g, the general format, this will use scientific notation in some cases:

>>> "{n:+3g}".format(n=0.0000512346888129371512345)
'+5.12347e-05'

So, I think it would be better to remove the * 100 on line 24, and use the % spec instead:

>>> "{n:+3%}".format(n=0.000000512346888129371512345)
'+0.000051%'

Suggested change

-*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ)
+*Change*: {count_diff:+,} ({percent_diff:+3%} Δ)

If you apply that suggestion, please update line 24 as well.
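A minimal sketch of the difference between the two format types discussed here. The helper name and the sample counts are hypothetical, and the real `report_record_difference` function builds a longer message. The point is only that `g` falls back to scientific notation for small values, while `%` expects a fraction (hence dropping the `* 100`) and renders it in fixed-point notation.

```python
def format_change(before: int, after: int) -> str:
    """Hypothetical stand-in for the 'Change' line of the report."""
    count_diff = after - before
    # Keep the ratio as a fraction: the `%` type multiplies by 100 itself.
    fraction = count_diff / before
    return f"*Change*: {count_diff:+,} ({fraction:+%} Δ)"


# Sample counts are made up for illustration.
line = format_change(10_000_000, 10_000_512)
# The same fraction with the general format switches to scientific notation.
general = f"{512 / 10_000_000:+g}"

print(line)     # *Change*: +512 (+0.005120% Δ)
print(general)  # +5.12e-05
```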

AetherUnbound · 2024-04-11

FWIW the 3 here does not seem to have any effect 🤔

[nav] In [3]: "{n:+3%}".format(n=0.001000512346888129371512345)
Out[3]: '+0.100051%'

[nav] In [4]: "{n:+3%}".format(n=0.001000512340888129371512345)
Out[4]: '+0.100051%'

[nav] In [5]: "{n:+3%}".format(n=0.0000512340888129371512345)
Out[5]: '+0.005123%'

[nav] In [6]: "{n:+2%}".format(n=0.0000512340888129371512345)
Out[6]: '+0.005123%'

[nav] In [7]: "{n:+1%}".format(n=0.0000512340888129371512345)
Out[7]: '+0.005123%'

[nav] In [8]: "{n:+%}".format(n=0.0000512340888129371512345)
Out[8]: '+0.005123%'

mattfergoda · 2024-04-12

If "{n:+3%}".format(n=...) is the agreed way forward, I'm happy to make this change and update the tests.

sarayourfriend · 2024-04-16

The number is apparently ignored for %, TIL. It does have an effect on general formatting, and I converted to % from that, not realising it wouldn't make a difference.

>>> "{n:+3g}".format(n=52.212331223456234512)
'+52.2123'
>>> "{n:+3g}".format(n=0.001000512346888129371512345)
'+0.00100051'
>>> "{n:+3g}".format(n=0.6888129371512345)
'+0.688813'
>>> "{n:+3g}".format(n=0.000000000000000000000000888129371512345)
'+8.88129e-25'

The format spec just needs to be n:+% then, but only if this question pans out in favour of the simpler format spec approach.
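For reference, in Python's format specification mini-language a bare number before the type character is the minimum field width, and precision is written with a leading dot. That would be consistent with the `3` changing nothing in the sessions above: every result is already wider than three characters, and both the `%` and `g` outputs show the default precision of 6. A short sketch, reusing one sample value from the thread (the variable names are illustrative):

```python
value = 0.0000512340888129371512345  # sample fraction from the session above

default_pct = f"{value:+%}"    # default precision: 6 digits after the point
width_pct = f"{value:+3%}"     # `3` is a minimum width, so nothing changes
wide_pct = f"{value:+12%}"     # a larger width pads on the left
precise_pct = f"{value:+.3%}"  # `.3` is precision: 3 digits after the point
sig_general = f"{value:+.3g}"  # with `g`, `.3` means 3 significant digits

print(repr(default_pct))  # '+0.005123%'
print(repr(width_pct))    # '+0.005123%'
print(repr(wide_pct))     # '  +0.005123%'
print(repr(precise_pct))  # '+0.005%'
print(repr(sig_general))  # '+5.12e-05'
```

If a fixed number of decimals were ever wanted, `.N%` would be the knob to turn; the thread settles on the plain `+%` spec.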

AetherUnbound · 2024-04-18

Sara's right, I think this approach will be way more straightforward than anything I'm suggesting. If you're comfortable with making those changes Matt, please proceed!

openverse-bot · 2024-04-18T00:00:11Z

Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 7 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)².

@mattfergoda, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

mattfergoda · 2024-04-22T18:16:32Z

@AetherUnbound @sarayourfriend I simplified the rounding logic based on the above conversation and updated the tests accordingly. I added a parametrized test case to the existing ones where the decimal value is truncated.

sarayourfriend

@mattfergoda, thanks for updating the PR. Left one comment, but it's unclear to me whether a change is needed until CI runs. Can you rebase the PR for that?

catalog/dags/data_refresh/reporting.py

AetherUnbound

Thank you for the contribution @mattfergoda, and for your patience while iterating on this issue! It looks like GitHub is having an outage at the moment which is preventing CI from running, but this LGTM.

AetherUnbound · 2024-04-24T17:21:58Z

Ah, it looks like this is conflicting with #4067. It seems easiest to squash the commits and rebase; in the interest of time, I'll go ahead and take care of that.

mattfergoda · 2024-04-24T18:14:30Z

> Thank you for the contribution @mattfergoda, and for your patience while iterating on this issue! It looks like GitHub is having an outage at the moment which is preventing CI from running, but this LGTM.

@AetherUnbound @sarayourfriend thanks for letting me contribute and for the mentorship and feedback!


sarayourfriend

LGTM! Thanks again @mattfergoda 👍

mattfergoda marked this pull request as ready for review April 8, 2024 20:04

mattfergoda requested a review from a team as a code owner April 8, 2024 20:04

mattfergoda requested review from krysal and stacimc April 8, 2024 20:04

AetherUnbound changed the title ~~Round decimal record difference percentages to three decimals after z…~~ Round decimal record difference percent to three decimals after zeros Apr 8, 2024

AetherUnbound added 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed 🏷 status: label work required Needs proper labelling before it can be worked on 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 8, 2024

sarayourfriend requested changes Apr 9, 2024


sarayourfriend marked this pull request as draft April 18, 2024 01:40

mattfergoda marked this pull request as ready for review April 22, 2024 18:05

mattfergoda requested a review from sarayourfriend April 22, 2024 18:16

mattfergoda requested a review from AetherUnbound April 22, 2024 18:16

sarayourfriend reviewed Apr 23, 2024


AetherUnbound approved these changes Apr 24, 2024


AetherUnbound force-pushed the 1581-Truncate-data-refresh-percent-change-report-to-3-digits-after-the-decimal branch from 2cb49e7 to 21ea1f4 Compare April 24, 2024 17:34

Round decimal record difference percentages to three decimals after z…

387c383

…eros. Co-authored-by: sarayourfriend <[email protected]> Co-authored-by: Matt Fergoda <[email protected]>

AetherUnbound force-pushed the 1581-Truncate-data-refresh-percent-change-report-to-3-digits-after-the-decimal branch from 21ea1f4 to 387c383 Compare April 24, 2024 22:26

sarayourfriend approved these changes Apr 24, 2024


AetherUnbound merged commit 278f6b1 into WordPress:main Apr 25, 2024
40 checks passed
