Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round decimal record difference percent to three decimals after zeros #4065

Conversation

mattfergoda
Copy link
Contributor

@mattfergoda mattfergoda commented Apr 8, 2024

Fixes

Fixes #1581 by @AetherUnbound

Description

Round the data refresh percent change report to 3 digits after the decimal, ignoring zeros. E.g. 100.012365 -> 100.0124.

Testing Instructions

just catalog/dags/test

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 📄 aspect: text Concerns the textual material in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🏷 status: label work required Needs proper labelling before it can be worked on labels Apr 8, 2024
@mattfergoda mattfergoda marked this pull request as ready for review April 8, 2024 20:04
@mattfergoda mattfergoda requested a review from a team as a code owner April 8, 2024 20:04
@mattfergoda mattfergoda requested review from krysal and stacimc April 8, 2024 20:04
@AetherUnbound AetherUnbound changed the title Round decimal record difference percentages to three decimals after z… Round decimal record difference percent to three decimals after zeros Apr 8, 2024
@AetherUnbound AetherUnbound added 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed 🏷 status: label work required Needs proper labelling before it can be worked on 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 8, 2024
Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mattfergoda! I'm pretty sure there is a simpler way to accomplish this, as long as we don't get too caught up with always wanting three digits on very small decimals. For our context, I'd say that level of precision is not worth the code required to achieve it, when using format-spec will give us all the information we could possibly actually need and use.

Comment on lines 41 to 58
def _round_sig_figs(num, n):
"""
Round number to the nearest n digits.
If num is an integer, don't change it.
If num has a decimal, round to 3 significant figures (after any zeros),
rounding up if the next digit is 5 or greater.
"""

if num % 1 == 0:
return int(num)

# Round decimal part to 3 sig figs. Leave int part untouched.
[int_part, dec_part] = str(num).split(".")

# Have to awkwardly prepend the decimal part with '0.' to get correct
# rounding behavior, then slice '0.' off before concatenating
# the final number.
return float(int_part + "." + "{:.{n}g}".format(float("0." + dec_part), n=n)[2:])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is over complicating things a bit, I think for the sake of always having 3 significant digits. Using some fancy string format spec, the whole thing can be simplified, so long as we don't care that 0.00005646% will get truncated to just 0.00006%... and frankly, at such small percentages, I really don't think it matters to have accuracy past 1/100, it just isn't interesting or useful information to have such abstract and small percentages.

I've left a suggestion on the line above for format spec to use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarayourfriend I was talking about this with @mattfergoda in person - many times for the data refresh, we do have changes that are quite small by percentage but still register as a +/- half a million record difference. Things like 0.0654% or even 0.00481%. For me, and perhaps this is just my own perspective, truncating those down completely does lose some of the specificity we have. Maybe it doesn't matter, but whatever change we make I'd like to at least be sure we have a few digits available a couple of decimal points out.

Copy link
Collaborator

@sarayourfriend sarayourfriend Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "truncating them down completely"? As I wrote:

0.00005646% will get truncated to just 0.00006%

In what way is this a meaningful difference in the information, when the actual count is also present? And in particular balanced against the astronomical difference in complexity of the two implementations? One is literally a single line, and the other is an entirely new, rather obscure function, used in precisely one place, to give digits of significance in percentage that are beyond the ability for humans to even understand. For percentages this small, the count is far more useful, I would think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are all fair points! We can go with the much simpler implementation with that in mind 🙂

@@ -30,9 +30,29 @@ def report_record_difference(before: str, after: str, media_type: str, dag_id: s
Data refresh for {media_type} complete! :tada:
_Note: All values are row estimates and are not (but nearly) exact_
*Record count difference for `{media_type}`*: {before:,} → {after:,}
*Change*: {count_diff:+,} ({percent_diff:+}% Δ)
*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned below, the _round_sig_figs can be removed if we simplify things using just format spec significant digits instructions.

If we use g, the general format, this will use scientific notation in some cases:

>>> "{n:+3g}".format(n=0.0000512346888129371512345)
'+5.12347e-05'

So, I think it would be better to remove the * 100 on line 24, and use the % spec instead:

>>> "{n:+3%}".format(n=0.000000512346888129371512345)
'+0.000051%'
Suggested change
*Change*: {count_diff:+,} ({_round_sig_figs(percent_diff, 3):+}% Δ)
*Change*: {count_diff:+,} (percent_diff:+3%} Δ)

If you apply that suggestion, please update line 24 as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the 3 here does not seem to have any effect 🤔

[nav] In [3]: "{n:+3%}".format(n=0.001000512346888129371512345)
Out[3]: '+0.100051%'

[nav] In [4]: "{n:+3%}".format(n=0.001000512340888129371512345)
Out[4]: '+0.100051%'

[nav] In [5]: "{n:+3%}".format(n=0.0000512340888129371512345)
Out[5]: '+0.005123%'

[nav] In [6]: "{n:+2%}".format(n=0.0000512340888129371512345)
Out[6]: '+0.005123%'

[nav] In [7]: "{n:+1%}".format(n=0.0000512340888129371512345)
Out[7]: '+0.005123%'

[nav] In [8]: "{n:+%}".format(n=0.0000512340888129371512345)
Out[8]: '+0.005123%'

Copy link
Contributor Author

@mattfergoda mattfergoda Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "{n:+3%}".format(n=...) is the agreed way forward, I'm happy to make this change and update the tests.

Copy link
Collaborator

@sarayourfriend sarayourfriend Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number is apparently ignored for %, TIL. It does have an effect on general formatting, and I converted to % from that, not realising it wouldn't make a difference.

>>> "{n:+3g}".format(n=52.212331223456234512)
'+52.2123'
>>> "{n:+3g}".format(n=0.001000512346888129371512345)
'+0.00100051'
>>> "{n:+3g}".format(n=0.6888129371512345)
'+0.688813'
>>> "{n:+3g}".format(n=0.000000000000000000000000888129371512345)
'+8.88129e-25'

The format spec just needs to be n:+% then, but only if this question pans out in favour of the simpler format spec approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sara's right, I think this approach will be way more straightforward than anything I'm suggesting. If you're comfortable with making those changes Matt, please proceed!

@openverse-bot
Copy link
Collaborator

Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 7 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)2.

@mattfergoda, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend sarayourfriend marked this pull request as draft April 18, 2024 01:40
@mattfergoda mattfergoda marked this pull request as ready for review April 22, 2024 18:05
@mattfergoda
Copy link
Contributor Author

@AetherUnbound @sarayourfriend I simplified the rounding logic based on the above conversation and updated the tests accordingly. I added a parametrized test case to the existing ones where the decimal value is truncated.

Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattfergoda, thanks for updating the PR. Left one comment, but it's unclear to me whether a change is needed until CI runs. Can you rebase the PR for that?

catalog/dags/data_refresh/reporting.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @mattfergoda, and for your patience while iterating on this issue! It looks like GitHub is having an outage at the moment which is preventing CI from running, but this LGTM.

@AetherUnbound
Copy link
Collaborator

Ah, it looks like this is conflicting with #4067 - it seems easiest to squash the commits and rebase, in the interest of time I'll go ahead and take care of that.

@AetherUnbound AetherUnbound force-pushed the 1581-Truncate-data-refresh-percent-change-report-to-3-digits-after-the-decimal branch from 2cb49e7 to 21ea1f4 Compare April 24, 2024 17:34
@mattfergoda
Copy link
Contributor Author

Thank you for the contribution @mattfergoda, and for your patience while iterating on this issue! It looks like GitHub is having an outage at the moment which is preventing CI from running, but this LGTM.

@AetherUnbound @sarayourfriend thanks for letting me contribute and for the mentorship and feedback!

…eros.

Co-authored-by: sarayourfriend <[email protected]>
Co-authored-by: Matt Fergoda <[email protected]>
@AetherUnbound AetherUnbound force-pushed the 1581-Truncate-data-refresh-percent-change-report-to-3-digits-after-the-decimal branch from 21ea1f4 to 387c383 Compare April 24, 2024 22:26
Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT! Thanks again @mattfergoda 👍

@AetherUnbound AetherUnbound merged commit 278f6b1 into WordPress:main Apr 25, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Truncate data refresh percent change report to 3 digits after the decimal
4 participants