Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metrics counters for warn! and info! logs #13552

Merged
merged 11 commits into from
Jun 21, 2024

Conversation

brianolson
Copy link
Contributor

Description

Setup for future forge test requirements.
Add metrics counters for info!() and warn!() logs. Future testing can ensure that log volume doesn't grow too much for some amount of forge run.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Passes tests. Minimal impact.

Key Areas to Review

Rust style.

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Copy link

trunk-io bot commented Jun 4, 2024

⏱️ 13h 59m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-targeted-unit-tests 2h 20m 🟩🟩🟩🟩🟩 (+1 more)
windows-build 1h 46m 🟩🟩🟥
rust-smoke-tests 1h 38m 🟩🟩
rust-move-tests 1h 25m 🟩🟩🟩🟩🟩 (+1 more)
execution-performance / single-node-performance 1h 14m 🟩🟩🟩
forge-framework-upgrade-test / forge 48m 🟩🟩
rust-lints 43m 🟩🟩🟩🟩🟩 (+1 more)
rust-images / rust-all 39m 🟩🟩🟩
forge-compat-test / forge 38m 🟩🟩
forge-e2e-test / forge 37m 🟥🟩
run-tests-main-branch 27m 🟩🟩🟩🟩🟩 (+1 more)
cli-e2e-tests / run-cli-tests 21m 🟩🟩🟩
rust-build-cached-packages 19m 🟩🟩🟩
check 12m 🟩🟩🟩
general-lints 12m 🟩🟩🟩🟩🟩 (+1 more)
test-target-determinator 11m 🟩🟩🟩
execution-performance / test-target-determinator 10m 🟩🟩🟩
check-dynamic-deps 10m 🟩🟩🟩🟩🟩 (+1 more)
node-api-compatibility-tests / node-api-compatibility-tests 3m 🟩🟩🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 35s 🟩🟩🟩
permission-check 20s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 19s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 16s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 14s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 8s 🟩🟩🟩
determine-docker-build-metadata 6s 🟩🟩🟩

🚨 4 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
rust-move-tests 18m 9m +108%
rust-build-cached-packages 9m 5m +72%
rust-targeted-unit-tests 28m 17m +59%
rust-lints 10m 7m +57%

settingsfeedbackdocs ⋅ learn more about trunk.io

Comment on lines 65 to 66
$crate::Level::Warn => { $crate::WARN_LOG_COUNT.inc() },
$crate::Level::Info => { $crate::INFO_LOG_COUNT.inc() },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried about perf because this counter is shared and it's in critical path of logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metrics counters should be implemented in terms of atomic int increment, right? fast enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the metric is slowing things down, the logging is too much!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will lead to contention between totally unrelated threads, even if its atomic int. The CPU still needs to get a "lock" on the counter in some sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only check if a given log line is enabled down below, but we just increment the counter regardless.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather collect these stats when we actually decide to print them, on the other side of the channel. That way it's not in the critical path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Okay! Moved all the checks inside the enabled check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be I wasn't clear earlier. This is still on the critical path. This macro puts the log event into a channel which gets polled by a task and processed. I was suggesting to count there, instead of here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STRUCT_LOG_COUNT.inc(); is also currently on the call path before data gets to the queue. That's for all log events at any level.

The queue uses a try_send which may fail and drop the message. Counting things before that has value.

I still think atomic counters are pretty cheap and STRUCT_LOG_COUNT is okay and one more counter would be okay too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, moved all the counting to the thread on the other side of the logging channel

@brianolson
Copy link
Contributor Author

e.g. recent local test output includes (info and warn are new)

aptos_info_log_count 26298
aptos_struct_log_count 40259
aptos_warn_log_count 204

@brianolson brianolson requested a review from bchocho June 5, 2024 16:38
Copy link
Contributor

@ibalajiarun ibalajiarun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

@brianolson brianolson enabled auto-merge (squash) June 6, 2024 22:21

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

auto-merge was automatically disabled June 7, 2024 21:38

Base branch was modified

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f

Compatibility test results for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f (PR)
1. Check liveness of validators at old version: f648076a280621dbfd4e73b1ca83e3a3f52878ed
compatibility::simple-validator-upgrade::liveness-check : committed: 6419.900263464223 txn/s, latency: 4745.966009098428 ms, (p50: 4900 ms, p90: 7700 ms, p99: 10200 ms), latency samples: 241800
2. Upgrading first Validator to new version: 29d62070638f91a71be6bacaf5079581482b528f
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1400.9446971754892 txn/s, latency: 18529.933619309082 ms, (p50: 21300 ms, p90: 30700 ms, p99: 32300 ms), latency samples: 87420
3. Upgrading rest of first batch to new version: 29d62070638f91a71be6bacaf5079581482b528f
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 3029.9264384909716 txn/s, latency: 10164.248791156355 ms, (p50: 11200 ms, p90: 13900 ms, p99: 14300 ms), latency samples: 125740
4. upgrading second batch to new version: 29d62070638f91a71be6bacaf5079581482b528f
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6356.014508268364 txn/s, latency: 5109.078612472536 ms, (p50: 4900 ms, p90: 8200 ms, p99: 9400 ms), latency samples: 236680
5. check swarm health
Compatibility test for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 29d62070638f91a71be6bacaf5079581482b528f

two traffics test: inner traffic : committed: 8559.339675423138 txn/s, latency: 4578.191810403111 ms, (p50: 4500 ms, p90: 5600 ms, p99: 9600 ms), latency samples: 3697740
two traffics test : committed: 99.93089020835562 txn/s, latency: 1994.0670588235294 ms, (p50: 2000 ms, p90: 2300 ms, p99: 2400 ms), latency samples: 1700
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.217, avg: 0.214", "QsPosToProposal: max: 0.257, avg: 0.248", "ConsensusProposalToOrdered: max: 0.317, avg: 0.292", "ConsensusOrderedToCommit: max: 0.381, avg: 0.366", "ConsensusProposalToCommit: max: 0.669, avg: 0.658"]
Max round gap was 1 [limit 4] at version 1310196. Max no progress secs was 4.841502 [limit 15] at version 1310196.
Test Ok

Copy link
Contributor

✅ Forge suite framework_upgrade success on f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f

Compatibility test results for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f (PR)
Upgrade the nodes to version: 29d62070638f91a71be6bacaf5079581482b528f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1216.5826277548501 txn/s, submitted: 1219.1128619967303 txn/s, failed submission: 2.530234241880006 txn/s, expired: 2.530234241880006 txn/s, latency: 2729.0494611457743 ms, (p50: 2100 ms, p90: 4500 ms, p99: 8700 ms), latency samples: 105780
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1233.3047311812584 txn/s, submitted: 1236.76842168261 txn/s, failed submission: 3.4636905013515964 txn/s, expired: 3.4636905013515964 txn/s, latency: 2486.1750327654 ms, (p50: 1800 ms, p90: 4500 ms, p99: 8600 ms), latency samples: 106820
5. check swarm health
Compatibility test for f648076a280621dbfd4e73b1ca83e3a3f52878ed ==> 29d62070638f91a71be6bacaf5079581482b528f passed
Upgrade the remaining nodes to version: 29d62070638f91a71be6bacaf5079581482b528f
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1207.6882472793413 txn/s, submitted: 1209.662310887317 txn/s, failed submission: 1.974063607975676 txn/s, expired: 1.974063607975676 txn/s, latency: 2657.8918089357066 ms, (p50: 2100 ms, p90: 5100 ms, p99: 8400 ms), latency samples: 110120
Test Ok

@brianolson brianolson merged commit 43d709c into aptos-labs:main Jun 21, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants