✨ (⚠️ devops) 🗃️ Is922 resource tracking/1. version of regular scraping #4380

@matusdrobuliak66 matusdrobuliak66 commented Jun 18, 2023

What do these changes do?

  • 🗃️ introduce a new table, resource_tracker_container
  • ✨ introduce a regular task that runs every 15 minutes (currently it scrapes container resource usage for jupyter-smash services from Prometheus and stores it in the Postgres table)

PromQL query:

sum without (cpu) (container_cpu_usage_seconds_total{image=~'registry.osparc.io/simcore/services/dynamic/jupyter-smash:.*'})[30m:1m]

We look for all containers, filtered by image name, that were running in the last 30 minutes (at 1-minute resolution). For each container we always update the last observed running timestamp and the last observed total CPU seconds used by the container.
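
To illustrate the update rule described above, here is a minimal sketch and not the code in this PR. It assumes the standard Prometheus HTTP API matrix response shape and assumes each series is keyed by the cAdvisor `id` label; `ContainerUsage`, `upsert_from_matrix` and the in-memory `table` are hypothetical stand-ins for the Postgres table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict

IMAGE_REGEX = "registry.osparc.io/simcore/services/dynamic/jupyter-smash:.*"
# Same query as in the description, built from the image regex.
PROMQL = (
    "sum without (cpu) "
    f"(container_cpu_usage_seconds_total{{image=~'{IMAGE_REGEX}'}})[30m:1m]"
)


@dataclass
class ContainerUsage:
    # Hypothetical row: last time the container was observed and its CPU counter.
    last_seen: datetime
    cpu_seconds_total: float


def upsert_from_matrix(payload: dict, table: Dict[str, ContainerUsage]) -> None:
    """Keep only the newest sample per container (assumed key: the `id` label)."""
    for series in payload["data"]["result"]:
        container_id = series["metric"]["id"]
        # Samples in a matrix result are ordered by time; the last one is newest.
        timestamp, value = series["values"][-1]
        seen = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
        current = table.get(container_id)
        if current is None or seen >= current.last_seen:
            table[container_id] = ContainerUsage(seen, float(value))
```

Because only the newest sample per container is kept, re-running the task over an overlapping 30-minute window would not create duplicate rows in this sketch.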

  • ♻️ services/resource-usage-tracker/src/simcore_service_resource_usage_tracker/resource_tracker_cli_placeholder.py can be ignored for now (even though it can be run through the CLI). It is the original code provided by DevOps; it is there only as a placeholder and will be removed/refactored in upcoming PRs.

Next steps:

  • adding needed labels to the computational services
  • based on @GitHK's input: always check the last updated row and start fetching data and populating the database from there
  • adding endpoint for frontend (so we can get quick feedback)
  • adding useful CLI commands (for example during an outage to be able to run/rerun the concrete task at a specific time)
  • adding an additional test that will test concurrency (inspired by test_multiple_creation_deletion_of_nodes)
  • adding a second data source from the application to get info when computational/dynamic services started/stopped.
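
The "start from the last updated row" step could look roughly like the sketch below. It is only an illustration of the idea, not planned code: `scrape_windows` and `MAX_WINDOW` are hypothetical names, and the 30-minute chunk size is an assumption borrowed from the query range above.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Assumed chunk size, matching the 30m range of the PromQL query.
MAX_WINDOW = timedelta(minutes=30)


def scrape_windows(
    last_updated: Optional[datetime], now: datetime
) -> List[Tuple[datetime, datetime]]:
    """Split the gap between the last stored row and `now` into scrape windows."""
    # With an empty table, fall back to a single window ending now.
    start = last_updated if last_updated is not None else now - MAX_WINDOW
    windows = []
    while start < now:
        end = min(start + MAX_WINDOW, now)
        windows.append((start, end))
        start = end
    return windows
```

After an outage of the tracker, the same function would yield several consecutive windows, so the task can catch up chunk by chunk instead of issuing one large query.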

Notes & Discussions:

  • as PromQL doesn't support pagination when querying, we need to come up with a list of regex expressions that covers the containers we would like to monitor (each fetching a reasonable amount of data), and we will run a separate background task for each item in the list
    • OPEN DISCUSSION (with @mrnicegyu11): should we run all background tasks in one container, or should each background task run as a separate container in our docker swarm?
  • NOTE: potentially introduce a daily aggregated table which will aggregate the data needed for the billing service (so that we can keep the amount of data in the resource_tracker_container table under control).
  • NOTE: points of failure
    • resource tracking service is unavailable -> the scheduled task in this service can also be run backward in time (for a past time range); the only requirement is that Prometheus and its database are available
    • Prometheus outage -> there is a time range for which we have no data because Prometheus was not scraping it -> we will run at least 2 replicas of Prometheus on different nodes
    • There should be another source of data! -> We will store when services were started and stopped from the application.
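
As a rough sketch of the "one background task per regex" idea from the notes above: the template is the query from this PR, while `build_query`, `scrape_all` and the injected `fetch` coroutine are hypothetical names used only for illustration. This shows the variant where all tasks run in one process; it does not settle the open discussion.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Query from the PR description, parameterised by the image regex.
QUERY_TEMPLATE = (
    "sum without (cpu) "
    "(container_cpu_usage_seconds_total{{image=~'{image_regex}'}})[30m:1m]"
)


def build_query(image_regex: str) -> str:
    """Fill one image regex into the PromQL template."""
    return QUERY_TEMPLATE.format(image_regex=image_regex)


async def scrape_all(
    image_regexes: List[str],
    fetch: Callable[[str], Awaitable[Any]],
) -> Dict[str, Any]:
    """Run one fetch per regex concurrently; `fetch` is supplied by the caller."""
    queries = [build_query(regex) for regex in image_regexes]
    results = await asyncio.gather(*(fetch(query) for query in queries))
    return dict(zip(image_regexes, results))
```

Each regex bounds how much data a single query returns, which is the stand-in for pagination mentioned in the note.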

Related issue/s

How to test

cd services/resource-usage-tracker
make install-dev
pytest tests/unit/with_dbs/

DevOps Checklist

⚠️ 3 new ENV vars need to be added to the osparc-ops-deployment-configuration (to all deployments)

  • RESOURCE_USAGE_TRACKER_PROMETHEUS_URL
  • RESOURCE_USAGE_TRACKER_PROMETHEUS_USERNAME
  • RESOURCE_USAGE_TRACKER_PROMETHEUS_PASSWORD

for now, also using MACHINE_FQDN variable
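
For deployment checks, a minimal sketch of reading the three new variables is shown below. It is an assumption-laden illustration, not the service's settings module: `PrometheusSettings` and `load_prometheus_settings` are hypothetical names, and only the three variable names come from this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PrometheusSettings:
    # Hypothetical container for the three values listed above.
    url: str
    username: str
    password: str


def load_prometheus_settings(env: Mapping[str, str] = os.environ) -> PrometheusSettings:
    """Read the three variables and fail early, naming any that are missing."""
    prefix = "RESOURCE_USAGE_TRACKER_PROMETHEUS_"
    names = [prefix + suffix for suffix in ("URL", "USERNAME", "PASSWORD")]
    missing = [name for name in names if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return PrometheusSettings(
        url=env[prefix + "URL"],
        username=env[prefix + "USERNAME"],
        password=env[prefix + "PASSWORD"],
    )
```

Failing at startup with the full list of missing names makes an incomplete deployment configuration easier to spot than a later connection error.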

@matusdrobuliak66 matusdrobuliak66 self-assigned this Jun 18, 2023
@matusdrobuliak66 matusdrobuliak66 added this to the Watermelon milestone Jun 18, 2023
@matusdrobuliak66 matusdrobuliak66 added the a:resource-usage-tracker resource usage tracker service label Jun 18, 2023
codecov bot commented Jun 18, 2023

Codecov Report

Merging #4380 (99dab0a) into master (b537b67) will decrease coverage by 0.2%.
The diff coverage is 83.4%.

@@           Coverage Diff            @@
##           master   #4380     +/-   ##
========================================
- Coverage    86.1%   86.0%   -0.2%     
========================================
  Files         985     993      +8     
  Lines       42384   42484    +100     
  Branches     1006    1007      +1     
========================================
+ Hits        36534   36545     +11     
- Misses       5619    5708     +89     
  Partials      231     231             
Flag Coverage Δ
integrationtests 66.3% <ø> (-1.6%) ⬇️
unittests 83.7% <83.4%> (+<0.1%) ⬆️

Impacted Files Coverage Δ
.../service-library/src/servicelib/db_async_engine.py 0.0% <0.0%> (ø)
...s/catalog/src/simcore_service_catalog/db/events.py 100.0% <ø> (+9.3%) ⬆️
...ce_resource_usage_tracker/resource_tracker_core.py 92.5% <90.6%> (-7.5%) ⬇️
...mcore_postgres_database/models/resource_tracker.py 100.0% <100.0%> (ø)
...catalog/src/simcore_service_catalog/core/events.py 100.0% <100.0%> (ø)
.../src/simcore_service_resource_usage_tracker/cli.py 100.0% <100.0%> (ø)
...service_resource_usage_tracker/core/application.py 100.0% <100.0%> (ø)
...re_service_resource_usage_tracker/core/settings.py 100.0% <100.0%> (ø)
...usage_tracker/models/resource_tracker_container.py 100.0% <100.0%> (ø)
...vice_resource_usage_tracker/modules/db/__init__.py 100.0% <100.0%> (ø)
... and 5 more

... and 12 files with indirect coverage changes

@matusdrobuliak66 matusdrobuliak66 changed the title Is922 resource tracking/adding variables ✨ Is922 resource tracking/1. version of regular scraping Jun 18, 2023
@matusdrobuliak66 matusdrobuliak66 changed the title ✨ Is922 resource tracking/1. version of regular scraping ✨ (⚠️ devops) 🗃️ Is922 resource tracking/1. version of regular scraping Jun 18, 2023
@matusdrobuliak66 matusdrobuliak66 marked this pull request as ready for review June 18, 2023 15:21
Member

@mrnicegyu11 mrnicegyu11 left a comment

Supernice, looking forward to seeing it in action. Potentially consider some comments and maybe you find the time to answer some of my questions ;)

@matusdrobuliak66 matusdrobuliak66 requested a review from GitHK June 19, 2023 05:22
Member

@sanderegg sanderegg left a comment

very nice! let's check some of the comments together.

Contributor

@GitHK GitHK left a comment

Some extra comments on top of our in-person talk

Member

@pcrespov pcrespov left a comment

I already see a lot of comments. I'd rather wait for a second round. Please re-assign the review when you are done with the others. Thx!

Member

@sanderegg sanderegg left a comment

great! looking forward!

@matusdrobuliak66 matusdrobuliak66 enabled auto-merge (squash) June 22, 2023 12:07
Contributor

@GitHK GitHK left a comment

👍

sonarcloud bot commented Jun 22, 2023

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 0 Code Smells (rating A)
  • No coverage information
  • 0.0% duplication

codeclimate bot commented Jun 22, 2023

Code Climate has analyzed commit 99dab0a and detected 0 issues on this pull request.
