Add logfire.instrument_system_metrics() #373

Merged
merged 55 commits into from
Aug 22, 2024
55 commits
b713282
system_metrics.py
alexmojaki Aug 8, 2024
7400de8
use OTEL config for most metrics
alexmojaki Aug 8, 2024
c1c4e6b
tests
alexmojaki Aug 8, 2024
13d0baa
Update generated stubs
alexmojaki Aug 8, 2024
dd05380
Update generated stubs
alexmojaki Aug 8, 2024
ac2b5d7
Remove collect_system_metrics and old code
alexmojaki Aug 8, 2024
f882a10
Test config errors
alexmojaki Aug 8, 2024
4a9ade1
Update generated stubs
alexmojaki Aug 8, 2024
be7282c
Don't uninstrument automatically
alexmojaki Aug 13, 2024
6d7f158
Merge branch 'main' into alex/instrument-system-metrics
alexmojaki Aug 13, 2024
ea40267
wip
alexmojaki Aug 14, 2024
7b397d6
docs for new API
alexmojaki Aug 15, 2024
48c31bd
Remove available metrics section
alexmojaki Aug 15, 2024
68e3f28
Remove logfire prefix
alexmojaki Aug 15, 2024
7a90d6d
simpler API with smaller defaults
alexmojaki Aug 15, 2024
2ab5017
test bases
alexmojaki Aug 15, 2024
1087dc1
test_custom_system_metrics_collection
alexmojaki Aug 15, 2024
19fca49
Fix test_full_base
alexmojaki Aug 15, 2024
a3d5335
smarter simple_utilization
alexmojaki Aug 16, 2024
c9e889b
MetricName also needs to be updated
alexmojaki Aug 16, 2024
34ed55c
pyright
alexmojaki Aug 16, 2024
fb9a784
pragma
alexmojaki Aug 16, 2024
44c17c1
Update generated stubs
alexmojaki Aug 16, 2024
3ad0209
Update generated stubs
alexmojaki Aug 16, 2024
142ba70
3.8
alexmojaki Aug 16, 2024
8c78289
rename dashboard
alexmojaki Aug 16, 2024
d1dd843
Document both basic system metrics dashboards
alexmojaki Aug 16, 2024
edb3ad8
update metrics docs
alexmojaki Aug 16, 2024
86073d5
docstring
alexmojaki Aug 16, 2024
8aa1d1b
docstring
alexmojaki Aug 16, 2024
5ec79e5
comments
alexmojaki Aug 19, 2024
3fc0971
comments
alexmojaki Aug 19, 2024
2cc9e1a
uninstrument automatically
alexmojaki Aug 19, 2024
8b69741
pin griffe
alexmojaki Aug 19, 2024
9bb6b43
pin griffe
alexmojaki Aug 19, 2024
f9d86eb
pin griffe
alexmojaki Aug 19, 2024
da3e3fa
Update generated stubs
alexmojaki Aug 19, 2024
389b920
format link
alexmojaki Aug 19, 2024
e953802
format link
alexmojaki Aug 19, 2024
8ea604f
Merge branch 'main' into alex/instrument-system-metrics
alexmojaki Aug 19, 2024
99ada07
Merge branch 'alex/instrument-system-metrics' of github.com:pydantic/…
alexmojaki Aug 20, 2024
900c8e6
Merge branch 'main' of github.com:pydantic/logfire into alex/instrume…
alexmojaki Aug 20, 2024
e8b0e26
add popover explaining None value
alexmojaki Aug 20, 2024
a35d665
warn about costs
alexmojaki Aug 20, 2024
cdbcb84
Link to guide in configure param docs
alexmojaki Aug 20, 2024
86acff7
Merge branch 'main' of github.com:pydantic/logfire into alex/instrume…
alexmojaki Aug 20, 2024
d7f4902
Split into two CPU metrics
alexmojaki Aug 21, 2024
be5e9ef
document each basic metric
alexmojaki Aug 21, 2024
bae2c6d
Apply review suggestions to docs
alexmojaki Aug 21, 2024
b2144ad
Ensure process.runtime.cpu.utilization values don't start at 0
alexmojaki Aug 21, 2024
90df9ca
fix check
alexmojaki Aug 21, 2024
f0b8d86
fix range of values of process.runtime.cpu.utilization
alexmojaki Aug 22, 2024
c39b697
Merge branch 'main' of github.com:pydantic/logfire into alex/instrume…
alexmojaki Aug 22, 2024
2102b39
Update descriptions of dashboards
alexmojaki Aug 22, 2024
0ba76a7
Merge branch 'main' into alex/instrument-system-metrics
alexmojaki Aug 22, 2024
29 changes: 11 additions & 18 deletions docs/guides/onboarding_checklist/add_metrics.md
@@ -1,6 +1,16 @@
**Pydantic Logfire** can be used to collect metrics from your application and send them to a metrics backend.

Let's see how to create, and use metrics in your application.
Metrics are a great way to record numerical values where you want to see an aggregation of the data (e.g. over time),
rather than the individual values.

## System Metrics

The easiest way to start using metrics is to enable system metrics.
See the [System Metrics][system-metrics] documentation to learn more.

## Manual Metrics

Let's see how to create and use custom metrics in your application.

```py
import logfire
@@ -13,11 +23,6 @@ def send_message():
    messages_sent.add(1)
```

## Metric Types

Metrics are a great way to record number values where you want to see an aggregation of the data (e.g. over time),
rather than the individual values.

### Counter

The Counter metric is particularly useful when you want to measure the frequency or occurrence of a certain
@@ -250,18 +255,6 @@ logfire.metric_up_down_counter_callback(

You can read more about the Up-Down Counter metric in the [OpenTelemetry documentation][up-down-counter-callback-metric].

## System Metrics

By default, **Logfire** does not collect system metrics.

To enable metrics, you need just need install the `logfire[system-metrics]` extra:

{{ install_logfire(extras=['system-metrics']) }}

**Logfire** will automatically collect system metrics if the `logfire[system-metrics]` extra is installed.

To know more about which system metrics are collected, check the [System Metrics][system-metrics] documentation.

[counter-metric]: https://opentelemetry.io/docs/specs/otel/metrics/api/#counter
[histogram-metric]: https://opentelemetry.io/docs/specs/otel/metrics/api/#histogram
[up-down-counter-metric]: https://opentelemetry.io/docs/specs/otel/metrics/api/#updowncounter
16 changes: 11 additions & 5 deletions docs/guides/web_ui/dashboards.md
@@ -19,14 +19,20 @@ This dashboard offers a high-level view of your web services' well-being. It lik
* **Percent of 5XX Requests:** Percentage of requests that resulted in server errors (status codes in the 500 range).
* **Log Type Ratio**: Breakdown of the different log types generated by your web service (e.g., info, warning, error).

## System Metrics
## Basic System Metrics

This dashboard focuses on system resource utilization, potentially including:
This dashboard shows essential system resource utilization metrics. It comes in two variants:

- **Basic System Metrics (Logfire):** Uses the data exported by [`logfire.instrument_system_metrics()`](../../integrations/system_metrics.md).
- **Basic System Metrics (OpenTelemetry):** Uses data exported by any OpenTelemetry-based instrumentation following the standard semantic conventions.

Both variants include the following metrics:

* **CPU Usage:** Percentage of processing power utilized by the system.
* **Memory Usage:** Amount of memory currently in use by the system.
* **Number of Processes:** Total number of running processes on the system.
* **Swap Usage:** Amount of swap space currently in use by the system.
* **System CPU usage %:** Percentage of total available processing power utilized by the whole system, i.e. the average across all CPU cores.
* **Process CPU usage %:** CPU used by a single process, where e.g. using 2 CPU cores to full capacity would result in a value of 200%.
* **Memory Usage %:** Percentage of memory currently in use by the system.
* **Swap Usage %:** Percentage of swap space currently in use by the system.
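
To make the difference between the two CPU panels concrete, here is a minimal sketch using made-up per-core load figures. The function names and sample numbers are illustrative assumptions, not part of Logfire or psutil; they only mirror the descriptions above (system usage averaged across cores, process usage summed across cores).

```py
from __future__ import annotations

from statistics import mean


def system_cpu_percent(per_core_percent: list[float]) -> float:
    """Average load across all cores, so the result stays within 0-100."""
    return mean(per_core_percent)


def process_cpu_percent(per_core_percent: list[float]) -> float:
    """Sum of the load one process puts on each core; this can exceed 100."""
    return sum(per_core_percent)


# Hypothetical 4-core machine where a single process saturates two cores
# and nothing else is running.
load = [100.0, 100.0, 0.0, 0.0]
print(system_cpu_percent(load))   # 50.0
print(process_cpu_percent(load))  # 200.0
```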

## Custom Dashboards

104 changes: 80 additions & 24 deletions docs/integrations/system_metrics.md
@@ -1,28 +1,84 @@
By default, **Logfire** does not collect system metrics.
The [`logfire.instrument_system_metrics()`][logfire.Logfire.instrument_system_metrics] method can be used to collect system metrics with **Logfire**, such as CPU and memory usage.

To enable metrics, you need to install the `logfire[system-metrics]` extra:
## Installation

Install `logfire` with the `system-metrics` extra:

{{ install_logfire(extras=['system-metrics']) }}

### Available Metrics

Logfire collects the following system metrics:

* `system.cpu.time`: CPU time spent in different modes.
* `system.cpu.utilization`: CPU utilization in different modes.
* `system.memory.usage`: Memory usage.
* `system.memory.utilization`: Memory utilization in different modes.
* `system.swap.usage`: Swap usage.
* `system.swap.utilization`: Swap utilization
* `system.disk.io`: Disk I/O operations (read/write).
* `system.disk.operations`: Disk operations (read/write).
* `system.disk.time`: Disk time (read/write).
* `system.network.dropped.packets`: Dropped packets (transmit/receive).
* `system.network.packets`: Packets (transmit/receive).
* `system.network.errors`: Network errors (transmit/receive).
* `system.network.io`: Network I/O (transmit/receive).
* `system.network.connections`: Network connections (family/type).
* `system.thread_count`: Thread count.
* `process.runtime.memory`: Process memory usage.
* `process.runtime.cpu.time`: Process CPU time.
* `process.runtime.gc_count`: Process garbage collection count.
## Usage

```py
import logfire

logfire.configure()

logfire.instrument_system_metrics()
```

Then in your project, click on 'Dashboards' in the top bar, click 'New Dashboard', and select 'Basic System Metrics (Logfire)' from the dropdown.

## Configuration

By default, `instrument_system_metrics` collects only the metrics it needs to display the 'Basic System Metrics (Logfire)' dashboard. You can choose exactly which metrics to collect and how much data to collect about each metric. The default is equivalent to this:

```py
logfire.instrument_system_metrics({
    'process.runtime.cpu.utilization': None,  # (1)!
    'system.cpu.simple_utilization': None,  # (2)!
    'system.memory.utilization': ['available'],  # (3)!
    'system.swap.utilization': ['used'],  # (4)!
})
```

1. `process.runtime.cpu.utilization` will lead to exporting a metric that is actually named `process.runtime.cpython.cpu.utilization` or a similar name depending on the Python implementation used. The `None` value means that there are no fields to configure for this metric. The value of this metric is [`psutil.Process().cpu_percent()`](https://psutil.readthedocs.io/en/latest/#psutil.Process.cpu_percent) `/ 100`, i.e. the fraction of CPU time used by this process, where 1 means using 100% of a single CPU core. The value can be greater than 1 if the process uses multiple cores.
2. The `None` value means that there are no fields to configure for this metric. The value of this metric is [`psutil.cpu_percent()`](https://psutil.readthedocs.io/en/latest/#psutil.cpu_percent) `/ 100`, i.e. the fraction of CPU time used by the whole system, where 1 means using 100% of all CPU cores.
3. The value here is a list of 'modes' of memory. The full list can be seen in the [`psutil` documentation](https://psutil.readthedocs.io/en/latest/#psutil.virtual_memory). `available` is "the memory that can be given instantly to processes without the system going into swap. This is calculated by summing different memory metrics that vary depending on the platform. It is supposed to be used to monitor actual memory usage in a cross platform fashion." The value of the metric is a number between 0 and 1, and subtracting the value from 1 gives the fraction of memory used.
4. This is the fraction of available swap used. The value is a number between 0 and 1.
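
As a rough illustration of how these raw values relate to the percentages shown on the dashboard, the sketch below converts the fractions described in the notes above. The helper names and sample values are hypothetical; the dashboard's actual queries are not shown here.

```py
def memory_used_percent(available_fraction: float) -> float:
    """`system.memory.utilization` with mode 'available' is a fraction in [0, 1];
    subtracting it from 1 gives the fraction of memory in use."""
    return (1 - available_fraction) * 100


def cpu_percent(utilization: float) -> float:
    """CPU utilization values are fractions, where 1 means 100%.
    For the process metric the result can exceed 100 on multi-core machines."""
    return utilization * 100


print(memory_used_percent(0.25))  # 75.0
print(cpu_percent(2.0))           # 200.0
```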

To collect lots of detailed data about all available metrics, use `logfire.instrument_system_metrics(base='full')`.

!!! warning
    The amount of data collected by `base='full'` can be expensive, especially if you have many servers,
    and this is easy to forget about. If you enable this, be sure to monitor your usage and costs.

    The most expensive metrics are `system.cpu.utilization/time` which collect data for each core and each mode,
    and `system.disk.*` which collect data for each disk device. The exact number depends on the machine hardware,
    but this can result in hundreds of data points per minute from each instrumented host.

`logfire.instrument_system_metrics(base='full')` is equivalent to:

```py
logfire.instrument_system_metrics({
    'system.cpu.simple_utilization': None,
    'system.cpu.time': ['idle', 'user', 'system', 'irq', 'softirq', 'nice', 'iowait', 'steal', 'interrupt', 'dpc'],
    'system.cpu.utilization': ['idle', 'user', 'system', 'irq', 'softirq', 'nice', 'iowait', 'steal', 'interrupt', 'dpc'],
    'system.memory.usage': ['available', 'used', 'free', 'active', 'inactive', 'buffers', 'cached', 'shared', 'wired', 'slab', 'total'],
    'system.memory.utilization': ['available', 'used', 'free', 'active', 'inactive', 'buffers', 'cached', 'shared', 'wired', 'slab'],
    'system.swap.usage': ['used', 'free'],
    'system.swap.utilization': ['used'],
    'system.disk.io': ['read', 'write'],
    'system.disk.operations': ['read', 'write'],
    'system.disk.time': ['read', 'write'],
    'system.network.dropped.packets': ['transmit', 'receive'],
    'system.network.packets': ['transmit', 'receive'],
    'system.network.errors': ['transmit', 'receive'],
    'system.network.io': ['transmit', 'receive'],
    'system.thread_count': None,
    'process.runtime.memory': ['rss', 'vms'],
    'process.runtime.cpu.time': ['user', 'system'],
    'process.runtime.gc_count': None,
    'process.runtime.thread_count': None,
    'process.runtime.cpu.utilization': None,
    'process.runtime.context_switches': ['involuntary', 'voluntary'],
    'process.open_file_descriptor.count': None,
})
```

Each key here is a metric name. The values have different meanings for different metrics. For example, for `system.cpu.utilization`, the value is a list of CPU modes, so each CPU core gets one row for the percentage of time it spent idle, another row for the time spent waiting for IO, and so on. There are no fields to configure for `system.thread_count`, so the value is `None`.
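
To get a feel for the volume mentioned in the warning earlier on this page, the following back-of-the-envelope sketch counts how many separate rows the two per-core CPU metrics alone could produce in one collection. It assumes one row per core per mode for each metric; the function name and the 8-core figure are illustrative, and real counts depend on which modes the platform actually reports.

```py
from __future__ import annotations

# Mode list copied from the `base='full'` configuration above.
CPU_MODES = ['idle', 'user', 'system', 'irq', 'softirq', 'nice', 'iowait', 'steal', 'interrupt', 'dpc']


def cpu_series_count(num_cores: int, modes: list[str], num_metrics: int = 2) -> int:
    """Upper bound on rows for `system.cpu.time` plus `system.cpu.utilization`."""
    return num_cores * len(modes) * num_metrics


# Hypothetical 8-core host.
print(cpu_series_count(8, CPU_MODES))  # 160
```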

For convenient customizability, the first dict argument is merged with the base. For example, if you want to collect disk read operations (but not writes) you can write:

- `logfire.instrument_system_metrics({'system.disk.operations': ['read']})` to collect that data in addition to the basic defaults.
- `logfire.instrument_system_metrics({'system.disk.operations': ['read']}, base='full')` to collect detailed data about all metrics, excluding disk write operations.
- `logfire.instrument_system_metrics({'system.disk.operations': ['read']}, base=None)` to collect only disk read operations and nothing else.
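
The three cases above can be sketched as a plain dict merge. This is only an illustration of the described behaviour, not Logfire's implementation: `resolve_config`, `BASIC`, `FULL` and `BASES` are hypothetical names, the base dicts are heavily abbreviated, and the name `'basic'` for the default base is an assumption.

```py
from __future__ import annotations

from typing import Dict, Iterable, Optional

Config = Dict[str, Optional[Iterable[str]]]

# Abbreviated stand-ins for the real base configurations listed above.
BASIC: Config = {
    'system.memory.utilization': ['available'],
    'system.swap.utilization': ['used'],
}
FULL: Config = {
    **BASIC,
    'system.disk.operations': ['read', 'write'],
}
BASES = {'basic': BASIC, 'full': FULL, None: {}}


def resolve_config(config: Config | None = None, base: str | None = 'basic') -> Config:
    """Start from the chosen base, then let entries in `config` override it key by key."""
    return {**BASES[base], **(config or {})}


custom = {'system.disk.operations': ['read']}
print(resolve_config(custom))               # basic defaults plus disk reads
print(resolve_config(custom, base='full'))  # everything in FULL, but disk reads only
print(resolve_config(custom, base=None))    # disk reads and nothing else
```

Because the override replaces the whole value for a key, passing `['read']` on top of the full base drops `'write'` for that metric, which matches the second bullet above.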
3 changes: 3 additions & 0 deletions logfire-api/logfire_api/__init__.py
@@ -123,6 +123,8 @@ def instrument_openai(self, *args, **kwargs) -> ContextManager[None]:

def instrument_aiohttp_client(self, *args, **kwargs) -> None: ...

def instrument_system_metrics(self, *args, **kwargs) -> None: ...

def shutdown(self, *args, **kwargs) -> None: ...

DEFAULT_LOGFIRE_INSTANCE = Logfire()
@@ -158,6 +160,7 @@ def shutdown(self, *args, **kwargs) -> None: ...
instrument_redis = DEFAULT_LOGFIRE_INSTANCE.instrument_redis
instrument_pymongo = DEFAULT_LOGFIRE_INSTANCE.instrument_pymongo
instrument_mysql = DEFAULT_LOGFIRE_INSTANCE.instrument_mysql
instrument_system_metrics = DEFAULT_LOGFIRE_INSTANCE.instrument_system_metrics
shutdown = DEFAULT_LOGFIRE_INSTANCE.shutdown

def no_auto_trace(x):
3 changes: 2 additions & 1 deletion logfire-api/logfire_api/__init__.pyi
@@ -11,7 +11,7 @@ from .integrations.logging import LogfireLoggingHandler as LogfireLoggingHandler
from .integrations.structlog import LogfireProcessor as StructlogProcessor
from .version import VERSION as VERSION

__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'TailSamplingOptions']
__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'instrument_system_metrics', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'TailSamplingOptions']

DEFAULT_LOGFIRE_INSTANCE = Logfire()
span = DEFAULT_LOGFIRE_INSTANCE.span
@@ -35,6 +35,7 @@ instrument_sqlalchemy = DEFAULT_LOGFIRE_INSTANCE.instrument_sqlalchemy
instrument_redis = DEFAULT_LOGFIRE_INSTANCE.instrument_redis
instrument_pymongo = DEFAULT_LOGFIRE_INSTANCE.instrument_pymongo
instrument_mysql = DEFAULT_LOGFIRE_INSTANCE.instrument_mysql
instrument_system_metrics = DEFAULT_LOGFIRE_INSTANCE.instrument_system_metrics
shutdown = DEFAULT_LOGFIRE_INSTANCE.shutdown
with_tags = DEFAULT_LOGFIRE_INSTANCE.with_tags
with_settings = DEFAULT_LOGFIRE_INSTANCE.with_settings
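
Both stub hunks follow the same pattern: a method that accepts any arguments and does nothing, exposed at module level as a bound method of a shared default instance. A minimal standalone sketch of that pattern, using the hypothetical names `NoOpClient` and `DEFAULT_INSTANCE` rather than the real classes:

```py
class NoOpClient:
    """Hypothetical stand-in for a stub class whose methods accept anything and do nothing."""

    def instrument_system_metrics(self, *args, **kwargs) -> None: ...

    def shutdown(self, *args, **kwargs) -> None: ...


DEFAULT_INSTANCE = NoOpClient()

# Module-level names are bound methods of the shared default instance,
# so callers can write `instrument_system_metrics()` without creating a client.
instrument_system_metrics = DEFAULT_INSTANCE.instrument_system_metrics
shutdown = DEFAULT_INSTANCE.shutdown

print(instrument_system_metrics({'system.thread_count': None}, base=None))  # None
```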