psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users.  It allows users to monitor psi metrics
growth and trigger events whenever a metric rises above a user-defined
threshold within a user-defined time window.

Time window and threshold are both expressed in usecs.  Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when the system enters a stall state for the
monitored psi metric and deactivate upon exit from the stall state.  While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.  The minimum window size is 500ms, so the
minimum monitoring interval is 50ms.  The maximum window size is 10s, with
a monitoring interval of 1s.

When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
surenbaghdasaryan authored and torvalds committed May 15, 2019
1 parent 8af0c18 commit 0e94682
Showing 5 changed files with 742 additions and 20 deletions.
107 changes: 107 additions & 0 deletions Documentation/accounting/psi.txt
@@ -63,6 +63,110 @@ as well as medium and long term trends. The total absolute stall time
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.
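
As a rough illustration of what "stall time within any window" can mean in
practice, the sketch below estimates growth over a sliding window by adding
the growth seen so far in the current window to a proportional share of the
previous window's growth. The field names mirror struct psi_window from
include/linux/psi_types.h in this commit; the struct name, the function name
and the interpolation itself are assumptions made for illustration and are
not taken from the kernel sources shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of the window fields; all values in ns (assumed). */
struct psi_window_sketch {
	uint64_t size;		/* window size */
	uint64_t start_time;	/* start of the current window */
	uint64_t start_value;	/* total stall time at window start */
	uint64_t prev_growth;	/* growth recorded for the previous window */
};

/*
 * Hypothetical helper: estimate stall growth over the last w->size ns.
 * Inputs are assumed small enough that prev_growth * size fits in 64 bits.
 */
uint64_t psi_window_growth(struct psi_window_sketch *w, uint64_t now,
			   uint64_t value)
{
	uint64_t elapsed = now - w->start_time;
	uint64_t growth = value - w->start_value;

	if (elapsed > w->size) {
		/* Window expired: start a new one and remember its growth. */
		w->start_time = now;
		w->start_value = value;
		w->prev_growth = growth;
	} else {
		/* Add the part of the previous window still inside the span. */
		growth += w->prev_growth * (w->size - elapsed) / w->size;
	}
	return growth;
}
```

A trigger would then fire when the returned growth reaches its threshold,
e.g. 100ms of growth for a 500ms window.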

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window to it. The open file descriptor should
then be used to wait for trigger events using select(), poll() or epoll().
The following format is used:

<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within a
1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.
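
Since both values are written in microseconds, it is easy to be off by a
factor of 1000 when thinking in milliseconds. The helper below is a small
sketch that builds the trigger string from millisecond inputs.
psi_format_trigger() is a hypothetical name, not a kernel or libc function,
and its rejection of a threshold larger than the window is this sketch's
own sanity check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<some|full> <stall us> <window us>" into buf.
 * Returns the string length, or -1 on bad input or a too-small buffer.
 */
int psi_format_trigger(char *buf, size_t len, int full,
		       unsigned long stall_ms, unsigned long window_ms)
{
	int n;

	/* Sketch-level sanity check: the stall cannot exceed the window. */
	if (stall_ms == 0 || stall_ms > window_ms)
		return -1;

	n = snprintf(buf, len, "%s %lu %lu", full ? "full" : "some",
		     stall_ms * 1000, window_ms * 1000);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Calling psi_format_trigger(buf, sizeof(buf), 0, 150, 1000) produces the
"some 150000 1000000" string from the example above.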

Triggers can be set on more than one psi metric, and more than one trigger
can be specified for the same psi metric. However, each trigger requires a
separate file descriptor so that it can be polled separately from the
others. A separate open() syscall should therefore be made for each
trigger, even when opening the same psi interface file.
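
One way to keep the "one open() per trigger" rule hard to violate is to wrap
both steps in a single function. This is a minimal sketch;
psi_open_trigger() is a hypothetical helper name, and the path and trigger
string are supplied by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a psi interface file and register one trigger
 * on it. Every call performs its own open(), so every trigger gets its own
 * file descriptor. Returns the descriptor, or -1 with errno set.
 */
int psi_open_trigger(const char *path, const char *trig)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;

	/* Write the terminating NUL too, like the usage example in this file. */
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

For two triggers on the same file, call the helper twice and place both
descriptors in a struct pollfd array with events set to POLLPRI.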

Monitors activate only when the system enters a stall state for the
monitored psi metric and deactivate upon exit from the stall state. While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s, therefore min
monitoring update interval is 50ms and max is 1s. Min limit is set to
prevent overly frequent polling. Max limit is chosen as a high enough number
after which monitors are most likely not needed and psi averages can be used
instead.
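
The relationship between window size and update interval is simple
arithmetic, sketched below. The limits and the factor of 10 come from the
text above; the macro and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define PSI_WINDOW_MIN_US	500000ULL	/* 500ms, from the text above */
#define PSI_WINDOW_MAX_US	10000000ULL	/* 10s, from the text above */
#define PSI_UPDATES_PER_WINDOW	10		/* 10 updates per tracking window */

/*
 * Hypothetical helper: return the monitoring update interval in
 * microseconds, or 0 if the window is outside the accepted range.
 */
uint64_t psi_update_interval_us(uint64_t window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}
```

A 1s window (1000000us) therefore gives an update interval of 100000us,
i.e. 100ms.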

When activated, psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when system is
bouncing in and out of the stall state.
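
The "stay active for at least one window" behaviour can be pictured as a
deadline that is pushed forward every time a stall is observed. The sketch
below assumes that model; the struct and function are hypothetical, with
polling_until named after the field this commit adds to struct psi_group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical monitor state; times in ns. */
struct psi_monitor_sketch {
	uint64_t window_ns;	/* tracking window size */
	uint64_t polling_until;	/* keep polling until this time */
};

/*
 * Returns true while the monitor should keep polling. A stall observed at
 * now_ns extends the deadline by one full window, so a brief exit from the
 * stall state does not immediately deactivate the monitor.
 */
bool psi_monitor_keep_polling(struct psi_monitor_sketch *m, uint64_t now_ns,
			      bool stalled)
{
	if (stalled)
		m->polling_until = now_ns + m->window_ns;
	return now_ns <= m->polling_until;
}
```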

Notifications to userspace are rate-limited to one per tracking window.
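
A minimal sketch of such rate limiting, assuming the monitor records the
time of the last event (struct psi_trigger in this commit carries a
last_event_time field for that purpose). The struct, the fired flag and the
function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-trigger rate-limit state; times in ns. */
struct psi_rate_limit {
	uint64_t window_ns;		/* tracking window size */
	uint64_t last_event_ns;		/* time of the last notification */
	bool fired;			/* whether any notification was sent */
};

/*
 * Returns true if a notification may be sent at now_ns and records it;
 * returns false if one was already sent within the current window.
 */
bool psi_should_notify(struct psi_rate_limit *rl, uint64_t now_ns)
{
	if (rl->fired && now_ns - rl->last_event_ns < rl->window_ns)
		return false;

	rl->fired = true;
	rl->last_event_ns = now_ns;
	return true;
}
```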

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Monitor memory partial stall with 1s tracking window size
 * and 150ms threshold.
 */
int main() {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
}

Cgroup2 interface
=================

@@ -71,3 +175,6 @@ mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
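
For a per-cgroup monitor only the file path changes. The sketch below builds
the path of a cgroup's pressure file; psi_cgroup_path() is a hypothetical
helper, and the cgroup directory passed to it (for example a directory under
a cgroup2 mount at /sys/fs/cgroup) is an assumption about the local setup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<cgroup_dir>/<res>.pressure", where res is
 * "cpu", "memory" or "io". Returns the length, or -1 if buf is too small.
 */
int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
		    const char *res)
{
	int n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, res);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The resulting path can be opened and written to in the same way as the
/proc/pressure/ files in the usage example above.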
8 changes: 8 additions & 0 deletions include/linux/psi.h
@@ -4,6 +4,7 @@
#include <linux/jump_label.h>
#include <linux/psi_types.h>
#include <linux/sched.h>
#include <linux/poll.h>

struct seq_file;
struct css_set;
@@ -26,6 +27,13 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);

struct psi_trigger *psi_trigger_create(struct psi_group *group,
			char *buf, size_t nbytes, enum psi_res res);
void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t);

__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
			poll_table *wait);
#endif

#else /* CONFIG_PSI */
82 changes: 80 additions & 2 deletions include/linux/psi_types.h
@@ -1,8 +1,11 @@
#ifndef _LINUX_PSI_TYPES_H
#define _LINUX_PSI_TYPES_H

#include <linux/kthread.h>
#include <linux/seqlock.h>
#include <linux/types.h>
#include <linux/kref.h>
#include <linux/wait.h>

#ifdef CONFIG_PSI

@@ -44,6 +47,12 @@ enum psi_states {
	NR_PSI_STATES = 6,
};

enum psi_aggregators {
	PSI_AVGS = 0,
	PSI_POLL,
	NR_PSI_AGGREGATORS,
};

struct psi_group_cpu {
	/* 1st cacheline updated by the scheduler */

@@ -65,7 +74,55 @@ struct psi_group_cpu {
	/* 2nd cacheline updated by the aggregator */

	/* Delta detection against the sampling buckets */
-	u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+	u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STATES]
+			____cacheline_aligned_in_smp;
};

/* PSI growth tracking window */
struct psi_window {
	/* Window size in ns */
	u64 size;

	/* Start time of the current window in ns */
	u64 start_time;

	/* Value at the start of the window */
	u64 start_value;

	/* Value growth in the previous window */
	u64 prev_growth;
};

struct psi_trigger {
	/* PSI state being monitored by the trigger */
	enum psi_states state;

	/* User-specified threshold in ns */
	u64 threshold;

	/* List node inside triggers list */
	struct list_head node;

	/* Backpointer needed during trigger destruction */
	struct psi_group *group;

	/* Wait queue for polling */
	wait_queue_head_t event_wait;

	/* Pending event flag */
	int event;

	/* Tracking window */
	struct psi_window win;

	/*
	 * Time last event was generated. Used for rate-limiting
	 * events to one per window
	 */
	u64 last_event_time;

	/* Refcounting to prevent premature destruction */
	struct kref refcount;
};

struct psi_group {
@@ -79,11 +136,32 @@ struct psi_group {
	u64 avg_total[NR_PSI_STATES - 1];
	u64 avg_last_update;
	u64 avg_next_update;

	/* Aggregator work control */
	struct delayed_work avgs_work;

	/* Total stall times and sampled pressure averages */
-	u64 total[NR_PSI_STATES - 1];
+	u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1];
	unsigned long avg[NR_PSI_STATES - 1][3];

	/* Monitor work control */
	atomic_t poll_scheduled;
	struct kthread_worker __rcu *poll_kworker;
	struct kthread_delayed_work poll_work;

	/* Protects data used by the monitor */
	struct mutex trigger_lock;

	/* Configured polling triggers */
	struct list_head triggers;
	u32 nr_triggers[NR_PSI_STATES - 1];
	u32 poll_states;
	u64 poll_min_period;

	/* Total stall times at the start of monitor activation */
	u64 polling_total[NR_PSI_STATES - 1];
	u64 polling_next_update;
	u64 polling_until;
};

#else /* CONFIG_PSI */
71 changes: 69 additions & 2 deletions kernel/cgroup/cgroup.c
@@ -3550,7 +3550,65 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
}
-#endif

static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
					  size_t nbytes, enum psi_res res)
{
	struct psi_trigger *new;
	struct cgroup *cgrp;

	cgrp = cgroup_kn_lock_live(of->kn, false);
	if (!cgrp)
		return -ENODEV;

	cgroup_get(cgrp);
	cgroup_kn_unlock(of->kn);

	new = psi_trigger_create(&cgrp->psi, buf, nbytes, res);
	if (IS_ERR(new)) {
		cgroup_put(cgrp);
		return PTR_ERR(new);
	}

	psi_trigger_replace(&of->priv, new);

	cgroup_put(cgrp);

	return nbytes;
}

static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
					  char *buf, size_t nbytes,
					  loff_t off)
{
	return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
}

static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
					  char *buf, size_t nbytes,
					  loff_t off)
{
	return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
}

static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
					  char *buf, size_t nbytes,
					  loff_t off)
{
	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}

static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
					  poll_table *pt)
{
	return psi_trigger_poll(&of->priv, of->file, pt);
}

static void cgroup_pressure_release(struct kernfs_open_file *of)
{
	psi_trigger_replace(&of->priv, NULL);
}
+#endif /* CONFIG_PSI */

static int cgroup_freeze_show(struct seq_file *seq, void *v)
{
@@ -4745,18 +4803,27 @@ static struct cftype cgroup_base_files[] = {
		.name = "io.pressure",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_io_pressure_show,
		.write = cgroup_io_pressure_write,
		.poll = cgroup_pressure_poll,
		.release = cgroup_pressure_release,
	},
	{
		.name = "memory.pressure",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_memory_pressure_show,
		.write = cgroup_memory_pressure_write,
		.poll = cgroup_pressure_poll,
		.release = cgroup_pressure_release,
	},
	{
		.name = "cpu.pressure",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_cpu_pressure_show,
		.write = cgroup_cpu_pressure_write,
		.poll = cgroup_pressure_poll,
		.release = cgroup_pressure_release,
	},
-#endif
+#endif /* CONFIG_PSI */
	{ }	/* terminate */
};
