add disk io metric #1450

andyxning · 2017-01-09T13:04:27Z

Address #690.

This PR will add disk io metrics to heapster. It will add these metrics:

disk/io_read_bytes: Number of bytes read from a disk partition
disk/io_write_bytes: Number of bytes written to a disk partition
disk/io_read_bytes_rate: Number of bytes read from a disk partition per second
disk/io_write_bytes_rate: Number of bytes written to a disk partition per second

All of these metrics carry a resource_id label whose value is in major:minor format.
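The two *_rate metrics are described as per-second versions of the cumulative byte counters. Below is a minimal sketch of that relationship, assuming two samples of a cumulative counter keyed by a major:minor resource_id. The function names and the decision to skip a device whose counter goes backwards are illustrative assumptions, not Heapster's actual implementation.

```python
from typing import Dict, Optional


def bytes_rate(prev_bytes: int, curr_bytes: int, interval_seconds: float) -> Optional[float]:
    """Return bytes per second between two cumulative samples, or None if undefined."""
    if interval_seconds <= 0:
        return None
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The counter went backwards (for example after a reset); a rate
        # derived from this pair of samples would be meaningless.
        return None
    return delta / interval_seconds


def rates_by_device(
    prev: Dict[str, int], curr: Dict[str, int], interval_seconds: float
) -> Dict[str, float]:
    """Compute a rate per resource_id (hypothetical "major:minor" keys)."""
    rates: Dict[str, float] = {}
    for resource_id, curr_bytes in curr.items():
        if resource_id not in prev:
            # No earlier sample for this device, so no rate yet.
            continue
        rate = bytes_rate(prev[resource_id], curr_bytes, interval_seconds)
        if rate is not None:
            rates[resource_id] = rate
    return rates
```

For example, calling rates_by_device with two disk/io_read_bytes samples taken 60 seconds apart would yield values in the unit of disk/io_read_bytes_rate.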

FYI: kernel blkio cgroup

andyxning · 2017-01-09T15:21:58Z

@piosz @DirectXMan12 @mwielgus PTAL.

piosz · 2017-01-13T08:48:50Z

@timstclair @vishh can one of you take a look? Does it make sense to have those metrics?

vishh · 2017-01-13T20:14:13Z

In on-prem clusters, disk is often a contended resource. We do not provide any isolation for disk yet. The best we can do is to expose usage metrics. Disk IO is a common issue, and exposing those metrics to users will enable them to identify offending pods and take action. So I'd recommend adding metrics for "space", "inodes" and "IO" for disk.

andyxning · 2017-02-21T09:31:21Z

Sorry for the delay.

@vishh

> So I'd recommend adding metrics for "space", "inodes" and "IO" for disk

There already exist filesystem/usage, filesystem/limit and filesystem/available metrics for space.
As for inodes, the blkio cgroup seems to have no information about them. I have found that inode information is available in cAdvisor FsStats, and I will add it in another PR (add filesystem inode metrics #1542).
As for IO, the blkio cgroup supports setting a throttling/upper-limit policy for IO. IO is what this PR covers.

googlebot · 2017-02-21T09:37:29Z

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that they're okay with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

googlebot · 2017-02-22T10:20:36Z

CLAs look good, thanks!

andyxning · 2017-02-22T12:07:31Z

Comments addressed. Friendly ping @timstclair @vishh

andyxning · 2017-03-07T14:01:28Z

Friendly ping again. :) @timstclair @vishh

piosz · 2017-03-16T20:38:50Z

Removing from the v1.3 milestone.

dnavre · 2017-07-09T17:06:13Z

This PR will be very useful especially if one runs database shards in kubernetes. Please do merge it :)

weikinhuang · 2017-09-26T22:34:04Z

Have there been any updates on this? When running a self-hosted cluster with spindle disks, this is very useful information to have.

DirectXMan12 · 2017-10-03T15:44:34Z

It needs to be rebased before it can be merged.

jingxu97 · 2017-10-27T00:23:53Z

@andyxning, will you pick up this PR again?

andyxning · 2017-10-27T02:11:58Z

@jingxu97 Yes. We need this as well. I will rebase and resolve the conflicts next week.

BTW, are there any thoughts or suggestions on the implementation?

piosz · 2017-12-07T10:26:38Z

@andyxning apologies for the delay. I'm fine with this PR. As @DirectXMan12 wrote, please keep in mind that the /stats endpoint is deprecated in Kubelet and will be removed from there at some point, especially since Heapster consumes much more resources when using it.

At the same time, to some degree it is not deprecated in Heapster itself: we just need to tune Heapster a bit to work with standalone cAdvisor.

piosz · 2017-12-07T10:26:42Z

/lgtm

andyxning · 2017-12-07T15:17:30Z

Thanks @DirectXMan12 @piosz .

Yes, this becomes a real concern as the cluster gets larger. With the original cAdvisor stats endpoint, more and more redundant data needs to be scraped as the cluster grows, which is a huge overhead.

Btw, the summary API needs to be enhanced to meet our basic metrics requirements. :) I will do some work on that. :)

weikinhuang · 2017-12-13T17:42:36Z

How do I access this in v1.5.0 within influxdb/grafana? When I try to add a new dashboard graph, that field seems to be missing.

andyxning · 2017-12-14T02:26:40Z

@weikinhuang

disk/io_read_bytes_rate: Number of bytes read from a disk partition per second
disk/io_write_bytes_rate: Number of bytes written to a disk partition per second

These need the legacy source for kubelet instead of the summary one.

weikinhuang · 2017-12-14T02:27:40Z

I don't see them in the recorded metrics

andyxning · 2017-12-14T16:06:26Z

@weikinhuang What is your configuration for --source?

weikinhuang · 2017-12-14T16:11:00Z

@andyxning Here is my deployment spec for heapster (mostly from the kubernetes/addons dir)

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: heapster-v1.5.0
  namespace: kube-system
  labels:
    k8s-app: heapster
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.5.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: heapster
      version: v1.5.0
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v1.5.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
        - image: gcr.io/google_containers/heapster-amd64:v1.5.0
          name: heapster
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8082
              scheme: HTTP
            initialDelaySeconds: 180
            timeoutSeconds: 5
          command:
            - /heapster
            - --source=kubernetes.summary_api:''
            - --sink=influxdb:http://monitoring-influxdb:8086
        - image: gcr.io/google_containers/heapster-amd64:v1.5.0
          name: eventer
          command:
            - /eventer
            - --source=kubernetes:''
            - --sink=influxdb:http://monitoring-influxdb:8086
        - image: gcr.io/google_containers/addon-resizer:1.7
          name: heapster-nanny
          resources:
            limits:
              cpu: 50m
              memory: 90Mi
            requests:
              cpu: 50m
              memory: 90Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /pod_nanny
            - --cpu=80m
            - --extra-cpu=0.5m
            - --memory=140Mi
            - --extra-memory=4Mi
            - --deployment=heapster-v1.5.0
            - --container=heapster
            - --poll-period=300000
        - image: gcr.io/google_containers/addon-resizer:1.7
          name: eventer-nanny
          resources:
            limits:
              cpu: 50m
              memory: 200Mi
            requests:
              cpu: 50m
              memory: 200Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /pod_nanny
            - --cpu=100m
            - --extra-cpu=0m
            - --memory=190Mi
            - --extra-memory=500Ki
            - --deployment=heapster-v1.5.0
            - --container=eventer
            - --poll-period=300000
      serviceAccountName: heapster
      tolerations:
        - key: "CriticalAddonsOnly"
          operator: "Exists"

weikinhuang · 2017-12-14T17:47:48Z

Ah. It looks like I'm using the summary API, which does not have these metrics. Is there any issue or PR I can subscribe to for disk metrics in the summary API? Thanks @andyxning

andyxning · 2017-12-15T00:38:45Z

@weikinhuang As of now there is no PR for adding disk io metrics to the summary API. I am planning to add this next week. If you can help and make a PR for this, I will be happy to review it. :)

weikinhuang · 2017-12-15T00:40:39Z

I would love to help, but my knowledge of golang's syntax is limited.

andyxning · 2017-12-15T02:10:11Z

@weikinhuang In that case, I suggest waiting another week or so for this feature to be added to the summary API source. I will give it a try.

weikinhuang · 2017-12-15T02:12:27Z

No problem, thanks @andyxning!

zliang-min · 2018-01-25T19:57:01Z

@andyxning just wondering, are you still working on adding disk metrics to the summary API source?

zliang-min · 2018-01-25T20:45:59Z

I use "kubernetes" source and still am not seeing any disk metrics, any idea? I'm using gcr.io/google_containers/heapster:v1.5.0 running on kubernetes 1.8.6.

andyxning · 2018-01-26T04:07:16Z

Any logs for heapster?

andyxning · 2018-01-26T04:08:09Z

> just wondering that are you still working on adding disk metrics to summary api source?

Yes, I will work on this, but I have no time available for it right now. :(
Maybe in a week.

zliang-min · 2018-01-26T17:44:41Z

@andyxning there are no error or warning messages in heapster's log.

andyxning · 2018-01-27T06:49:31Z

@Gimi Which sink backend do you use?

zliang-min · 2018-02-02T22:15:16Z

@andyxning sorry for the late response, I use statsd.

andyxning · 2018-02-07T05:54:21Z

@Gimi Just a thought: are you using the summary data source? The disk io metrics are currently only available with the legacy data source. The summary API is used when heapster is started with --source=kubernetes.summary_api:''
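A quick way to check which data source a heapster command line selects is sketched below. It relies only on the two --source values that appear in this thread (kubernetes.summary_api for the summary API and kubernetes for the legacy source); the function name and the "other" fallback are illustrative assumptions.

```python
from typing import List, Optional

SOURCE_FLAG = "--source="


def source_kind(args: List[str]) -> Optional[str]:
    """Classify the first --source flag in a heapster argument list.

    Returns "summary", "legacy", "other", or None if no --source flag is present.
    """
    for arg in args:
        if not arg.startswith(SOURCE_FLAG):
            continue
        value = arg[len(SOURCE_FLAG):]
        # The provider name is the part before the first ':' in the flag value.
        provider = value.split(":", 1)[0]
        if provider == "kubernetes.summary_api":
            return "summary"
        if provider == "kubernetes":
            return "legacy"
        return "other"
    return None
```

Per the comments in this thread, the disk io metrics would only be expected when this returns "legacy".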

andyxning · 2018-02-07T06:10:41Z

@Gimi I have tested the master branch with the statsd sink and the legacy data source, and the disk io metrics are available, like:

node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_read_bytes_rate./dev/sdb:15381.686|g
node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_read_bytes_rate./dev/sda:-1.3200837e+09|g
node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_write_bytes_rate./dev/sdb:2.1783398e+06|g
node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_write_bytes_rate./dev/sda:-5.901337e+10|g
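Two of the sample values above are negative, which is unexpected for a bytes-per-second rate; the thread does not explain why. Below is a minimal sketch for spotting such values, assuming lines in exactly the name:value|g shape shown above. The helper names are hypothetical and this is not part of heapster or any statsd client.

```python
from typing import Iterable, List, NamedTuple


class Gauge(NamedTuple):
    name: str
    value: float


def parse_gauge(line: str) -> Gauge:
    """Parse one 'name:value|g' line into a Gauge."""
    body, sep, kind = line.strip().rpartition("|")
    if sep != "|" or kind != "g":
        raise ValueError(f"not a gauge line: {line!r}")
    # Split on the last ':' so the name is kept whole even if it contains ':'.
    name, sep, raw_value = body.rpartition(":")
    if sep != ":":
        raise ValueError(f"missing value separator: {line!r}")
    return Gauge(name, float(raw_value))


def negative_rates(lines: Iterable[str]) -> List[Gauge]:
    """Return the *_bytes_rate gauges whose value is below zero."""
    gauges = (parse_gauge(line) for line in lines)
    return [g for g in gauges if "_bytes_rate" in g.name and g.value < 0]


# The four sample lines from the comment above.
SAMPLE = [
    "node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_read_bytes_rate./dev/sdb:15381.686|g",
    "node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_read_bytes_rate./dev/sda:-1.3200837e+09|g",
    "node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_write_bytes_rate./dev/sdb:2.1783398e+06|g",
    "node.192_168_0_1.batch.sandbox;beta_kubernetes_io/arch.disk/io_write_bytes_rate./dev/sda:-5.901337e+10|g",
]
```

Calling negative_rates(SAMPLE) picks out the two /dev/sda entries.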

zliang-min · 2018-02-07T17:52:14Z

Thanks @andyxning for checking. At the beginning I did use the summary API source, but I switched back to the legacy one when I found this ticket. I still did not see any disk metrics, though. I'll check again and see if I missed anything. Thank you.

googlebot added the cla: yes label Jan 9, 2017

k8s-ci-robot added the cncf-cla: yes label Jan 9, 2017

andyxning force-pushed the add_disk_io_metrics branch 2 times, most recently from aaaec54 to 9403449 Compare January 9, 2017 15:02

DirectXMan12 added the enhancement label Jan 23, 2017

DirectXMan12 assigned vishh Jan 23, 2017

googlebot added cla: no and removed cla: yes labels Feb 21, 2017

andyxning force-pushed the add_disk_io_metrics branch from c3f0a98 to 0fa1ccb Compare February 22, 2017 10:20

googlebot added cla: yes and removed cla: no labels Feb 22, 2017

DirectXMan12 added this to the v1.3 milestone Mar 9, 2017

piosz removed this from the v1.3 milestone Mar 16, 2017

This was referenced Nov 2, 2017

Expose Storage Metrics kubernetes/enhancements#363

Closed

Support disk io requests and limits kubernetes/kubernetes#54923

Open

andyxning force-pushed the add_disk_io_metrics branch from e5f1618 to 3ffb7af Compare December 6, 2017 15:16

piosz added this to the v1.5 milestone Dec 7, 2017

k8s-ci-robot added the lgtm label Dec 7, 2017

piosz merged commit 02692f8 into kubernetes-retired:master Dec 7, 2017

andyxning deleted the add_disk_io_metrics branch December 7, 2017 15:12

cblecker unassigned vishh and piosz Nov 30, 2018
