
Updated the code to handle divide by zero panic #74676

Closed

Conversation

hpandeycodeit

What type of PR is this?

/kind bug

What this PR does / why we need it:

This fixes a bug where the kubelet can panic (divide by zero) when the nanoSeconds value is zero.

Which issue(s) this PR fixes:

Fixes #74667

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hpandeycodeit
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

If they are not already assigned, you can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @hpandeycodeit. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2019
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2019
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 27, 2019
@adelina-t
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2019
Member

@neolit123 left a comment

/kind bug
/priority important-soon

/assign @derekwaynecarr @feiskyer

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 27, 2019
@neolit123
Member

/retest

Member

@feiskyer left a comment

/lgtm

@feiskyer
Member

@hpandeycodeit thanks for the fix.

@PatrickLang
Contributor

/sig windows

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Feb 28, 2019
@@ -545,6 +545,9 @@ func (p *criStatsProvider) getContainerUsageNanoCores(stats *runtimeapi.Containe
}

nanoSeconds := stats.Cpu.Timestamp - cached.Timestamp
if nanoSeconds == 0 {
Contributor

how would this ever occur? Does this mean two calls to the CRI stats gave metrics with the same timestamp?

Member

^ cc @adelina-t

Contributor

I've seen this on Windows nodes too, but I'm not sure yet how this could have happened.
dockershim adds the timestamp itself for CPU usage by calling time.Now(), so it should not produce the same timestamp for two calls.

Contributor

I agree with @dashpole that this is just papering over the actual bug. Given that, I'd suggest:

  • Log an error message including the content of the new stats and the cached stats for further debugging.
  • Return nil in this case to surface the problem (without crashing the kubelet, of course); see the sketch after this list.
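A self-contained sketch of that suggestion, using simplified hypothetical types (cpuSample, usageNanoCores) rather than the actual cri_stats_provider code:

package main

import (
	"fmt"
	"log"
)

// cpuSample is a simplified stand-in for the CRI CpuUsage fields used here.
type cpuSample struct {
	TimestampNano int64  // wall-clock timestamp in nanoseconds
	UsageNano     uint64 // cumulative CPU usage in nanoseconds
}

// usageNanoCores logs both samples and returns nil when the interval is not
// positive, instead of letting the integer division panic.
func usageNanoCores(cur, cached cpuSample) *uint64 {
	nanoSeconds := cur.TimestampNano - cached.TimestampNano
	if nanoSeconds <= 0 {
		log.Printf("unexpected CPU sample interval: current=%+v cached=%+v", cur, cached)
		return nil
	}
	v := (cur.UsageNano - cached.UsageNano) * 1e9 / uint64(nanoSeconds)
	return &v
}

func main() {
	s := cpuSample{TimestampNano: 1551306491604643800, UsageNano: 2625000000}
	if v := usageNanoCores(s, s); v == nil {
		fmt.Println("identical timestamps: no usage computed") // instead of a panic
	}
}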

Contributor

@ddebroy any ideas on why this could have happened?

Member

+1 for the proposal to log the timestamps. The values are populated with time.Now().UnixNano() at https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/docker_stats_windows.go#L89. That is a core Go API.

The Go docs recommend using Sub() to measure durations (https://golang.org/pkg/time/) because it uses monotonic time. The proposal at https://go.googlesource.com/proposal/+/master/design/12914-monotonic.md describes the shift from wall-clock to monotonic time and specifically mentions Sub(). VMs can be prone to wall-clock drift, so maybe it's safer to use Sub() and then convert the duration to nanoseconds?
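A small generic Go illustration of the difference being pointed out (not kubelet code): Sub() uses the monotonic reading embedded in time.Time values, while subtracting two stored UnixNano() integers is a pure wall-clock difference.

package main

import (
	"fmt"
	"time"
)

func main() {
	t1 := time.Now()
	time.Sleep(10 * time.Millisecond)
	t2 := time.Now()

	// Wall-clock difference: this is effectively what subtracting two stored
	// UnixNano() timestamps does, and it is exposed to clock adjustments/drift.
	wallNanos := t2.UnixNano() - t1.UnixNano()

	// Monotonic difference: Sub() uses the monotonic clock reading that
	// time.Now() embeds in each time.Time, so it cannot go backwards.
	monoNanos := t2.Sub(t1).Nanoseconds()

	fmt.Println("wall ns:", wallNanos, "monotonic ns:", monoNanos)
}

One caveat: the monotonic reading only exists inside a time.Time value within the same process; once the timestamp has been flattened to an int64 (as in the CRI stats message), Sub() no longer has anything monotonic to work with.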

Member

@neolit123 Mar 1, 2019

Go timers have lower resolution on Windows (not to mention I've seen wonky behavior in the past).

If the existing process relies on nanosecond precision, it has to be adapted for Windows, e.g. by introducing small delays if needed.

+1 for the proposal to log the timestamps

+1 here as well.

To illustrate what I mean above, try comparing this between Windows and Linux:
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func load() {
	for i := 0; i < rand.Intn(10); i++ {
		fmt.Println("i=", i)
	}
}

func main() {
	for i := 0; i < 10; i++ {
		t1 := time.Now()
		load()
		t2 := time.Now()
		fmt.Println("-----------> diff", (t2.Sub(t1)).Nanoseconds())
	}
}

Contributor

Yeah, I was going down the same path: if this isn't using Time.Sub(), is that the problem?

I tried to track down the accuracy of time.Now()'s monotonic clock on Windows but ended up at a dead end with runtimeNano. If anyone has a pointer to the C or ASM code for this, please share. Windows does have both monotonic and wall time down to the microsecond (link).

Member

@neolit123 Mar 2, 2019

I think it might be worth a shot to force an explicit, proc-loaded QueryPerformanceCounter call and not rely on Now() or Sub(), to see if the behavior changes:
https://github.com/ScaleFT/monotime/blob/master/clock_windows_qpf.go
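A rough sketch of that idea, calling QueryPerformanceCounter/QueryPerformanceFrequency directly from kernel32 rather than going through time.Now() (illustrative only; the linked monotime package does something similar):

//go:build windows

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

var (
	kernel32 = syscall.NewLazyDLL("kernel32.dll")
	procQPC  = kernel32.NewProc("QueryPerformanceCounter")
	procQPF  = kernel32.NewProc("QueryPerformanceFrequency")
)

// qpc returns the raw high-resolution counter value.
func qpc() int64 {
	var c int64
	procQPC.Call(uintptr(unsafe.Pointer(&c)))
	return c
}

// qpf returns the counter frequency in ticks per second.
func qpf() int64 {
	var f int64
	procQPF.Call(uintptr(unsafe.Pointer(&f)))
	return f
}

func main() {
	freq := qpf()
	t1 := qpc()
	// ... the work being measured would go here ...
	t2 := qpc()
	elapsedNanos := (t2 - t1) * 1000000000 / freq
	fmt.Println("elapsed ns:", elapsedNanos)
}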

@yujuhong yujuhong self-assigned this Feb 28, 2019
@@ -545,6 +545,9 @@ func (p *criStatsProvider) getContainerUsageNanoCores(stats *runtimeapi.Containe
}

nanoSeconds := stats.Cpu.Timestamp - cached.Timestamp
Contributor

Can we log the timestamps when nanoSeconds is 0 so that we can see what their actual values are?

Contributor

I meant log both stats.Cpu.Timestamp and cached.Timestamp

@adelina-t
Contributor

adelina-t commented Mar 1, 2019

@dashpole @yujuhong @neolit123 @michmike @ddebroy

I've actually run with logs for stats.Cpu.Timestamp and cached.Timestamp.

See kubelet logs here: https://www.dropbox.com/s/4fu5jyd021vr3xn/clogs.zip?dl=0

Log messages are at error level and begin with "###".
NOTE: there are messages in there, "Cached state before update" and "Cached state after update", that show the same value because of a logging mistake I overlooked :( The messages we care about start after "### After retrieving from cache L551".

The node from which I took these logs had only one kubelet panic. Relevant logs:

cri_stats_provider.go:552] ### After retrieving from cache L551
./custom-logs\kubelet.err-20190227T233039.766.log-1145-E0227 22:28:11.637629    3964 cri_stats_provider.go:553] ### Cached var value &CpuUsage{Timestamp:1551306491604643800,UsageCoreNanoSeconds:&UInt64Value{Value:0,},}
./custom-logs\kubelet.err-20190227T233039.766.log-1146-E0227 22:28:11.637629    3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp 1551306491604643800  1551306491604643800
./custom-logs\kubelet.err-20190227T233039.766.log:1147:E0227 22:28:11.637629    3964 cri_stats_provider.go:561] ### nanoSeconds var 0 and uint64 0

@neolit123
Member

Thanks for the logs @adelina-t.

[1] Looking at the common nanoSeconds values, there seems to be a fair amount of time between cached and current.

./custom-logs\kubelet.err-20190227T233039.766.log-1146-E0227 22:28:11.637629 3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp 1551306491604643800 1551306491604643800

The two identical timestamps above could still be the result of condensed execution, i.e. the reduced timer precision on Windows not accounting for the actual difference, but given [1] this might not be the case.

This might not happen on Linux, where something like the following can be observed:
1551306491604643892 1551306491604643871
where ...92 > ...71, so there is still enough precision to avoid a divide by zero.

I wouldn't rule out a different type of bug at play here.

@adelina-t
Contributor

As to how this may happen? Take a look at it in context. (I put some notes inline, but they won't render as bold.)

2019/02/27 22:28:11 [Telemetry] &{IsNewInstance:true CniSucceeded:false Name:CNI OSVersion:v1.0.17 ErrorMessage: Context:AzureCNI SubContext: VnetAddressSpace:[] OrchestratorDetails:0xc000096d20 OSDetails:0xc000082500 SystemDetails:0xc00004e480 InterfaceDetails:0xc000152150 BridgeDetails:<nil> Metadata:{Location:southcentralus VMName:2277k8s000 Offer:WindowsServerSemiAnnual OsType:Windows PlacementGroupID: PlatformFaultDomain:0 PlatformUpdateDomain:0 Publisher:MicrosoftWindowsServer ResourceGroupName:pull-kubernetes-build-cni-custom-log-1296ef50-3ac7-11e9-a51d-a223bd5424ad Sku:Datacenter-Core-1809-with-Containers-smalldisk SubscriptionID:7fe76de7-a6e6-491a-b482-449cec7c91fd Tags:aksEngineVersion:v0.31.1;creationSource:aksengine-2277k8s000;orchestrator:Kubernetes:1.14.0-alpha.1;poolName:windowspool1;resourceNameSuffix:22774 OSVersion:1809.0.20190214 VMID:d5f43296-e27d-4284-afae-b17cf10edd44 VMSize:Standard_D2s_v3 KernelVersion:}}
E0227 22:28:11.587668    3964 cni.go:331] Error adding gc-5524_simpletest.deployment-7d946ddf8d-2592r/0bd01fa1cfe7d4c71ac607c4c21c5d9c8066e2b3b41e8f399b15e3d0dbee3c5e to network azure-vnet/azure: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"The requested virtual machine or container operation is not valid in the current state.\"\n}": invalid character '{' after top-level value
E0227 22:28:11.587668    3964 cni_windows.go:50] error while adding to cni network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"The requested virtual machine or container operation is not valid in the current state.\"\n}": invalid character '{' after top-level value
W0227 22:28:11.587668    3964 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "simpletest.deployment-7d946ddf8d-2592r_gc-5524": netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"The requested virtual machine or container operation is not valid in the current state.\"\n}": invalid character '{' after top-level value


E0227 22:28:11.635653    3964 cri_stats_provider.go:535] ### Stats Object &ContainerStats{Attributes:&ContainerAttributes{Id:0bd01fa1cfe7d4c71ac607c4c21c5d9c8066e2b3b41e8f399b15e3d0dbee3c5e,Metadata:&ContainerMetadata{Name:nginx,Attempt:0,},Labels:map[string]string{io.kubernetes.container.name: nginx,io.kubernetes.pod.name: simpletest.deployment-7d946ddf8d-2592r,io.kubernetes.pod.namespace: gc-5524,io.kubernetes.pod.uid: de118384-3ade-11e9-8591-000d3a75ff52,},Annotations:map[string]string{io.kubernetes.container.hash: b0dc0c6e,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 0,},},Cpu:&CpuUsage{Timestamp:1551306491604643800,UsageCoreNanoSeconds:&UInt64Value{Value:0,},},Memory:&MemoryUsage{Timestamp:1551306491604643800,WorkingSetBytes:&UInt64Value{Value:0,},},WritableLayer:&FilesystemUsage{Timestamp:1551306491604643800,FsId:&FilesystemIdentifier{Mountpoint:C:\ProgramData\docker,},UsedBytes:&UInt64Value{Value:0,},InodesUsed:nil,},}
E0227 22:28:11.635653    3964 cri_stats_provider.go:536] **### Entering  getContainerUsageNanoCores.** **FIRST CALL**
E0227 22:28:11.635653    3964 cri_stats_provider.go:552] ### After retrieving from cache L551
E0227 22:28:11.635653    3964 cri_stats_provider.go:553] ### Cached var value &CpuUsage{Timestamp:1551306478430587800,UsageCoreNanoSeconds:&UInt64Value{Value:2625000000,},}
E0227 22:28:11.635653    3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp **1551306491604643800**  <- ACTUAL VALUE THAT BECOMES CACHED 1551306478430587800
E0227 22:28:11.635653    3964 cri_stats_provider.go:561] ### nanoSeconds var 13174056000 and uint64 13174056000


E0227 22:28:11.636629    3964 cri_stats_provider.go:535] ### Stats Object &ContainerStats{Attributes:&ContainerAttributes{Id:d65b191f06cdbee7fb325f678137063011d69d32dd1f6ea47f0fc5c5ed92babc,Metadata:&ContainerMetadata{Name:nginx,Attempt:0,},Labels:map[string]string{io.kubernetes.container.name: nginx,io.kubernetes.pod.name: test-cleanup-controller-fxfpd,io.kubernetes.pod.namespace: deployment-9395,io.kubernetes.pod.uid: 999a1c85-3ade-11e9-8591-000d3a75ff52,},Annotations:map[string]string{io.kubernetes.container.hash: 4d469ef7,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 0,},},Cpu:&CpuUsage{Timestamp:1551306491634628500,UsageCoreNanoSeconds:&UInt64Value{Value:0,},},Memory:&MemoryUsage{Timestamp:1551306491634628500,WorkingSetBytes:&UInt64Value{Value:0,},},WritableLayer:&FilesystemUsage{Timestamp:1551306491634628500,FsId:&FilesystemIdentifier{Mountpoint:C:\ProgramData\docker,},UsedBytes:&UInt64Value{Value:0,},InodesUsed:nil,},}
E0227 22:28:11.636629    3964 cri_stats_provider.go:536] ### Entering getContainerUsageNanoCores. **SECOND CALL**
E0227 22:28:11.636629    3964 cri_stats_provider.go:552] ### After retrieving from cache L551
E0227 22:28:11.636629    3964 cri_stats_provider.go:553] ### Cached var value &CpuUsage{Timestamp:1551306478437584400,UsageCoreNanoSeconds:&UInt64Value{Value:0,},}
E0227 22:28:11.636629    3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp 1551306491634628500  1551306478437584400
E0227 22:28:11.636629    3964 cri_stats_provider.go:561] ### nanoSeconds var 13197044100 and uint64 13197044100
E0227 22:28:11.636629    3964 cri_stats_provider.go:659] Unable to fetch container log stats for path \var\log\pods\999a1c85-3ade-11e9-8591-000d3a75ff52\nginx: The system cannot find the path specified.


E0227 22:28:11.637629    3964 cri_stats_provider.go:535] ### Stats Object &ContainerStats{Attributes:&ContainerAttributes{Id:0bd01fa1cfe7d4c71ac607c4c21c5d9c8066e2b3b41e8f399b15e3d0dbee3c5e,Metadata:&ContainerMetadata{Name:nginx,Attempt:0,},Labels:map[string]string{io.kubernetes.container.name: nginx,io.kubernetes.pod.name: simpletest.deployment-7d946ddf8d-2592r,io.kubernetes.pod.namespace: gc-5524,io.kubernetes.pod.uid: de118384-3ade-11e9-8591-000d3a75ff52,},Annotations:map[string]string{io.kubernetes.container.hash: b0dc0c6e,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 0,},},Cpu:&CpuUsage{Timestamp:1551306491604643800,UsageCoreNanoSeconds:&UInt64Value{Value:0,},},Memory:&MemoryUsage{Timestamp:1551306491604643800,WorkingSetBytes:&UInt64Value{Value:0,},},WritableLayer:&FilesystemUsage{Timestamp:1551306491604643800,FsId:&FilesystemIdentifier{Mountpoint:C:\ProgramData\docker,},UsedBytes:&UInt64Value{Value:0,},InodesUsed:nil,},}
E0227 22:28:11.637629    3964 cri_stats_provider.go:536] **### Entering getContainerUsageNanoCores.**  **THIRD CALL AND PANIC**
E0227 22:28:11.637629    3964 cri_stats_provider.go:552] ### After retrieving from cache L551
E0227 22:28:11.637629    3964 cri_stats_provider.go:553] ### Cached var value &CpuUsage{Timestamp:**1551306491604643800**,UsageCoreNanoSeconds:&UInt64Value{Value:0,},}
E0227 22:28:11.637629    3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp 1551306491604643800  ** 1551306491604643800** <- CACHED VALUE
E0227 22:28:11.637629    3964 cri_stats_provider.go:561] ### nanoSeconds var 0 and uint64 0

FIRST CALL to getContainerUsageNanoCores is made with an actual value of 1551306491604643800 and a cached value of 1551306478430587800. When it finishes, it should write 1551306491604643800 to the cache (deferred).

SECOND CALL to getContainerUsageNanoCores is made with an actual value of 1551306491634628500 and a cached value of 1551306478437584400. So it stands to reason that by this time the FIRST CALL had not yet managed to write to the cache.

THIRD CALL to getContainerUsageNanoCores is made with an actual value of 1551306491604643800 and a cached value of 1551306491604643800, hence the panic. It is clear that by this time the FIRST CALL had managed to write to the cache as usual. The problem is that the THIRD CALL's actual value is less than the SECOND CALL's actual value, which means that between the moment the stats for the third call were collected and the actual call to getContainerUsageNanoCores there was a significant delay somewhere. This conclusion is consistent with the fact that throughout the logs I see negative values for nanoSeconds:

./custom-logs\kubelet.err-20190227T233039.766.log-9-E0227 21:52:06.659021    3964 cri_stats_provider.go:560] ### stats.Cpu.Timestamp , cached.Timestamp 1551304326654025700  1551304326655073800
./custom-logs\kubelet.err-20190227T233039.766.log:10:E0227 21:52:06.659021    3964 cri_stats_provider.go:561] ### nanoSeconds var -1048100 and uint64 18446744073708503516
"docker://3204048bbc0ea4bc74b91786e171fc72e4e5bcffb1f82b73377c2239d52d8660" (ss-2_statefulset-17(acbeea56-3ad9-11e9-8591-000d3a75ff52):nginx)

@neolit123
Member

neolit123 commented Mar 1, 2019

The problem is that the THIRD CALL's actual value is less than the SECOND CALL's actual value, which means that between the moment the stats for the third call were collected and the actual call to getContainerUsageNanoCores there was a significant delay somewhere. This conclusion is consistent with the fact that throughout the logs I see negative values for nanoSeconds:

OK, the negative diff confirms there is a bug beyond the limited timer precision on Windows.
Now the question is: why are we seeing this only on Windows?

@adelina-t
Contributor

adelina-t commented Mar 1, 2019

@neolit123

I only did a grep through two of the last conformance runs on master on GCE (wget on individual kubelet logs is a pain) and have not seen this. But that doesn't mean much, as the kubelet restart was only noticed in our case because the CNI blocked (a happy coincidence for finding this bug). So it could happen on Linux and nobody would notice a kubelet restart, as it doesn't affect tests. If someone with access to those logs could pull more of them and grep through them, that would help.

Actually, what I think happens is that one of the callers of getContainerUsageNanoCores (it gets called from a bunch of places) is slower to get the stats. That's why we see such old stats in relatively recent calls to getContainerUsageNanoCores.

I think the whole design of this function is not the best, as it's built on the assumption that the timestamp in the stats will always be newer than the one in the cache, which we can see is not the case.

I'll actually give it a bunch of runs on Linux too, with logs, to see what happens.

@adelina-t
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2019
@neolit123
Member

I think the whole design of this function is not the best, as it's built on the assumption that the timestamp in the stats will always be newer than the one in the cache, which we can see is not the case.

I agree.

I'll actually give it a bunch of runs on Linux too, with logs, to see what happens.

Thanks.

Waiting on comments from @apelisse @yujuhong on the above too.

@adelina-t
Contributor

adelina-t commented Mar 1, 2019

@neolit123 Though one thing really bugs me. So what happens is this: CALLER1 gathers stats and calls getContainerUsageNanoCores with those stats. CALLER2 gathers stats at the exact same time and calls getContainerUsageNanoCores with those stats. One of these calls updates the cache, the other one comes in with the exact same value and, well, a divide by zero happens.

However: is it reasonable to think that two callers gathered stats at the exact same time? Or maybe the function gets called from two places with the exact same stats object? (I don't know how stats are collected; I need to investigate that.) Though, upon further thought, if it were the same stats object, we would have seen this more often. (A toy replay of that possibility is sketched below.)
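A deterministic toy replay of the "same stats object presented twice" possibility (hypothetical simplified types; a bare map stands in for the kubelet's real per-container cache, and real code would also need locking):

package main

import "fmt"

type sample struct{ timestampNano int64 }

// cache maps a container ID to the last sample seen for it.
var cache = map[string]sample{}

// intervalNanos returns the delta against the previously cached sample and
// records the current sample on return, mimicking the deferred cache update.
func intervalNanos(id string, cur sample) int64 {
	cached := cache[id]
	defer func() { cache[id] = cur }()
	return cur.timestampNano - cached.timestampNano
}

func main() {
	s := sample{timestampNano: 1551306491604643800}
	fmt.Println(intervalNanos("c1", s)) // first caller: nonzero interval (cache starts empty here)
	fmt.Println(intervalNanos("c1", s)) // second caller with the same sample: 0 -> would divide by zero
}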

@neolit123
Member

neolit123 commented Mar 1, 2019

However: is it reasonable to think that two callers gathered stats at the exact same time? Or maybe the function gets called from two places with the exact same stats object? (I don't know how stats are collected; I need to investigate that.) Though, upon further thought, if it were the same stats object, we would have seen this more often.

I think it's unlikely that they get the stats at the exact same time, even with the reduced precision on Windows and given the negative diff problem, so I'd say the same stats object is a possibility too.

I'm hoping we're not hitting compiler bugs and such, because that would be a bummer.

@yujuhong
Contributor

yujuhong commented Mar 2, 2019

However: is it reasonable to think that two callers gathered stats at the exact same time? Or maybe the function gets called from two places with the exact same stats object? (I don't know how stats are collected; I need to investigate that.) Though, upon further thought, if it were the same stats object, we would have seen this more often.

@adelina-t I checked the code briefly and I don't think the stats should be the same. If you print the actual CPU usage(s), we can check whether they are the same.

@neolit123 Though one thing really bugs me. So what happens is this: CALLER1 gathers stats and calls getContainerUsageNanoCores with those stats. CALLER2 gathers stats at the exact same time and calls getContainerUsageNanoCores with those stats. One of these calls updates the cache, the other one comes in with the exact same value and, well, a divide by zero happens.

Yes, I read the code again and this is not ideal, and may lead to more confusion.

Caching and calculating the CPU usage for a single caller makes sense, since that caller controls the interval. Using a global cache for multiple callers leads to inconsistent periods, or even out-of-order stats (which can be mitigated by some tricks, but those don't tackle the main issue).

I think we need a better way to handle this.

@michmike
Contributor

michmike commented Mar 2, 2019

@yujuhong are you suggesting that we need a better way to handle this for 1.14, or post 1.14? I agree we should add some more logging around the actual CPU statistics in the cached object to see what they look like.

The CPU statistics could help shed some light on whether this is a caching issue or simply a time-recording issue.

I want to ask a different question. There are two problematic scenarios:

  1. What's the impact if we return nil in the case where the timestamps are identical?
  2. If the cached timestamp is later, we can reverse the logic and that should give us usageNanoCores as well. Ultimately we are computing usage across two timestamps; which one comes first should not matter. Do you agree? (A sketch of this option follows after this list.)
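Purely to illustrate option 2 above, a small hedged sketch of an order-independent interval (hypothetical names, not the kubelet's code):

package main

import "fmt"

// absIntervalNanos treats the sampling interval as an absolute difference, so
// it does not matter which of the two timestamps is newer (option 2 above).
func absIntervalNanos(a, b int64) int64 {
	if a < b {
		a, b = b, a
	}
	return a - b
}

func main() {
	fmt.Println(absIntervalNanos(1551306491604643800, 1551306478430587800))
	fmt.Println(absIntervalNanos(1551306478430587800, 1551306491604643800)) // same result
}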

@yujuhong
Contributor

yujuhong commented Mar 5, 2019

@michmike the current logic is inherently flawed, since the callers could get CPU usage over arbitrary, inconsistent intervals (e.g., 1s, 10s, 60s).
I have an alternative fix, #74933, which still has issues, but I think it's better than simply reversing the timestamps or returning nil.

@michmike
Contributor

michmike commented Mar 5, 2019

closing this PR since #74933 addresses this issue
/close

@k8s-ci-robot
Contributor

@michmike: Closed this PR.

In response to this:

closing this PR since #74933 addresses this issue
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Successfully merging this pull request may close these issues.

Divide by zero panic in getContainerUsageNanocores