Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternate, Simplified Async Pull #137

Merged
merged 46 commits into from
Mar 4, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
941d2ee
rebased updated async implementation onto main, review "//TODO:" entries
imuni4fun Feb 5, 2024
cfc23bd
simplified RunCompletionChecker() loop exit
imuni4fun Feb 5, 2024
c98e7d6
updated logging
imuni4fun Feb 6, 2024
b9b5962
updated chart to use new flag
imuni4fun Feb 6, 2024
d65aef7
removed previous async files
imuni4fun Feb 6, 2024
b6b702d
changed to sync from async for this test which may match original intent
imuni4fun Feb 9, 2024
7a688f2
rebased updated async implementation onto main, review "//TODO:" entries
imuni4fun Feb 5, 2024
c497347
simplified RunCompletionChecker() loop exit
imuni4fun Feb 5, 2024
75b2517
updated logging
imuni4fun Feb 6, 2024
8374809
updated chart to use new flag
imuni4fun Feb 6, 2024
6ff665b
removed previous async files
imuni4fun Feb 6, 2024
fbd31d5
changed to sync from async for this test which may match original intent
imuni4fun Feb 9, 2024
c206d9d
added async puller startup log messages
imuni4fun Feb 9, 2024
fdef232
fixed omission of setting imageSvc
imuni4fun Feb 12, 2024
22b486f
updated ephemeral volume yaml to always pull. not doing this masked f…
imuni4fun Feb 12, 2024
d805aeb
added initial kind doc and Makefile targets, added ephemeral-volume-s…
imuni4fun Feb 12, 2024
7a7b0f0
updated doc
imuni4fun Feb 12, 2024
1aece15
added final steps to fully flush the ephemeral volume and verify asyn…
imuni4fun Feb 12, 2024
6974562
Merge remote-tracking branch 'fork/async-pull-jk' into async-pull-jk
imuni4fun Feb 12, 2024
a937ec6
resolved session state bug, reintroduced metrics, updated metrics test
imuni4fun Feb 14, 2024
89adc0e
fixed elapsed calc
imuni4fun Feb 14, 2024
d31d47d
added removal of pull time info after 1 minute expiration
imuni4fun Feb 14, 2024
b9cd81a
remove unintended Makefile changes
imuni4fun Feb 15, 2024
49207fa
reverted to simple-fs from ubuntu on sample volume yaml
imuni4fun Feb 15, 2024
1531ae6
Merge remote-tracking branch 'origin/main' into async-pull-jk
imuni4fun Feb 15, 2024
84109f4
fixed conflict resolution oversight
imuni4fun Feb 15, 2024
613debd
added async error return unit test
imuni4fun Feb 15, 2024
901c5d2
simplified async implementation further
imuni4fun Feb 16, 2024
74edb81
extracted changes for dev experience PR for issue #143
imuni4fun Feb 16, 2024
074db0e
operation error metric cleanup
imuni4fun Feb 16, 2024
5fad043
introduced image size metric in similar pattern to image pull time
imuni4fun Feb 16, 2024
3921e6c
added pull time and size logs, added error handling to image size che…
imuni4fun Feb 16, 2024
88f6a39
reverted Makefile change... this was extracted to another PR
imuni4fun Feb 21, 2024
62fc0d0
added integration test for async pull feature
imuni4fun Feb 21, 2024
1879256
removed missed status definitions from previous async implementation
imuni4fun Feb 22, 2024
48fcd61
replaced completionChan and loop with completionFunc
imuni4fun Feb 22, 2024
17c1c87
removed isComplete, did not make changes to replace isTimeout and oth…
imuni4fun Feb 22, 2024
8b54915
updated timeout to be more verbose
imuni4fun Feb 22, 2024
b5d6c03
fixed bug where one call to StartAsyncPuller() in node_server.go was …
imuni4fun Feb 22, 2024
b824954
implemented test organization improvements from discussion
imuni4fun Feb 23, 2024
4c6d716
captured comment updates
imuni4fun Feb 26, 2024
a75a0be
implemented change for exposing Image() on PullSession
imuni4fun Feb 28, 2024
6f13aba
added configurability of asyncPullTimeout to helm deployment yaml
imuni4fun Feb 28, 2024
f3cd52d
fixed buG resulting from uniform extraction of image string from dock…
imuni4fun Feb 28, 2024
6c28744
addressed two comments, replaced use of docker.Named.String() with Pu…
imuni4fun Feb 29, 2024
2ec4f6f
Updating Chart.yaml and Makefile
mugdha-adhav Mar 4, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions charts/warm-metal-csi-driver/templates/nodeplugin.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -63,8 +63,8 @@ spec:
{{- if .Values.enableDaemonImageCredentialCache }}
- --enable-daemon-image-credential-cache
{{- end }}
{{- if .Values.enableAsyncPullMount }}
- --async-pull-mount=true
{{- if .Values.enableAsyncPull }}
- --async-pull-timeout=10m
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass this timeout value from helm values?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with @imuni4fun, we decided that we want to keep the enableAsyncPull and adding default timeout value to be passed in from values.yaml. Something like --async-pull-timeout={{ .Values.asyncPullTimeout }}

{{- end }}
- "-v={{ .Values.logLevel }}"
- "--mode=node"
Expand Down
2 changes: 1 addition & 1 deletion charts/warm-metal-csi-driver/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ kubeletRoot: /var/lib/kubelet
snapshotRoot: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
logLevel: 4
enableDaemonImageCredentialCache:
enableAsyncPullMount: false
enableAsyncPull: false
pullImageSecretForDaemonset:

csiPlugin:
Expand Down
6 changes: 3 additions & 3 deletions cmd/plugin/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -51,8 +51,8 @@ var (
enableCache = flag.Bool("enable-daemon-image-credential-cache", true,
"Whether to save contents of imagepullsecrets of the daemon ServiceAccount in memory. "+
"If set to false, secrets will be fetched from the API server on every image pull.")
asyncImagePullMount = flag.Bool("async-pull-mount", false,
"Whether to pull images asynchronously (helps prevent timeout for larger images)")
asyncImagePullTimeout = flag.Duration("async-pull-timeout", 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"If positive, specifies duration allotted for async image pulls as measured from pull start time. If zero, negative, less than 30s, or omitted, the caller's timeout (usually kubelet: 2m) is used instead of this value. (additional time helps prevent timeout for larger images or slower image pull conditions)")
watcherResyncPeriod = flag.Duration("watcher-resync-period", 30*time.Minute, "The resync period of the pvc watcher.")
mode = flag.String("mode", "", "The mode of the driver. Valid values are: node, controller")
nodePluginSA = flag.String("node-plugin-sa", "csi-image-warm-metal", "The name of the ServiceAccount used by the node plugin.")
Expand Down Expand Up @@ -129,7 +129,7 @@ func main() {
server.Start(*endpoint,
NewIdentityServer(driverVersion),
nil,
NewNodeServer(driver, mounter, criClient, secretStore, *asyncImagePullMount))
NewNodeServer(driver, mounter, criClient, secretStore, *asyncImagePullTimeout))
case controllerMode:
watcher, err := watcher.New(context.Background(), *watcherResyncPeriod)
if err != nil {
Expand Down
128 changes: 58 additions & 70 deletions cmd/plugin/node_server.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,14 @@ import (
"context"
"os"
"strings"
"time"

"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/containerd/containerd/reference/docker"
"github.com/google/uuid"
"github.com/warm-metal/container-image-csi-driver/pkg/backend"
"github.com/warm-metal/container-image-csi-driver/pkg/metrics"
"github.com/warm-metal/container-image-csi-driver/pkg/mountexecutor"
"github.com/warm-metal/container-image-csi-driver/pkg/mountstatus"
"github.com/warm-metal/container-image-csi-driver/pkg/pullexecutor"
"github.com/warm-metal/container-image-csi-driver/pkg/remoteimage"
"github.com/warm-metal/container-image-csi-driver/pkg/remoteimageasync"
"github.com/warm-metal/container-image-csi-driver/pkg/secret"
csicommon "github.com/warm-metal/csi-drivers/pkg/csi-common"
"google.golang.org/grpc/codes"
Expand All @@ -31,36 +30,33 @@ const (

type ImagePullStatus int

func NewNodeServer(driver *csicommon.CSIDriver, mounter *backend.SnapshotMounter, imageSvc cri.ImageServiceClient, secretStore secret.Store, asyncImagePullMount bool) *NodeServer {
return &NodeServer{
DefaultNodeServer: csicommon.NewDefaultNodeServer(driver),
mounter: mounter,
secretStore: secretStore,
asyncImagePullMount: asyncImagePullMount,
mountExecutor: mountexecutor.NewMountExecutor(&mountexecutor.MountExecutorOptions{
AsyncMount: asyncImagePullMount,
Mounter: mounter,
}),
pullExecutor: pullexecutor.NewPullExecutor(&pullexecutor.PullExecutorOptions{
AsyncPull: asyncImagePullMount,
ImageServiceClient: imageSvc,
SecretStore: secretStore,
Mounter: mounter,
}),
func NewNodeServer(driver *csicommon.CSIDriver, mounter *backend.SnapshotMounter, imageSvc cri.ImageServiceClient, secretStore secret.Store, asyncImagePullTimeout time.Duration) *NodeServer {
ns := NodeServer{
DefaultNodeServer: csicommon.NewDefaultNodeServer(driver),
mounter: mounter,
secretStore: secretStore,
asyncImagePullTimeout: asyncImagePullTimeout,
asyncImagePuller: nil,
}
if asyncImagePullTimeout >= time.Duration(30*time.Second) {
ns.asyncImagePuller = remoteimageasync.StartAsyncPuller(context.TODO(), 100, 20)
} else {
ns.asyncImagePullTimeout = 0 // set to default value
mugdha-adhav marked this conversation as resolved.
Show resolved Hide resolved
}
return &ns
}

type NodeServer struct {
*csicommon.DefaultNodeServer
mounter *backend.SnapshotMounter
secretStore secret.Store
asyncImagePullMount bool
mountExecutor *mountexecutor.MountExecutor
pullExecutor *pullexecutor.PullExecutor
mounter *backend.SnapshotMounter
imageSvc cri.ImageServiceClient
secretStore secret.Store
asyncImagePullTimeout time.Duration
asyncImagePuller remoteimageasync.AsyncPuller
}

func (n NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (resp *csi.NodePublishVolumeResponse, err error) {
valuesLogger := klog.LoggerWithValues(klog.NewKlogr(), "pod-name", req.VolumeContext["pod-name"], "namespace", req.VolumeContext["namespace"], "uid", req.VolumeContext["uid"], "request-id", uuid.NewString())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing the request-id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only place that it was used after this PR. it didn't relate to anything else and wasn't stored for association with future log messages, etc. didn't add any value in current form.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added so that we could track the log messages belonging to a particular NodePublishVolume request.

The NodePublishVolume function call is supposed to be idempotent and the CO (container orchestrator) may call it several times. From CSI spec docs -

If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call NodePublishVolume again

valuesLogger := klog.LoggerWithValues(klog.NewKlogr(), "pod-name", req.VolumeContext["pod-name"], "namespace", req.VolumeContext["namespace"], "uid", req.VolumeContext["uid"])
valuesLogger.Info("Incoming NodePublishVolume request", "request string", req.String())
if len(req.VolumeId) == 0 {
err = status.Error(codes.InvalidArgument, "VolumeId is missing")
Expand Down Expand Up @@ -122,54 +118,51 @@ func (n NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishV
image = req.VolumeContext[ctxKeyImage]
}

namedRef, err := docker.ParseDockerRef(image)
if err != nil {
klog.Errorf("unable to normalize image %q: %s", image, err)
return
}

pullAlways := strings.ToLower(req.VolumeContext[ctxKeyPullAlways]) == "true"

po := &pullexecutor.PullOptions{
Context: ctx,
NamedRef: namedRef,
PullAlways: pullAlways,
Image: image,
PullSecrets: req.Secrets,
Logger: valuesLogger,
}

if e := n.pullExecutor.StartPulling(po); e != nil {
err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, e)
keyring, err := n.secretStore.GetDockerKeyring(ctx, req.Secrets)
if err != nil {
err = status.Errorf(codes.Aborted, "unable to fetch keyring: %s", err)
mugdha-adhav marked this conversation as resolved.
Show resolved Hide resolved
return
}

if e := n.pullExecutor.WaitForPull(po); e != nil {
err = status.Errorf(codes.DeadlineExceeded, e.Error())
namedRef, err := docker.ParseDockerRef(image)
if err != nil {
klog.Errorf("unable to normalize image %q: %s", image, err)
return
}

if mountstatus.Get(req.VolumeId) == mountstatus.Mounted {
return &csi.NodePublishVolumeResponse{}, nil
}

o := &mountexecutor.MountOptions{
Context: ctx,
NamedRef: namedRef,
VolumeId: req.VolumeId,
TargetPath: req.TargetPath,
VolumeCapability: req.VolumeCapability,
ReadOnly: req.Readonly,
Logger: valuesLogger,
}

if e := n.mountExecutor.StartMounting(o); e != nil {
err = status.Error(codes.Internal, e.Error())
return
//NOTE: we are relying on n.mounter.ImageExists() to return false when
// a first-time pull is in progress, else this logic may not be
// correct. should test this.
if pullAlways || !n.mounter.ImageExists(ctx, namedRef) {
klog.Errorf("pull image %q", image)
puller := remoteimage.NewPuller(n.imageSvc, namedRef, keyring)

if n.asyncImagePuller != nil {
var session remoteimageasync.PullSession
session, err = n.asyncImagePuller.StartPull(image, puller, n.asyncImagePullTimeout)
if err != nil {
err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
return
}
if err = n.asyncImagePuller.WaitForPull(session, ctx); err != nil {
err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
return
}
} else {
if err = puller.Pull(ctx); err != nil {
err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
return
}
}
}

if e := n.mountExecutor.WaitForMount(o); e != nil {
err = status.Errorf(codes.DeadlineExceeded, e.Error())
ro := req.Readonly ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY
if err = n.mounter.Mount(ctx, req.VolumeId, backend.MountTarget(req.TargetPath), namedRef, ro); err != nil {
err = status.Error(codes.Internal, err.Error())
return
}

Expand All @@ -194,17 +187,12 @@ func (n NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpubl
}

if err = n.mounter.Unmount(ctx, req.VolumeId, backend.MountTarget(req.TargetPath)); err != nil {
// TODO(vadasambar): move this to mountexecutor once mountexecutor has `StartUnmounting` function
//TODO: evaluate this metric in the absence of previous async implementation. this update makes mounting synchronous, not async (simpler).
metrics.OperationErrorsCount.WithLabelValues("StartUnmounting").Inc()
err = status.Error(codes.Internal, err.Error())
return
}

// Clear the mountstatus since the volume has been unmounted
// Not doing this will make mount not work properly if the same volume is
// attempted to mount twice
mountstatus.Delete(req.VolumeId)

return &csi.NodeUnpublishVolumeResponse{}, nil
}

Expand Down
6 changes: 3 additions & 3 deletions cmd/plugin/node_server_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ func TestNodePublishVolumeAsync(t *testing.T) {
driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
assert.NotNil(t, driver)

asyncImagePulls := true
asyncImagePulls := 15 * time.Minute //TODO: determine intended value for this in the context of this test
ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)

// based on kubelet's csi mounter plugin code
Expand Down Expand Up @@ -168,7 +168,7 @@ func TestNodePublishVolumeSync(t *testing.T) {
driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
assert.NotNil(t, driver)

asyncImagePulls := false
asyncImagePulls := 15 * time.Minute //TODO: determine intended value for this in the context of this test
ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)

// based on kubelet's csi mounter plugin code
Expand Down Expand Up @@ -297,7 +297,7 @@ func TestMetrics(t *testing.T) {
driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
assert.NotNil(t, driver)

asyncImagePulls := true
asyncImagePulls := 15 * time.Minute //TODO: determine intended value for this in the context of this test
ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)

// based on kubelet's csi mounter plugin code
Expand Down
2 changes: 1 addition & 1 deletion go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ require (
github.com/pkg/errors v0.9.1
github.com/prometheus/client_golang v1.12.1
github.com/spf13/pflag v1.0.5
github.com/stretchr/testify v1.8.0
github.com/stretchr/testify v1.8.4
github.com/warm-metal/csi-drivers v0.5.0-alpha.0.0.20210404173852-9ec9cb097dd2
golang.org/x/net v0.0.0-20221004154528-8021a29435af
google.golang.org/grpc v1.50.0
Expand Down
3 changes: 2 additions & 1 deletion go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -826,8 +826,9 @@ github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5
github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0 h1:pSgiaMZlXftHpm5L7V1+rVB+AZJydKsMxsQBIJw4PKk=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
github.com/subosito/gotenv v1.2.0/go.mod h1:N0PQaV/YGNqwC0u51sEeR/aUtSLEXKX9iv69rRypqCw=
github.com/syndtr/gocapability v0.0.0-20170704070218-db04d3cc01c8/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
github.com/syndtr/gocapability v0.0.0-20180916011248-d98352740cb2/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
Expand Down
Loading
Loading