-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dump and restore internal state #277
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -48,13 +48,7 @@ type FileSystems struct { | |||
Filsystem []ContainerFileSystem `json:"filesystems"` | |||
} | |||
|
|||
func getSourcePath(volumeHandle string) string { | |||
return fmt.Sprintf("%s/%s", dataRoot, volumeHandle) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a reimplementation of getVolumePath
.
pkg/hostpath/healthcheck.go
Outdated
func (hp *hostPath) checkPVCapacityValid(volumeHandle string) (bool, error) { | ||
sourcePath := getSourcePath(volumeHandle) | ||
func (hp *hostPath) checkPVCapacityValid(volID string) (bool, error) { | ||
sourcePath := hp.getVolumePath(volID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using volumeHandle
instead of volID
as in the rest of the code confused me ("what is this handle?!"), so I replaced it in the entire file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PV, the id to a physical volume on the storage backend is called VolumeHandle. In CSI spec, it is called volume_id. They are the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even when both are the same, the cognitive overhead for understanding the source code is higher when some parts use one term and other parts use the other. Everything else uses "volume ID", so this code should do so too.
} | ||
|
||
return strings.TrimSuffix(strings.TrimPrefix(volumeSP, "[/var/lib/csi-hostpath-data/"), "]") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two functions are only used by one test, so I moved them there.
655d336
to
8869aeb
Compare
This replaces the previous approach, trying to reconstruct state from observations, with a simpler dump/restore of the internal state as a JSON file in the driver's data directory. The advantage is that *all* volume and snapshot attributes get restored, not just those that can be deducted from mount points. No attempts are made to restore state properly after a node reboot.
return convertSnapshot(snapshot), nil | ||
} | ||
return &csi.ListSnapshotsResponse{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return
here was missing earlier. It only passed because the test is not very precise:
https://github.com/kubernetes-csi/csi-test/blob/a251c44fd49d9eedd55a40b71e5da4ad080ba431/pkg/sanity/controller.go#L1157-L1165
I created an issue for it:
kubernetes-csi/csi-test#335
if snapshot.VolID == req.SourceVolumeId { | ||
return convertSnapshot(snapshot), nil | ||
} | ||
} | ||
return &csi.ListSnapshotsResponse{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
@fengzixu this is what I had in mind: a minimal "state" package whose only responsibility is to store volume and snapshot structs. The overall patch is quite a bit smaller than what you currently have in #276 despite including some cleanup changes that aren't really necessary ( What do you think? /cc @msau42 @xing-yang |
Yep. The current implementation is fine for me. It should be enough for fixing the current bug. THanks. Let me close my PR. |
@xing-yang can you review? |
@@ -46,6 +46,7 @@ func main() { | |||
|
|||
flag.StringVar(&cfg.Endpoint, "endpoint", "unix://tmp/csi.sock", "CSI endpoint") | |||
flag.StringVar(&cfg.DriverName, "drivername", "hostpath.csi.k8s.io", "name of the driver") | |||
flag.StringVar(&cfg.StateDir, "statedir", "/csi-data-dir", "directory for storing state information across driver restarts, volumes and snapshots") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this configurable is useful because then it is possible again to run the driver as non-root with a state directory in /tmp.
Node operations cannot be tested easily that way (need root for mounting), but even that could be achieved with wrapper scripts that rely on sudo.
if len(req.GetSourceVolumeId()) != 0 { | ||
for _, snapshot := range hp.snapshots { | ||
for _, snapshot := range hp.state.GetSnapshots() { | ||
if snapshot.VolID == req.SourceVolumeId { | ||
return convertSnapshot(snapshot), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns as soon as one snapshot's volume id matches the request, but it is possible that there are multiple snapshots with the same source volume id. I see that this is in the original code so this can be fixed in a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, let's fix this separately. Please file a bug so that we don't forget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also we should have a csi-sanity test for it...
pkg/hostpath/healthcheck.go
Outdated
func doHealthCheckInNodeSide(volumeHandle string) (bool, string) { | ||
mpExist, err := checkMountPointExist(volumeHandle) | ||
func (hp *hostPath) doHealthCheckInNodeSide(volID string) (bool, string) { | ||
sourcePath := hp.getVolumePath(volID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sourcePath -> volumePath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change then also should percolate to other places, like checkMountPointExist
. I've pushed a commit with that change.
Please take another look.
pkg/hostpath/healthcheck.go
Outdated
func (hp *hostPath) doHealthCheckInControllerSide(volumeHandle string) (bool, string) { | ||
spExist, err := checkSourcePathExist(volumeHandle) | ||
func (hp *hostPath) doHealthCheckInControllerSide(volID string) (bool, string) { | ||
sourcePath := hp.getVolumePath(volID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sourcePath -> volumePath
This looks fine. I've asked @fengzixu to validate the volume health part. |
The "sourcePath" name is not quite right: it is a "volume path" which just happens to be used as source in some places.
I tested volume health part based on this PR, test cases as below are fine
I also tested similar cases for "source path doesn't exist", it's also fine cc @xing-yang |
/lgtm |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This replaces the previous approach, trying to reconstruct state from
observations, with a simpler dump/restore of the internal state as a
JSON file in the driver's data directory. That old code was broken (did
not distinguish between own volumes and foreign volumes) and
incomplete (not all state restored).
No attempts are made to restore state properly after a node reboot.
Which issue(s) this PR fixes:
#210 (comment)
Special notes for your reviewer:
Does this PR introduce a user-facing change?: