-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define itemoperations.json format and update DownloadRequest API. #5752
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
This is to support uploading async operation metadata to object storage to support progress monitoring. Signed-off-by: Scott Seago <sseago@redhat.com>
- v1.15.1
- v1.15.1-rc.1
- v1.15.0
- v1.15.0-rc.2
- v1.15.0-rc.1
- v1.14.1
- v1.14.1-rc.1
- v1.14.0
- v1.14.0-rc.3
- v1.14.0-rc.2
- v1.14.0-rc.1
- v1.13.2
- v1.13.2-rc.1
- v1.13.1
- v1.13.1-rc.1
- v1.13.0
- v1.13.0-rc.2
- v1.13.0-rc.1
- v1.12.4
- v1.12.4-rc.1
- v1.12.3
- v1.12.3-rc.1
- v1.12.2
- v1.12.2-rc.2
- v1.12.2-rc.1
- v1.12.1
- v1.12.1-rc.1
- v1.12.0
- v1.12.0-rc.2
- v1.12.0-rc.1
- v1.11.1
- v1.11.1-rc.1
- v1.11.0
- v1.11.0-rc.2
- v1.11.0-rc.1
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Define itemoperations.json format and update DownloadRequest API |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
/* | ||
Copyright the Velero contributors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package itemoperation | ||
|
||
// BackupOperation stores information about an async item operation | ||
// started by a BackupItemAction plugin (v2 or later) | ||
type BackupOperation struct { | ||
Spec BackupOperationSpec `json:"spec"` | ||
|
||
Status OperationStatus `json:"status"` | ||
} | ||
|
||
type BackupOperationSpec struct { | ||
// BackupName is the name of the Velero backup this item operation | ||
// is associated with. | ||
BackupName string `json:"backupName"` | ||
|
||
// BackupUID is the UID of the Velero backup this item operation | ||
// is associated with. | ||
BackupUID string `json:"backupUID"` | ||
|
||
// BackupItemAction is the name of the BackupItemAction plugin that started the operation | ||
BackupItemAction string `json:"backupItemAction"` | ||
|
||
// Kubernetes resource identifier for the item | ||
ResourceIdentifier string "json:resourceIdentifier" | ||
|
||
// OperationID returned by the BIA plugin | ||
OperationID string "json:operationID" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also applied to RestoreOperation How to we cope with this problem? Should we add a plugin specific section in the BackupOperationSpec? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm. That's a good question. So in most cases, the operation ID will actually correspond to a CR created and processed by an external controller. For example, a datamover plugin operating on a VolumeSnapshotContent object will create a datamover CR and the operation ID will just be an object reference to that. In the event of a plugin restart, that id is all we need, since the progress method will simply GET the datamover CR and report status from that. If it is an operation internal to the plugin, then we can't necessarily assume that the operation is even recoverable, so an internal plugin async operation probably just errors out on a crash -- calling progress in this case will result in the plugin not being able to find the operation ID, so it will report an error which will let velero know to add an error to the backup/restore and move to partial failure. I don't think adding a plugin-specific config map or something similar here will really solve the problem, and in addition, it would be challenging to find a way to return data of unknown type/size via the protobuf GRPC framework. I think this is ultimately a design consideration on the part of plugin authors -- if you want plugin-crash-resilient longrunning operations, implement them in an external controller, communicating with the plugin via CRs. Implementing async operations as a goroutine or something similar inside the plugin process will be inherently more fragile. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I totally understand your concern. On the other hand, I don't know which way is more reasonable for our async operation model. Let's map this to a case in the real life:
If we don't allow B to have any internal state and context, things have to be like this:
Is this what we want? How do we define B's role as it looks very thin? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's nothing in the approved design that would account for augmenting the api with plugin-specific data structures in response. This won't be needed for any of the currently-identified use cases (datamover, etc.). SInce it's not in the approved design and adding it to the GRPC API might be challenging, I would suggest putting anything like that off for a future v3 API update if and when we identify a use case that requires it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This indeed affects one of the two proposals in the current Velero built-in data movement design, let's discuss this in the community meeting. If we decide to drop the proposal and take the other one, I have no other concern for postposing this to v3 API, just one thing to mention:
Then going to the data mover case, we can say that, once the module that hosts the SnapshotBackup/Restore controller restarts, the existing backup/restore will fail. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, I think we have 2 questions to answer here:
Either way, if we need 1), someone will need to submit a design proposal for these additions. Since we don't even know for sure whether the internal datamover needs this, we're not in a position to implement and merge a PR based on this right now. This work can be done in follow-on PRs as modifications to the implementation of the previously-approved design. As for 2) -- when we need this, if the internal datamover needs this, then we may want to get it into v2 API -- but the difference between getting it in v2 vs. v3 API is just whether we intend to merge it for Velero 1.11 vs. waiting for 1.12. Either way, it should be done as follow-on PRs since holding up current implementation of an approved design for some additional features that haven't been fully defined yet would endanger the release schedule. Lets implement the current design as agreed while we start iterating on what we want to add, if anything. Even if we decide we need it for 1.11, we're not going to save any time by holding current PRs while we iterate over the next design phase. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree on the plan, let's see if the bult-in data movement needs it after the discussion. |
||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also applied to RestoreOperation There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It will also be used in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can I say that - the plugin restart doesn't use this json?
So the json is not used at all. Am I correct? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This idea sounds good. But let's think one more step: Another question is, who is going to update the status in the file since Velero has started to execute backup 2? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Lyndon-Li That's right. The plugin itself never needs access to this json. Rather Velero's controllers need the json content in order to know which plugins to poll for progress updates. The plugin only ever needs information from Velero as defined by the plugin API -- i.e. for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Velero will update the itemoperations.json status each time is runs through the list of outstanding operations and calls Progress on them for a given backup. So in the above example, backup1 completes the main backup process and initial round of Progress calls, updates itemoperations along with the rest of the backup (backup json, log, etc.). Then it runs through backup2. Then it reconciles backup1 again, making progress calls on the outstanding operations, and if there are any status changes, updates the itemoperations json file again via the object store API. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Let me further test this by debugging. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Lyndon-Li That makes sense. It may be worth holding off on certain json updates, keeping things in memory for performance reasons. None of that affects this PR, though -- the interface for uploading/downloading this json if/when we do want it is unchanged. We will definitely upload it the first time we update the backup (after initial processing and first pass at operation progress). This will ensure that the full list of operations is persisted so that velero can recover from a restart after this point). If we decide not to update the backup store on every intermediate reconcile, then we can implement the in-memory cache you propose above, and on any subsequent reconcile (i.e. after initial backup run), we check for in-memory progress struct first. if found, we use it, else pull it from object store. At the end of reconcile, if we're completing the backup, we upload it and remove it from the in-memory list of current backups, and if the backup is still not done, we just update in-memory store and move on. The two main consequences of the above:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sounds good, thanks. For any other details, let's discuss in the implementation PR of backup controller and download request controller. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
/* | ||
Copyright the Velero contributors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package itemoperation | ||
|
||
// RestoreOperation stores information about an async item operation | ||
// started by a RestoreItemAction plugin (v2 or later) | ||
type RestoreOperation struct { | ||
Spec RestoreOperationSpec `json:"spec"` | ||
|
||
Status OperationStatus `json:"status"` | ||
} | ||
|
||
type RestoreOperationSpec struct { | ||
// RestoreName is the name of the Velero restore this item operation | ||
// is associated with. | ||
RestoreName string `json:"restoreName"` | ||
|
||
// RestoreUID is the UID of the Velero restore this item operation | ||
// is associated with. | ||
RestoreUID string `json:"restoreUID"` | ||
|
||
// RestoreItemAction is the name of the RestoreItemAction plugin that started the operation | ||
RestoreItemAction string `json:"restoreItemAction"` | ||
|
||
// Kubernetes resource identifier for the item | ||
ResourceIdentifier string "json:resourceIdentifier" | ||
|
||
// OperationID returned by the RIA plugin | ||
OperationID string "json:operationID" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
/* | ||
Copyright the Velero contributors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package itemoperation | ||
|
||
import ( | ||
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
) | ||
|
||
// OperationPhase is the lifecycle phase of a Velero item operation | ||
type OperationPhase string | ||
|
||
type OperationStatus struct { | ||
// Phase is the current state of the item operation. | ||
Phase OperationPhase `json:"phase,omitempty"` | ||
|
||
// Error displays the reason for a failed operation | ||
Error string `json:"error,omitempty"` | ||
|
||
// Amount of operation completed (measured in OperationUnits) | ||
// i.e. number of bytes transferred for a volume | ||
NCompleted int64 `json:"nCompleted,omitempty"` | ||
|
||
// Total Amount of operation (measured in OperationUnits) | ||
// i.e. volume size in bytes | ||
NTotal int64 `json:"nTotal,omitempty"` | ||
|
||
// Units that NCompleted,NTotal are measured in | ||
// i.e. "bytes" | ||
OperationUnits int64 `json:"nTotal,omitempty"` | ||
|
||
// Started records the time the item operation was started, if known | ||
// +optional | ||
// +nullable | ||
Started *metav1.Time `json:"start,omitempty"` | ||
|
||
// Updated records the time the item operation was updated, if known. | ||
// +optional | ||
// +nullable | ||
Updated *metav1.Time `json:"updated,omitempty"` | ||
} | ||
|
||
const ( | ||
// OperationPhaseNew means the item operation has been created and started | ||
// by the plugin | ||
OperationPhaseInProgress OperationPhase = "New" | ||
|
||
// OperationPhaseCompleted means the item operation was successfully completed | ||
// and can be used for restore. | ||
OperationPhaseCompleted OperationPhase = "Completed" | ||
|
||
// OperationPhaseFailed means the item operation ended with an error. | ||
OperationPhaseFailed OperationPhase = "Failed" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also applied to RestoreOperation
Do we need to sync this itemoperations.json for backup sync? Consider below cases:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think there are two questions here:
These are separate questions -- I think we want the data because it will be displayed in
velero describe backup
to give full status information on a prior backup -- i.e. it's PartiallyFailed and one operation from this list ended in error. That being said, these json files in object storage are not actually synced to the cluster. Only the kubernetes CRs are synced to the cluster. Things like the itemoperations.json are pulled from the cluster as needed via DownloadRequests, so there's not actually a concern about syncing this data file to the cluster.But you are correct that velero should not attempt to reconcile operations on a synced backup, Backups pulled from object storage should be considered complete. Currently we don't write a backup to object storage until it's marked complete. With this new feature, we will be writing backups to object storage earlier -- when moving to WaitingForPluginOperations phase. I think the solution here is that when the implementation of the controller logic is added for operation monitoring, we update the logic in the backup sync controller and we don't sync backups that are not in a terminal state -- i.e. if we find 5 backups in the BSL, 2 with phase Completed, 2 with phase PartiallyFailed, and 1 with phase WaitingForPluginOperations, we only sync the first four. The last one won't get synced into remote clusters until it reaches a terminal state.
@shubham-pampattiwar @dymurray does the above look right to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the json is only used to provide information to
velero describe backup
, then we don't need to sync it in backup sync, we only pull it on demand and discard it when we finish.Calling BIA V2 in a backup sync scenarios will make things complicated(as the cases I listed above) and in most cases the call would fail. So personally I agree with the idea that we don't sync the backup until it reaches to a terminal state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lyndon-Li Yes. We never pull things like itemoperations or logs in sync. Backup sync only pulls down the backup json to recreate the velero backup resource -- so I think we're already on the same page here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @sseago @Lyndon-Li We will keep this discussion in mind when making the controller changes.