Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Restore): Async restore operations. #5704

Merged
merged 18 commits into from
Jul 1, 2020
Merged

Conversation

martinmr
Copy link
Contributor

@martinmr martinmr commented Jun 22, 2020

Previously, restore operations blocked the HTTP response until they were completed.
This PR changes the flow so that the request returns immediately after the restore
is proposed to the cluster.

The request returns a unique ID for the restore operation that can be used to track the
status of the response.

Fixes DGRAPH-1697


This change is Reviewable

Docs Preview: Dgraph Preview

@martinmr martinmr changed the title Track restore operations. feat(Restore): Async restore operations. Jun 23, 2020
@martinmr martinmr marked this pull request as ready for review June 23, 2020 22:56
@parasssh
Copy link
Contributor

Before we review, isn't backup also blocking which could also potentially take several minutes? Maybe we are ok with blocking behavior on restore as well?

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't think it's OK for any long running operation to have a hanging request. Both from a user-friendly perspective and because cancelling the requests could lead to issues. For backups it's not as important because backups can be redone more easily than restores (and I think backups are faster than restores). The main purpose of this PR is to prevent cancelled requests from leaving the restore in an incomplete state.

Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @manishrjain, @MichaelJCompton, @pawanrawal, and @vvbalaji-dgraph)

Copy link
Contributor

@parasssh parasssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/endpoints_ee.go, line 130 at r2 (raw file):

		A short string indicating whether the operation was successful.
		"""
		code: String

With async block, success here simply means that the restore is scheduled to happen. It is not an indication that the restore is done.


graphql/admin/endpoints_ee.go, line 135 at r2 (raw file):

		Includes the error message if the operation failed.
		"""
		message: String

With async, what errors would this be?


graphql/admin/endpoints_ee.go, line 140 at r2 (raw file):

		The unique ID that can be used to query the status of the restore operation.
		"""
		restoreId: String

if its just a random string, may be a UUID int may be better.


graphql/admin/endpoints_ee.go, line 225 at r2 (raw file):

		The status of the restore operation. One of UNKNOWN, IN_PROGRESS, OK, or ERR.
		"""
		status: String!

are we allowing multiple Restores on the same alpha? If yes, then we should have a QUEUED status.


graphql/admin/restore.go, line 69 at r2 (raw file):

			Data: map[string]interface{}{m.Name(): map[string]interface{}{
				"code":      "Failure",
				"restoreId": restoreId,

maybe not have any restoreId on error. It is of no use on errors.


graphql/admin/restore_status.go, line 40 at r2 (raw file):

			Data: map[string]interface{}{q.Name(): map[string]interface{}{
				"status": "UNKNOWN",
			}},

We could pass the err in the response.


graphql/admin/restore_status.go, line 51 at r2 (raw file):

			Data: map[string]interface{}{q.Name(): map[string]interface{}{
				"status": "UNKNOWN",
			}},

same here. Bubble up the error to the client.


worker/backup_common.go, line 108 at r2 (raw file):

	sync.RWMutex
	// Status is a map of restore task ID to the Status of said task.
	status  map[string]*RestoreStatus

If we only allow one Restore at a time on the same alpha instance, then we don't need any map.


worker/backup_common.go, line 128 at r2 (raw file):

		return status
	}
	return &RestoreStatus{Status: "UNKNOWN"}

Can we define these as constants/enum some where?


worker/backup_common.go, line 140 at r2 (raw file):

	rt.counter += 1
	restoreId := fmt.Sprintf("restore-%d", rt.counter)

not sure why restoreID is a string. We could simply use the counter


worker/backup_common.go, line 174 at r2 (raw file):

		status = "ERR"
	}
	rt.status[restoreId] = &RestoreStatus{Status: status, Errors: validErrs}

at what point is the entry completely removed from the map?


worker/online_restore_ee.go, line 36 at r2 (raw file):

// ProcessRestoreRequest verifies the backup data and sends a restore proposal to each group.
func ProcessRestoreRequest(ctx context.Context, req *pb.RestoreRequest) (string, error) {
	restoreId, err := rt.Add()

I think we should assign a restoreID after all the sanity checks below are done.


worker/online_restore_ee.go, line 81 at r2 (raw file):

		}()
	}

I think this is where we should rt.Add() instead of at the start

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/endpoints_ee.go, line 130 at r2 (raw file):

Previously, parasssh wrote…

With async block, success here simply means that the restore is scheduled to happen. It is not an indication that the restore is done.

Yes, the message indicates that the restore was scheduled. I'll update the help string to match the new behavior.


graphql/admin/endpoints_ee.go, line 135 at r2 (raw file):

Previously, parasssh wrote…

With async, what errors would this be?

failing to verify the backup, for example. Only thing that is async is waiting for the proposals to be done.


graphql/admin/endpoints_ee.go, line 140 at r2 (raw file):

Previously, parasssh wrote…

if its just a random string, may be a UUID int may be better.

I changed it to an int following your comment below.


graphql/admin/endpoints_ee.go, line 225 at r2 (raw file):

Previously, parasssh wrote…

are we allowing multiple Restores on the same alpha? If yes, then we should have a QUEUED status.

I am planning to add a lock so that only one restore request per alpha is allowed. But that's in a different PR.


graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, parasssh wrote…

maybe not have any restoreId on error. It is of no use on errors.

I am still sending it so that users can track the restore in the restoreStatus endpoint and to respect the format of their query. If they request a restoreId we should give it to them if we have one.


graphql/admin/restore_status.go, line 40 at r2 (raw file):

Previously, parasssh wrote…

We could pass the err in the response.

Done.


graphql/admin/restore_status.go, line 51 at r2 (raw file):

Previously, parasssh wrote…

same here. Bubble up the error to the client.

Done.


worker/backup_common.go, line 108 at r2 (raw file):

Previously, parasssh wrote…

If we only allow one Restore at a time on the same alpha instance, then we don't need any map.

We need the map to be able to report about past restore operations.


worker/backup_common.go, line 128 at r2 (raw file):

Previously, parasssh wrote…

Can we define these as constants/enum some where?

Done. Defined these strings as constants.


worker/backup_common.go, line 140 at r2 (raw file):

Previously, parasssh wrote…

not sure why restoreID is a string. We could simply use the counter

Done.


worker/backup_common.go, line 174 at r2 (raw file):

Previously, parasssh wrote…

at what point is the entry completely removed from the map?

There's no cleanup right now. I could add it but I didn't think there will be enough restore operations that this can become an issue.


worker/online_restore_ee.go, line 36 at r2 (raw file):

Previously, parasssh wrote…

I think we should assign a restoreID after all the sanity checks below are done.

Done.


worker/online_restore_ee.go, line 81 at r2 (raw file):

Previously, parasssh wrote…

I think this is where we should rt.Add() instead of at the start

Done.

Copy link
Contributor

@parasssh parasssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 19 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

I am still sending it so that users can track the restore in the restoreStatus endpoint and to respect the format of their query. If they request a restoreId we should give it to them if we have one.

They shouldn't need to track if there is error at scheduling time (the sync part of their request). In fact, ProcessRestoreRequest() should not return restoreId and remove it from map as well on error.


graphql/admin/restore_status.go, line 49 at r3 (raw file):

		return unknownStatus(q, err)
	}
	convertedStatus := convertStatus(status)

convertStatus can return nil. We should check here for nil and bail with unknownStatus(q, err)


worker/backup_common.go, line 174 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

There's no cleanup right now. I could add it but I didn't think there will be enough restore operations that this can become an issue.

It could be a crude cleanup. Example when id counter is at x, delete entry with id = x - 50. You get the idea.


worker/backup_common.go, line 31 at r3 (raw file):

	okStatus         = "OK"
	errStatus        = "ERR"
)

to confirm, we dont have queued status because we will deny a subsequent restore() request on the same alpha if one is already in progress (which will be in a separate PR), right? And the reason, we still keep a map of id -> status is for client to be able to query their restore's status.

(Please add something to this effect in the RFC for reference)


worker/backup_common.go, line 137 at r3 (raw file):

}

func (rt *restoreTracker) Add() (int, error) {

In Add() you can check if any of the map entries has a restoreStatus = IN_PROGRESS, then deny this Add(). You will have to loop thru the map but we dont expect too many entries in the map anyways.


worker/backup_common.go, line 166 at r3 (raw file):

	}

	status := okStatus

okStatus is confusing if there is an err found in the loop below. Maybe Set it to ERR status if there is even one error after the for loop.


worker/online_restore_ee.go, line 77 at r3 (raw file):

	}

	restoreId, err := rt.Add()

This should fail when a restore is in progress.


worker/worker.go, line 61 at r3 (raw file):

func Init(ps *badger.DB) {
	pstore = ps
	rt = newRestoreTracker()

check for nil rt?

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 19 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, parasssh wrote…

They shouldn't need to track if there is error at scheduling time (the sync part of their request). In fact, ProcessRestoreRequest() should not return restoreId and remove it from map as well on error.

Done. I am not returning and added a method to delete restoreIds in this case.


graphql/admin/restore_status.go, line 49 at r3 (raw file):

Previously, parasssh wrote…

convertStatus can return nil. We should check here for nil and bail with unknownStatus(q, err)

Done. That's only when status is nil so I added a check to return an unknown status if it's nil.


worker/backup_common.go, line 174 at r2 (raw file):

Previously, parasssh wrote…

It could be a crude cleanup. Example when id counter is at x, delete entry with id = x - 50. You get the idea.

Done.


worker/backup_common.go, line 31 at r3 (raw file):

Previously, parasssh wrote…

to confirm, we dont have queued status because we will deny a subsequent restore() request on the same alpha if one is already in progress (which will be in a separate PR), right? And the reason, we still keep a map of id -> status is for client to be able to query their restore's status.

(Please add something to this effect in the RFC for reference)

Yes, that's correct. I'll update the RFC.


worker/backup_common.go, line 137 at r3 (raw file):

Previously, parasssh wrote…

In Add() you can check if any of the map entries has a restoreStatus = IN_PROGRESS, then deny this Add(). You will have to loop thru the map but we dont expect too many entries in the map anyways.

Done.


worker/backup_common.go, line 166 at r3 (raw file):

Previously, parasssh wrote…

okStatus is confusing if there is an err found in the loop below. Maybe Set it to ERR status if there is even one error after the for loop.

Done.


worker/online_restore_ee.go, line 77 at r3 (raw file):

Previously, parasssh wrote…

This should fail when a restore is in progress.

Done.


worker/worker.go, line 61 at r3 (raw file):

Previously, parasssh wrote…

check for nil rt?

newRestoreTracker never returns nil. Also, the methods in the restore tracker already handle cases where it's nil.

Copy link
Contributor

@parasssh parasssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: minor comments.

Reviewable status: 0 of 10 files reviewed, 21 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/endpoints_ee.go, line 229 at r4 (raw file):

	type RestoreStatus {
		"""
		The status of the restore operation. One of UNKNOWN, IN_PROGRESS, OK, or ERR.

Do we do all CAPS? Feel free to resolve.


graphql/admin/restore_status.go, line 72 at r4 (raw file):

func convertStatus(status *worker.RestoreStatus) *restoreStatus {
	if status == nil {
		return nil

You mentioned above you will return unknown status here. Maybe you missed it.

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 21 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)


graphql/admin/endpoints_ee.go, line 229 at r4 (raw file):

Previously, parasssh wrote…

Do we do all CAPS? Feel free to resolve.

Not really but I thought it would be useful to indicate they are kind of like enums. I don't think enums are supported in GraphQL so I did this instead.


graphql/admin/restore_status.go, line 72 at r4 (raw file):

Previously, parasssh wrote…

You mentioned above you will return unknown status here. Maybe you missed it.

I added the check in resolveRestoreStatus to check if the status is nil. If that's the case we return. so convertStatus won't be called with a nil value.

@martinmr martinmr merged commit cc60659 into master Jul 1, 2020
@martinmr martinmr deleted the martinmr/restore-track branch July 2, 2020 17:59
martinmr added a commit that referenced this pull request Jul 2, 2020
Previously, restore operations blocked the HTTP response until they were completed.
This PR changes the flow so that the request returns immediately after the restore
is proposed to the cluster.

The request returns a unique ID for the restore operation that can be used to track the
status of the response.

Fixes DGRAPH-1697
arijitAD pushed a commit that referenced this pull request Jul 14, 2020
Previously, restore operations blocked the HTTP response until they were completed.
This PR changes the flow so that the request returns immediately after the restore
is proposed to the cluster.

The request returns a unique ID for the restore operation that can be used to track the
status of the response.

Fixes DGRAPH-1697
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 18, 2020
Previously, restore operations blocked the HTTP response until they were completed.
This PR changes the flow so that the request returns immediately after the restore
is proposed to the cluster.

The request returns a unique ID for the restore operation that can be used to track the
status of the response.

Fixes DGRAPH-1697
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants