feat(Restore): Async restore operations. #5704

martinmr · 2020-06-22T23:40:29Z

Previously, restore operations blocked the HTTP response until they were completed.
This PR changes the flow so that the request returns immediately after the restore
is proposed to the cluster.

The request returns a unique ID for the restore operation that can be used to track the
status of the response.

Fixes DGRAPH-1697

This change is

Docs Preview:

parasssh · 2020-06-24T19:08:17Z

Before we review, isn't backup also blocking which could also potentially take several minutes? Maybe we are ok with blocking behavior on restore as well?

martinmr

Personally, I don't think it's OK for any long running operation to have a hanging request. Both from a user-friendly perspective and because cancelling the requests could lead to issues. For backups it's not as important because backups can be redone more easily than restores (and I think backups are faster than restores). The main purpose of this PR is to prevent cancelled requests from leaving the restore in an incomplete state.

Reviewable status: 0 of 10 files reviewed, all discussions resolved (waiting on @manishrjain, @MichaelJCompton, @pawanrawal, and @vvbalaji-dgraph)

parasssh

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/endpoints_ee.go, line 130 at r2 (raw file):

		A short string indicating whether the operation was successful.
		"""
		code: String

With async block, success here simply means that the restore is scheduled to happen. It is not an indication that the restore is done.

graphql/admin/endpoints_ee.go, line 135 at r2 (raw file):

		Includes the error message if the operation failed.
		"""
		message: String

With async, what errors would this be?

graphql/admin/endpoints_ee.go, line 140 at r2 (raw file):

		The unique ID that can be used to query the status of the restore operation.
		"""
		restoreId: String

if its just a random string, may be a UUID int may be better.

graphql/admin/endpoints_ee.go, line 225 at r2 (raw file):

		The status of the restore operation. One of UNKNOWN, IN_PROGRESS, OK, or ERR.
		"""
		status: String!

are we allowing multiple Restores on the same alpha? If yes, then we should have a QUEUED status.

graphql/admin/restore.go, line 69 at r2 (raw file):

			Data: map[string]interface{}{m.Name(): map[string]interface{}{
				"code":      "Failure",
				"restoreId": restoreId,

maybe not have any restoreId on error. It is of no use on errors.

graphql/admin/restore_status.go, line 40 at r2 (raw file):

			Data: map[string]interface{}{q.Name(): map[string]interface{}{
				"status": "UNKNOWN",
			}},

We could pass the err in the response.

graphql/admin/restore_status.go, line 51 at r2 (raw file):

			Data: map[string]interface{}{q.Name(): map[string]interface{}{
				"status": "UNKNOWN",
			}},

same here. Bubble up the error to the client.

worker/backup_common.go, line 108 at r2 (raw file):

	sync.RWMutex
	// Status is a map of restore task ID to the Status of said task.
	status  map[string]*RestoreStatus

If we only allow one Restore at a time on the same alpha instance, then we don't need any map.

worker/backup_common.go, line 128 at r2 (raw file):

		return status
	}
	return &RestoreStatus{Status: "UNKNOWN"}

Can we define these as constants/enum some where?

worker/backup_common.go, line 140 at r2 (raw file):

	rt.counter += 1
	restoreId := fmt.Sprintf("restore-%d", rt.counter)

not sure why restoreID is a string. We could simply use the counter

worker/backup_common.go, line 174 at r2 (raw file):

		status = "ERR"
	}
	rt.status[restoreId] = &RestoreStatus{Status: status, Errors: validErrs}

at what point is the entry completely removed from the map?

worker/online_restore_ee.go, line 36 at r2 (raw file):

// ProcessRestoreRequest verifies the backup data and sends a restore proposal to each group.
func ProcessRestoreRequest(ctx context.Context, req *pb.RestoreRequest) (string, error) {
	restoreId, err := rt.Add()

I think we should assign a restoreID after all the sanity checks below are done.

worker/online_restore_ee.go, line 81 at r2 (raw file):

		}()
	}

I think this is where we should rt.Add() instead of at the start

martinmr

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/endpoints_ee.go, line 130 at r2 (raw file):

Previously, parasssh wrote…

With async block, success here simply means that the restore is scheduled to happen. It is not an indication that the restore is done.

Yes, the message indicates that the restore was scheduled. I'll update the help string to match the new behavior.

graphql/admin/endpoints_ee.go, line 135 at r2 (raw file):

Previously, parasssh wrote…

With async, what errors would this be?

failing to verify the backup, for example. Only thing that is async is waiting for the proposals to be done.

graphql/admin/endpoints_ee.go, line 140 at r2 (raw file):

Previously, parasssh wrote…

if its just a random string, may be a UUID int may be better.

I changed it to an int following your comment below.

graphql/admin/endpoints_ee.go, line 225 at r2 (raw file):

Previously, parasssh wrote…

are we allowing multiple Restores on the same alpha? If yes, then we should have a QUEUED status.

I am planning to add a lock so that only one restore request per alpha is allowed. But that's in a different PR.

graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, parasssh wrote…

maybe not have any restoreId on error. It is of no use on errors.

I am still sending it so that users can track the restore in the restoreStatus endpoint and to respect the format of their query. If they request a restoreId we should give it to them if we have one.

graphql/admin/restore_status.go, line 40 at r2 (raw file):

Previously, parasssh wrote…

We could pass the err in the response.

Done.

graphql/admin/restore_status.go, line 51 at r2 (raw file):

Previously, parasssh wrote…

same here. Bubble up the error to the client.

Done.

worker/backup_common.go, line 108 at r2 (raw file):

Previously, parasssh wrote…

If we only allow one Restore at a time on the same alpha instance, then we don't need any map.

We need the map to be able to report about past restore operations.

worker/backup_common.go, line 128 at r2 (raw file):

Previously, parasssh wrote…

Can we define these as constants/enum some where?

Done. Defined these strings as constants.

worker/backup_common.go, line 140 at r2 (raw file):

Previously, parasssh wrote…

not sure why restoreID is a string. We could simply use the counter

Done.

worker/backup_common.go, line 174 at r2 (raw file):

Previously, parasssh wrote…

at what point is the entry completely removed from the map?

There's no cleanup right now. I could add it but I didn't think there will be enough restore operations that this can become an issue.

worker/online_restore_ee.go, line 36 at r2 (raw file):

Previously, parasssh wrote…

I think we should assign a restoreID after all the sanity checks below are done.

Done.

worker/online_restore_ee.go, line 81 at r2 (raw file):

Previously, parasssh wrote…

I think this is where we should rt.Add() instead of at the start

Done.

parasssh

Reviewable status: 0 of 10 files reviewed, 19 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

I am still sending it so that users can track the restore in the restoreStatus endpoint and to respect the format of their query. If they request a restoreId we should give it to them if we have one.

They shouldn't need to track if there is error at scheduling time (the sync part of their request). In fact, ProcessRestoreRequest() should not return restoreId and remove it from map as well on error.

graphql/admin/restore_status.go, line 49 at r3 (raw file):

		return unknownStatus(q, err)
	}
	convertedStatus := convertStatus(status)

convertStatus can return nil. We should check here for nil and bail with unknownStatus(q, err)

worker/backup_common.go, line 174 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

There's no cleanup right now. I could add it but I didn't think there will be enough restore operations that this can become an issue.

It could be a crude cleanup. Example when id counter is at x, delete entry with id = x - 50. You get the idea.

worker/backup_common.go, line 31 at r3 (raw file):

	okStatus         = "OK"
	errStatus        = "ERR"
)

to confirm, we dont have queued status because we will deny a subsequent restore() request on the same alpha if one is already in progress (which will be in a separate PR), right? And the reason, we still keep a map of id -> status is for client to be able to query their restore's status.

(Please add something to this effect in the RFC for reference)

worker/backup_common.go, line 137 at r3 (raw file):

}

func (rt *restoreTracker) Add() (int, error) {

In Add() you can check if any of the map entries has a restoreStatus = IN_PROGRESS, then deny this Add(). You will have to loop thru the map but we dont expect too many entries in the map anyways.

worker/backup_common.go, line 166 at r3 (raw file):

	}

	status := okStatus

okStatus is confusing if there is an err found in the loop below. Maybe Set it to ERR status if there is even one error after the for loop.

worker/online_restore_ee.go, line 77 at r3 (raw file):

	}

	restoreId, err := rt.Add()

This should fail when a restore is in progress.

worker/worker.go, line 61 at r3 (raw file):

func Init(ps *badger.DB) {
	pstore = ps
	rt = newRestoreTracker()

check for nil rt?

martinmr

Reviewable status: 0 of 10 files reviewed, 19 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/restore.go, line 69 at r2 (raw file):

Previously, parasssh wrote…

They shouldn't need to track if there is error at scheduling time (the sync part of their request). In fact, ProcessRestoreRequest() should not return restoreId and remove it from map as well on error.

Done. I am not returning and added a method to delete restoreIds in this case.

graphql/admin/restore_status.go, line 49 at r3 (raw file):

Previously, parasssh wrote…

convertStatus can return nil. We should check here for nil and bail with unknownStatus(q, err)

Done. That's only when status is nil so I added a check to return an unknown status if it's nil.

worker/backup_common.go, line 174 at r2 (raw file):

Previously, parasssh wrote…

It could be a crude cleanup. Example when id counter is at x, delete entry with id = x - 50. You get the idea.

Done.

worker/backup_common.go, line 31 at r3 (raw file):

Previously, parasssh wrote…

to confirm, we dont have queued status because we will deny a subsequent restore() request on the same alpha if one is already in progress (which will be in a separate PR), right? And the reason, we still keep a map of id -> status is for client to be able to query their restore's status.

(Please add something to this effect in the RFC for reference)

Yes, that's correct. I'll update the RFC.

worker/backup_common.go, line 137 at r3 (raw file):

Previously, parasssh wrote…

In Add() you can check if any of the map entries has a restoreStatus = IN_PROGRESS, then deny this Add(). You will have to loop thru the map but we dont expect too many entries in the map anyways.

Done.

worker/backup_common.go, line 166 at r3 (raw file):

Previously, parasssh wrote…

okStatus is confusing if there is an err found in the loop below. Maybe Set it to ERR status if there is even one error after the for loop.

Done.

worker/online_restore_ee.go, line 77 at r3 (raw file):

Previously, parasssh wrote…

This should fail when a restore is in progress.

Done.

worker/worker.go, line 61 at r3 (raw file):

Previously, parasssh wrote…

check for nil rt?

newRestoreTracker never returns nil. Also, the methods in the restore tracker already handle cases where it's nil.

parasssh

minor comments.

Reviewable status: 0 of 10 files reviewed, 21 unresolved discussions (waiting on @manishrjain, @martinmr, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/endpoints_ee.go, line 229 at r4 (raw file):

	type RestoreStatus {
		"""
		The status of the restore operation. One of UNKNOWN, IN_PROGRESS, OK, or ERR.

Do we do all CAPS? Feel free to resolve.

graphql/admin/restore_status.go, line 72 at r4 (raw file):

func convertStatus(status *worker.RestoreStatus) *restoreStatus {
	if status == nil {
		return nil

You mentioned above you will return unknown status here. Maybe you missed it.

martinmr

Reviewable status: 0 of 10 files reviewed, 21 unresolved discussions (waiting on @manishrjain, @MichaelJCompton, @parasssh, @pawanrawal, and @vvbalaji-dgraph)

graphql/admin/endpoints_ee.go, line 229 at r4 (raw file):

Previously, parasssh wrote…

Do we do all CAPS? Feel free to resolve.

Not really but I thought it would be useful to indicate they are kind of like enums. I don't think enums are supported in GraphQL so I did this instead.

graphql/admin/restore_status.go, line 72 at r4 (raw file):

Previously, parasssh wrote…

You mentioned above you will return unknown status here. Maybe you missed it.

I added the check in resolveRestoreStatus to check if the status is nil. If that's the case we return. so convertStatus won't be called with a nil value.

Previously, restore operations blocked the HTTP response until they were completed. This PR changes the flow so that the request returns immediately after the restore is proposed to the cluster. The request returns a unique ID for the restore operation that can be used to track the status of the response. Fixes DGRAPH-1697

martinmr added 4 commits June 22, 2020 16:39

Initial commit.

6e39208

handle nil cases.

0ef4596

Merge remote-tracking branch 'origin/master' into martinmr/restore-track

cddf9d1

Added graphql interface for tracking restore status.

90e860c

martinmr changed the title ~~Track restore operations.~~ feat(Restore): Async restore operations. Jun 23, 2020

martinmr marked this pull request as ready for review June 23, 2020 22:56

martinmr requested review from manishrjain, MichaelJCompton, pawanrawal and vvbalaji-dgraph as code owners June 23, 2020 22:56

martinmr added 2 commits June 23, 2020 15:58

Handle error returned by Done.

e095b16

Merge remote-tracking branch 'origin/master' into martinmr/restore-track

f417e89

martinmr added 2 commits June 24, 2020 12:10

Fix test.

a874451

Fix ACL test.

f99017c

martinmr commented Jun 24, 2020

View reviewed changes

parasssh suggested changes Jun 26, 2020

View reviewed changes

Merge branch 'master' into martinmr/restore-track

ad1aaac

martinmr commented Jun 26, 2020

View reviewed changes

Address review comments.

47bf8a0

parasssh suggested changes Jun 27, 2020

View reviewed changes

martinmr added 2 commits June 29, 2020 13:39

Merge branch 'master' into martinmr/restore-track

33bc0d6

Address review comments.

2392615

martinmr commented Jun 29, 2020

View reviewed changes

parasssh approved these changes Jun 29, 2020

View reviewed changes

martinmr added 2 commits June 29, 2020 18:21

Merge remote-tracking branch 'origin/master' into martinmr/restore-track

2341bcf

Merge branch 'master' into martinmr/restore-track

4ff91c1

martinmr commented Jun 30, 2020

View reviewed changes

martinmr added 2 commits June 30, 2020 14:23

fix OSS build.

83e376a

Fix formatting issues.

dcf9c4d

martinmr added 2 commits July 1, 2020 12:28

Merge branch 'master' into martinmr/restore-track

aee3ff0

Fix deepsource warnings.

35ff497

martinmr merged commit cc60659 into master Jul 1, 2020

martinmr deleted the martinmr/restore-track branch July 2, 2020 17:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(Restore): Async restore operations. #5704

feat(Restore): Async restore operations. #5704

martinmr commented Jun 22, 2020 •

edited by dgraph-bot

Loading

parasssh commented Jun 24, 2020

martinmr left a comment

parasssh left a comment

martinmr left a comment

parasssh left a comment

martinmr left a comment

parasssh left a comment

martinmr left a comment

feat(Restore): Async restore operations. #5704

feat(Restore): Async restore operations. #5704

Conversation

martinmr commented Jun 22, 2020 • edited by dgraph-bot Loading

parasssh commented Jun 24, 2020

martinmr left a comment

Choose a reason for hiding this comment

parasssh left a comment

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

parasssh left a comment

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

parasssh left a comment

Choose a reason for hiding this comment

martinmr left a comment

Choose a reason for hiding this comment

martinmr commented Jun 22, 2020 •

edited by dgraph-bot

Loading