
PX Resource Gateway #1621

Open

wants to merge 13 commits into base: release-24.2.0
Conversation


@dgoel-px dgoel-px commented Aug 5, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes (optional)
Closes #

Special notes for your reviewer:

Member

@lpabon lpabon left a comment

still reviewing...

Dockerfile.proto (outdated; thread resolved)
Makefile (outdated; thread resolved)
Makefile (outdated; thread resolved)
Makefile (outdated; thread resolved)
Member

@lpabon lpabon left a comment

continuing...

Makefile (outdated; thread resolved)
Makefile (outdated; thread resolved)
cmd/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
cmd/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
Member

@lpabon lpabon left a comment

Please also add unit tests for all of these files. You also need e2e tests: create small gRPC servers and have a client send requests to test the full paths.

pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
pkg/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
};
}

// KeepAlive sends a heartbeat to the semaphore service to keep the lock alive
Member

This needs a timeout. Keep alive for how long — for example, 5 seconds?

return &emptypb.Empty{}, nil
}

func (s *server) KeepAlive(ctx context.Context, req *pb.KeepAliveRequest) (*emptypb.Empty, error) {
Member

This function is confusing in its side effects and expectations. I added a comment on this RPC in the .proto file; please expand it to document how this function works and what callers should expect.

// check if the node is at the front of the queue
nextResourceId, priority := s.priorityQ.Front()
if nextResourceId == "" {
panic("Queue is empty")
Member

Can we do something other than panic here? I would suggest handling this case, for example by returning an error to the caller.

@@ -0,0 +1,310 @@
package server
Member

Please explain the algorithm in a comment header in this file to help maintainers (and me as the reviewer).

Member

I'm skipping this file until the algorithm is included.

Signed-off-by: Divyam Goel <[email protected]>
Signed-off-by: Divyam Goel <[email protected]>
Signed-off-by: Divyam Goel <[email protected]>
Contributor

@piyush-nimbalkar piyush-nimbalkar left a comment

Please add unit tests for the resource gateway installation and for the actual server logic before merging this PR.

Makefile (outdated; thread resolved)
Makefile (outdated; thread resolved)
deploy/crds/core_v1_storagecluster_crd.yaml (outdated; thread resolved)
deploy/crds/core_v1_storagecluster_crd.yaml (thread resolved)
@@ -109,6 +109,8 @@ const (
AnnotationPVCControllerSecurePort = pxAnnotationPrefix + "/pvc-controller-secure-port"
// AnnotationAutopilotCPU annotation for overriding the default CPU for Autopilot
AnnotationAutopilotCPU = pxAnnotationPrefix + "/autopilot-cpu"
// AnnotationPxResourceGatewayCPU annotation for overriding the default CPU for PxResourceGateway
AnnotationPxResourceGatewayCPU = pxAnnotationPrefix + "/px-resource-gateway-cpu"
Contributor

Why do we need this if we have resources section in the cluster spec?

Collaborator

Do we need this annotation? I see a couple of valid comments which are resolved w/o a comment. If you had an offline conversation can you mention them here.

Author

I resolved the comments because I had addressed them locally. I have pushed the changes now.

drivers/storage/portworx/component/px_resource_gateway.go (outdated; thread resolved)
drivers/storage/portworx/component/px_resource_gateway.go (outdated; thread resolved)
drivers/storage/portworx/component/px_resource_gateway.go (outdated; thread resolved)
drivers/storage/portworx/component/px_resource_gateway.go (outdated; thread resolved)
drivers/storage/portworx/component/px_resource_gateway.go (outdated; thread resolved)
@piyush-nimbalkar
Contributor

I have looked at only the deployment logic. I am assuming @nrevanna and @lpabon are looking at the actual server implementation.

pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/priority_queue.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
Signed-off-by: Divyam Goel <[email protected]>

This PR is stale because it has been in review for 3 days with no activity.

Comment on lines 89 to 99
// Setup a signal handler
signalHandler := util.NewSigIntManager(func() {
	pxResourceGatewayServer.Stop()
	os.Remove(PxResourceGatewayServerSocket)
	os.Exit(0)
})
err = signalHandler.Start()
if err != nil {
	fmt.Printf("Unable to start signal handler: %v\n", err)
	os.Exit(1)
}
Collaborator

Also set up memory and CPU profiling, and then add a signal handler for SIGUSR1 to dump the heap and stack profiles -

log.Infof("pprof profiling is enabled, creating profiling server at port %v", defaultPprofPort)


pkg/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
pkg/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)

var (
	NLocks                  = 10
	ConfigMapUpdateInterval = time.Second * 1
)
Collaborator

IMO 1 second is still too aggressive for a single client invoking the Kubernetes API. In comparison, our default operator reconciler wakes up every 30s and then makes a bunch of Kubernetes API calls if required.

If the goal is just to persist the locks so that we can recover if this pod crashes, isn't 30s good enough?
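A sketch of what a coarser persistence loop might look like under the reviewer's assumption that the ConfigMap exists only for crash recovery; `persistInterval`, `persistLoop`, and the callback signatures are illustrative, not the PR's API:

```go
package main

import (
	"reflect"
	"time"
)

// persistInterval mirrors the reviewer's suggestion: a coarse interval
// (30s, matching the operator's reconcile period) bounds the Kubernetes
// API call rate to roughly two writes per minute.
const persistInterval = 30 * time.Second

// isUpdateRequired reports whether the in-memory lock state differs
// from the last persisted ConfigMap data, so unchanged state is never
// rewritten.
func isUpdateRequired(current, persisted map[string]string) bool {
	return !reflect.DeepEqual(current, persisted)
}

// persistLoop snapshots state on each tick and calls write only when
// the state actually changed since the last successful write.
func persistLoop(state func() map[string]string, persisted map[string]string,
	write func(map[string]string) error, stop <-chan struct{}) {
	ticker := time.NewTicker(persistInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			cur := state()
			if !isUpdateRequired(cur, persisted) {
				continue // nothing changed; skip the API call
			}
			if err := write(cur); err == nil {
				persisted = cur
			}
		}
	}
}
```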

pkg/px-resource-gateway/semaphore.go (outdated; thread resolved)
data[clientId] = fmt.Sprintf("%d", lockId)
}

if !isUpdateRequired(data, s.configMap.Data) {
Collaborator

Is the goal to re-initialize s.locks from s.configMap.Data when this process starts? Right now, I don't see this ConfigMap's contents being used anywhere.
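If restoring state on startup is indeed the goal, a minimal sketch of a ConfigMap consumer might look like the following; `restoreLocks` is a hypothetical helper, and the clientID-to-lockID encoding is an assumption based on the `data[clientId] = fmt.Sprintf("%d", lockId)` line quoted above:

```go
package main

import "strconv"

// restoreLocks rebuilds the in-memory clientID -> lockID map from the
// persisted ConfigMap data on process start, skipping malformed entries
// rather than failing the whole recovery.
func restoreLocks(data map[string]string) map[string]int {
	locks := make(map[string]int, len(data))
	for clientID, v := range data {
		lockID, err := strconv.Atoi(v)
		if err != nil {
			continue // malformed entry; drop it instead of crashing
		}
		locks[clientID] = lockID
	}
	return locks
}
```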

pkg/px-resource-gateway/px_resource_gateway.go (outdated; thread resolved)
if _, ok := removedNode[clientId]; ok { // already removed
continue
}
s.priorityQ.Remove(clientId)
Collaborator

I am missing something here: why not remove a dead node unconditionally from the priority queue if it was part of s.locks?
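The unconditional removal the reviewer suggests could be sketched as a single pass over both structures; `reapDeadClients` and the slice-based queue are illustrative stand-ins for the PR's actual types:

```go
package main

// reapDeadClients removes every dead client in one pass: any lock held
// by a dead client is released, and every dead client is dropped from
// the queue unconditionally, with no per-client skip logic.
func reapDeadClients(dead []string, locks map[string]int, queue []string) ([]string, map[string]int) {
	deadSet := make(map[string]bool, len(dead))
	for _, c := range dead {
		deadSet[c] = true
		delete(locks, c) // release any lock held by a dead client
	}
	// Filter the queue in place, keeping only live clients.
	kept := queue[:0]
	for _, c := range queue {
		if !deadSet[c] {
			kept = append(kept, c)
		}
	}
	return kept, locks
}
```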


7 participants