feat(agent): implement HTTP JFR snapshot creation #1627

aali309
Contributor

@aali309 aali309 commented Aug 23, 2023

Welcome to Cryostat! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Fixes: #1613

Description of the change:

This change allows users to create a snapshot recording over HTTP.

Motivation for the change:

This continues the effort to bring Cryostat application feature parity between JMX and HTTP connection methods.

How to manually test:

  1. Same setup as in https://github.com/cryostatio/cryostat/pull/1566
  2. Test against the http://localhost:9988/ agent target started by smoketest.sh. In the Cryostat Web UI, go to Recordings, then Create, and check that a recording can be started and a snapshot can be created.

@aali309 aali309 requested a review from tthvo August 23, 2023 15:52
@github-actions github-actions bot added the needs-triage Needs thorough attention from code reviewers label Aug 23, 2023
@aali309 aali309 requested review from maxcao13 and mwangggg August 23, 2023 15:52
@aali309 aali309 self-assigned this Aug 23, 2023
@aali309 aali309 added feat New feature or request safe-to-test and removed needs-triage Needs thorough attention from code reviewers labels Aug 23, 2023
@andrewazores andrewazores changed the base branch from main to 1578-epic-two-way-agent-communications August 23, 2023 15:54
@aali309
Contributor Author

aali309 commented Aug 23, 2023

On this, what differentiates a regular recording request from a snapshot request on the endpoint? Or is that logic handled on the agent side, depending on the template used (i.e. custom) or on whether any recordings exist so that a snapshot can be created successfully? I am referring to the method handleStartRecordingOrSnapshot in cryostat-agent, src/main/java/io/cryostat/agent/remote/RecordingsContext.java

@github-actions
Contributor

Test image available:

$ CRYOSTAT_IMAGE=ghcr.io/cryostatio/cryostat:pr-1627-6785c3cb96771bbee84e41f6b4e17d44b2740d4d-linux-arm64 sh smoketest.sh

@github-actions
Contributor

Test image available:

$ CRYOSTAT_IMAGE=ghcr.io/cryostatio/cryostat:pr-1627-6785c3cb96771bbee84e41f6b4e17d44b2740d4d-linux-amd64 sh smoketest.sh

@andrewazores
Member

On this, what differentiates a regular recording request from a snapshot request on the endpoint? Or is that logic handled on the agent side, depending on the template used (i.e. custom) or on whether any recordings exist so that a snapshot can be created successfully? I am referring to the method handleStartRecordingOrSnapshot in cryostat-agent, src/main/java/io/cryostat/agent/remote/RecordingsContext.java

This needs to be coordinated with Ming's PR on the Agent (in this case acting as the server): cryostatio/cryostat-agent#186

@aali309
Contributor Author

aali309 commented Aug 23, 2023

On this, what differentiates a regular recording request from a snapshot request on the endpoint? Or is that logic handled on the agent side, depending on the template used (i.e. custom) or on whether any recordings exist so that a snapshot can be created successfully? I am referring to the method handleStartRecordingOrSnapshot in cryostat-agent, src/main/java/io/cryostat/agent/remote/RecordingsContext.java

This needs to be coordinated with Ming's PR on the Agent (in this case acting as the server): cryostatio/cryostat-agent#186

I did. Just confirming, because the endpoint is the same and the logic is handled in handleStartRecordingOrSnapshot, which @mwangggg wrote.

@andrewazores
Member

The endpoint is the same, but the Agent-side change does expect some particular request format in the payload to indicate that the request is for a snapshot:

https://github.com/cryostatio/cryostat-agent/pull/186/files#diff-fb08efc7a13d2b872d8abd6e8d6f983bc6e1960cb08979998949d371aa66803dR373

Member

@andrewazores andrewazores left a comment

Looking good. I think the only remaining thing to do is a one-liner change in AgentJFRService to call this method at the appropriate place (getSnapshotRecording() I think).

@andrewazores
Member

@aali309 @mwangggg please test this out between the two of you and both sides of the changes. I think I am seeing a bug when I try to use this, but perhaps I just built the pieces wrong.

@mwangggg
Member

I'm getting a 500 Internal Server Error too

@andrewazores
Member

andrewazores commented Aug 24, 2023

IIRC what I saw in the web-client was a 500, but looking at the logs the cause of that was that Cryostat received a 400 from the Agent on the request.

Cryostat should probably have responded to the web-client with a 502 instead of a generic 500, but even better, the root cause should be addressed :)
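
As a rough illustration of the 502 suggestion above, here is a minimal sketch of mapping the Agent's response status to the status returned to the web-client. The class and method names are hypothetical and are not Cryostat's actual code; the only assumption is that any non-2xx response from the Agent is treated as an upstream failure.

```java
/**
 * Hypothetical helper (not part of Cryostat): choose the status code sent to
 * the web-client after Cryostat has proxied a request to the Agent.
 */
final class UpstreamStatusMapper {
    static final int BAD_GATEWAY = 502;

    /** Pass 2xx responses through; report every other Agent response as 502. */
    static int toClientStatus(int agentStatus) {
        if (agentStatus >= 200 && agentStatus < 300) {
            return agentStatus;
        }
        // The failure happened upstream, so a gateway error describes it
        // better than a generic 500 Internal Server Error.
        return BAD_GATEWAY;
    }

    public static void main(String[] args) {
        System.out.println(toClientStatus(400)); // prints 502
        System.out.println(toClientStatus(201)); // prints 201
    }
}
```

A real handler would probably also want to log the original Agent status and body so that the root cause stays visible.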

@mwangggg
Member

I think I know what the problem is: the IsValid() logic is incorrect on the agent side.

@andrewazores
Member

That's what I thought, too.

@andrewazores
Member

SEVERE: HTTP 500: io.cryostat.net.AgentJFRService$UnimplementedException
io.vertx.ext.web.handler.HttpException: Internal Server Error
Caused by: java.util.concurrent.ExecutionException: io.cryostat.net.AgentJFRService$UnimplementedException
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at io.cryostat.net.web.http.api.v1.TargetSnapshotPostHandler.handleAuthenticated(TargetSnapshotPostHandler.java:92)
	at io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:80)
	at io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:50)
	at io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
	at io.vertx.core.impl.ContextBase.lambda$null$0(ContextBase.java:137)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)
	at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)
	at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.cryostat.net.AgentJFRService$UnimplementedException
	at io.cryostat.net.AgentJFRService.updateRecordingOptions(AgentJFRService.java:254)
	at io.cryostat.recordings.RecordingTargetHelper.lambda$createSnapshot$7(RecordingTargetHelper.java:287)
	at io.cryostat.net.TargetConnectionManager.executeConnectedTask(TargetConnectionManager.java:146)
	at io.cryostat.recordings.RecordingTargetHelper.createSnapshot(RecordingTargetHelper.java:268)
	... 12 more

Looks like there's a new/different error when I check now vs when I checked last. It looks like the snapshot request actually does go through successfully (also supported by examining the logs on both components):

image

But the Cryostat server also tries to invoke updateRecordingOptions() after creating the snapshot, and this operation is not implemented. The purpose of this operation is to change recordings' properties, in particular for this case the recording name, so it should be something we can implement on the Cryostat server and the Agent, too. If @aali309 and @mwangggg are interested, we could continue with that work in these same two PRs, since this extra operation is also required as part of the overall story of implementing snapshot capabilities.

@andrewazores
Member

Basically, after creating the remote snapshot, the server also wants to be able to rename the recording. Since we have more control over this process when doing it over HTTP it might make sense to simply provide that name on the original request, rather than perform the request to create the snapshot and immediately try to rename it, but that would require some deeper refactoring and changing of logic since the JMX side does not support that flow. For simplicity I think we can just proceed with maintaining that same logic even over HTTP: create the snapshot first, then follow-up with a request to rename it.

The updateRecordingOptions method allows for the properties handled by the RecordingOptionsBuilder to be updated - these are generally the same properties that can be set when first creating a recording, so the name and duration in addition to the "Advanced Options" we expose in the UI (toDisk, maxAge, maxSize). I don't think we want/need to support the other properties.

I would suggest these might be done by the Cryostat server sending a PATCH /recordings/:id to the Agent, and the request body being JSON like { "propertyName": updatedValue, "propertyName2": updatedValue2 }, so more concretely in this case it would be { "name": "snapshot-rename" }. The Agent can accept this request, parse the body, and use the JSON structure to determine what changes it should apply to the recording with the given ID.
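
A minimal sketch of what building such a rename request could look like with the JDK's java.net.http client. The class name, the helper names, and the recording ID 1 are made up for illustration, and the base URL is just the smoketest agent target mentioned in the PR description. The request is only constructed here, not sent, and the hand-rolled JSON escaping covers quotes and backslashes only.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch of the proposed PATCH /recordings/:id rename request. */
final class RenameRequestSketch {
    /** Build a JSON body of the form {"name":"..."} (minimal escaping only). */
    static String renameBody(String newName) {
        String escaped = newName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"name\":\"" + escaped + "\"}";
    }

    /** Construct (but do not send) the PATCH request for one recording. */
    static HttpRequest buildPatch(URI agentBase, long recordingId, String newName) {
        return HttpRequest.newBuilder(agentBase.resolve("/recordings/" + recordingId))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(renameBody(newName)))
                .build();
    }

    public static void main(String[] args) {
        // Assumed values: smoketest agent URL and an arbitrary recording ID.
        HttpRequest req =
                buildPatch(URI.create("http://localhost:9988/"), 1L, "snapshot-rename");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(renameBody("snapshot-rename"));
    }
}
```

In practice the body would more likely be produced by whatever JSON library the component already uses rather than by string concatenation.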

@aali309 aali309 force-pushed the atifs-httpJFRsnapshotCreation branch from f129c39 to 0709324 Compare August 27, 2023 23:45
@mergify mergify bot left a comment

Pull Request blocked. web-client submodule updates are performed automatically by CI when that repository is updated. Please revert or drop all changes to the web-client submodule from this PR and perform any required frontend work by opening and merging a PR against cryostat-web.

mergify bot previously requested changes Aug 27, 2023
@aali309 aali309 force-pushed the atifs-httpJFRsnapshotCreation branch from 0709324 to 02b4dbb Compare August 28, 2023 01:32
@aali309
Contributor Author

aali309 commented Aug 28, 2023

Need help debugging this

When I test whether a snapshot is being created, it appears to be created, but I get a 500 error:
Request failed (500 Internal Server Error) org.openjdk.jmc.rjmx.services.jfr.FlightRecorderException: Failed to create snapshot recording caused by RuntimeException: Unknown failure
Checking the recordings, however, shows that a snapshot was created:
image
But when I view the report, this is what I see:
{"meta":{"type":"text/plain","status":"Internal Server Error"},"data":{"reason":"java.lang.NullPointerException: Cannot invoke \"io.vertx.core.buffer.Buffer.getBytes()\" because \"b\" is null"}}
If I am right, the error points at the handleGetRecording method in the RecordingsContext.java class in cryostat-agent.

@andrewazores
Member

andrewazores commented Aug 29, 2023

If you check the Cryostat server logs, you should be able to find the stack trace of that NullPointerException and this will show you the exact line of code where it is occurring, so you will be able to tell which Buffer b is null.

Seeing the snapshot get created, followed immediately by an HTTP 500 and the recording not being automatically renamed to something like snapshot-4, is caused by the (snapshot) recording failing to update with new recording options, since the recording's name is technically one of its options.

@aali309 aali309 force-pushed the atifs-httpJFRsnapshotCreation branch 2 times, most recently from 942c548 to 151ba16 Compare September 7, 2023 17:43
@andrewazores
Member

andrewazores commented Sep 11, 2023

image

image

@aali309 @mwangggg I haven't looked into which side is causing this, but it looks to me like the rename attempt is failing because there are quotation marks surrounding the new name.

So from the evidence in these screenshots it seems:

  1. the Cryostat server asks the Agent to create a new snapshot
  2. the Agent creates the snapshot with its default name (just "Snapshot")
  3. the Cryostat server asks the Agent to rename that recording to "snapshot-n" where n is a numeric ID. The rename occurs, but either the server actually asked for "\"snapshot-n\"" or the agent interpreted it that way
  4. the Cryostat server then checks with the Agent whether there is a recording named "snapshot-n" (or "\"snapshot-n\""), and this fails, either because the server asked for the wrong thing or the Agent interpreted it the wrong way
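
To illustrate how a mismatch like the one in steps 3 and 4 could arise, here is a small sketch. It only shows one possible mechanism (a JSON string token being used as the name without removing its quotes) and is not a confirmed diagnosis of either component; all names in it are hypothetical.

```java
import java.util.List;

/** Hypothetical illustration of a quoted-name mismatch; not Cryostat or Agent code. */
final class QuotedNameSketch {
    /** The raw JSON token for a string value still carries its quotes. */
    static String rawJsonToken(String value) {
        return "\"" + value + "\"";
    }

    /** Strip one pair of surrounding quotes, if present (escapes are not handled). */
    static String unquote(String token) {
        if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
            return token.substring(1, token.length() - 1);
        }
        return token;
    }

    /** Exact-match lookup, like checking whether a recording with this name exists. */
    static boolean exists(List<String> recordingNames, String wanted) {
        return recordingNames.contains(wanted);
    }

    public static void main(String[] args) {
        String token = rawJsonToken("snapshot-1");
        // Using the raw token as the new name stores the quotes too,
        // so a later lookup for the plain name finds nothing.
        System.out.println(exists(List.of(token), "snapshot-1")); // prints false
        // Unquoting before the rename makes the lookup succeed.
        System.out.println(exists(List.of(unquote(token)), "snapshot-1")); // prints true
    }
}
```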

https://github.com/cryostatio/cryostat/blob/198c811d777eae7472ccb124349049c39068c947/src/main/java/io/cryostat/recordings/RecordingTargetHelper.java#L339

This check is almost the last thing the server does before sending an OK response to the original client: https://github.com/cryostatio/cryostat/blob/198c811d777eae7472ccb124349049c39068c947/src/main/java/io/cryostat/net/web/http/api/v1/TargetSnapshotPostHandler.java#L92

After that succeeds the server will try to open the remote snapshot stream to verify that the snapshot actually has contents, which uses the same remote streaming API that we have for downloading recordings and which already works. If the verification fails then the server will ask the agent to delete the recording, which also should already work. So I think this rename operation bug is the last hurdle to getting this feature across the finish line.
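
The overall sequence described above (create, rename, check by name, verify contents, delete on failure) can be sketched roughly as follows. FakeAgent is a made-up in-memory stand-in for the Agent's HTTP API, and the method names and the choice to return an empty Optional on failure are assumptions for illustration, not the actual RecordingTargetHelper logic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the snapshot flow; not the real Cryostat implementation. */
final class SnapshotFlowSketch {
    /** In-memory stand-in for the Agent; the real interaction happens over HTTP. */
    static final class FakeAgent {
        private final Map<Long, String> names = new HashMap<>();
        private final Map<Long, byte[]> data = new HashMap<>();
        private final byte[] snapshotContents;
        private long nextId = 1;

        FakeAgent(byte[] snapshotContents) {
            this.snapshotContents = snapshotContents;
        }

        long createSnapshot() {
            long id = nextId++;
            names.put(id, "Snapshot"); // default name before the rename
            data.put(id, snapshotContents);
            return id;
        }

        void rename(long id, String newName) {
            names.replace(id, newName);
        }

        Optional<Long> findByName(String name) {
            return names.entrySet().stream()
                    .filter(e -> e.getValue().equals(name))
                    .map(Map.Entry::getKey)
                    .findFirst();
        }

        byte[] read(long id) {
            return data.get(id);
        }

        void delete(long id) {
            names.remove(id);
            data.remove(id);
        }

        int count() {
            return names.size();
        }
    }

    /** Returns the final snapshot name, or empty if any verification step fails. */
    static Optional<String> createSnapshot(FakeAgent agent) {
        long id = agent.createSnapshot();           // create with the default name
        String name = "snapshot-" + id;
        agent.rename(id, name);                     // follow-up rename request
        if (agent.findByName(name).isEmpty()) {     // name check after the rename
            return Optional.empty();
        }
        if (agent.read(id).length == 0) {           // verify the snapshot has contents
            agent.delete(id);                       // clean up an empty snapshot
            return Optional.empty();
        }
        return Optional.of(name);
    }

    public static void main(String[] args) {
        System.out.println(createSnapshot(new FakeAgent(new byte[] {1, 2, 3})));
        System.out.println(createSnapshot(new FakeAgent(new byte[0])));
    }
}
```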

@andrewazores andrewazores merged commit c3559a9 into cryostatio:1578-epic-two-way-agent-communications Sep 13, 2023
andrewazores added a commit that referenced this pull request Sep 13, 2023
@aali309 aali309 deleted the atifs-httpJFRsnapshotCreation branch September 13, 2023 17:18
@aali309 aali309 restored the atifs-httpJFRsnapshotCreation branch September 13, 2023 17:18
@aali309 aali309 deleted the atifs-httpJFRsnapshotCreation branch September 13, 2023 17:31
andrewazores added a commit that referenced this pull request Sep 15, 2023
andrewazores added a commit that referenced this pull request Sep 18, 2023
andrewazores added a commit that referenced this pull request Sep 18, 2023
andrewazores added a commit that referenced this pull request Sep 19, 2023
andrewazores added a commit that referenced this pull request Sep 19, 2023
* feat(agent): implement Agent HTTP dynamic JFR start (#1566)

* chore(svc): extract EventOptionsBuilder to -core and use new CryostatFlightRecorderService

* test(smoketest): enable API writes on one agent-equipped sample app

* chore(serial): extract recording descriptor to -core

* chore(activerecordings): clean up an error handler

* feat(agent): implement dynamic start of JFR over HTTP

* bump -core version

* feat(agent): implement Agent HTTP dynamic JFR stop/delete (#1604)

* feat(agent): implement Agent HTTP recording retrieval (#1607)

* feat(agent): implement HTTP JFR snapshot creation (#1627)

Co-authored-by: Atif Ali <[email protected]>
Labels
feat New feature or request safe-to-test
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Story] Implement HTTP JFR snapshot creation
3 participants