
SSH Session recording modes #12916
Merged (27 commits) Jun 6, 2022

Conversation

@gabrielcorado (Contributor) commented May 25, 2022

Implements the Session recording modes RFD for SSH sessions.

The overall change consists of reserving disk space for audit events and deciding how to proceed if the reservation fails:

  • For strict mode, stop the session immediately;
  • For best-effort mode, keep the session and discard session data events, avoiding disk I/O.

Note: This PR is missing an integration test. I'm still working on it, and it will be pushed to this PR when ready.

Here is an overview of what was changed in each file:

  • lib/events/stream.go:
    • ProtoStream now calls ReserveUploadPart when initializing its internal slice. With this change, it can identify disk issues before emitting events;
    • ProtoStream initializes the first slice during its own initialization instead of when receiving the first event. With this change, resuming or creating new streams will fail immediately;
    • Failures on the newSlice function will cause the stream to be closed;
  • lib/events/auditwriter.go: Depending on which error the AuditWriter receives from the stream, it will not try to recover it.
  • lib/srv/sess.go
    • Changed the recorder initialization. Now there is a fallback that is used when the initial audit writer fails during initialization, and the session is set to best_effort;
    • All events emitted on the session now use the emitAuditEvent function. This function is capable of replacing the recorder (similar to what is done at the session start);
    • The recorder has its own sync.RWMutex. This mutex is used to replace the session recorder. It is dedicated to the recorder to avoid locking the session struct when emitting events;
  • lib/events/filesessions/filestream.go
    • Implements the ReserveUploadPart, which will create a file and fill it with the minimum bytes required by the stream (this size is defined using the stream constants);

Next steps:

  • When the session is terminated, some upload parts may be left in the folder. The initial idea is to create a routine that tries to upload those files. This will be done in a separate PR;
  • Implement session recording mode in Kubernetes sessions.

@gabrielcorado gabrielcorado self-assigned this May 25, 2022
@github-actions github-actions bot requested review from espadolini and tcsc May 25, 2022 18:15
@github-actions github-actions bot added the audit-log (Issues related to Teleport's Audit Log) label May 25, 2022
@Joerger (Contributor) commented May 25, 2022

Next steps:

When the session is terminated, some upload parts may be left in the folder. The initial idea is to create a routine that tries to upload those files. This will be done in a separate PR;

This should be handled already by the UploadCompleter, which runs on each recording service (node, proxy, etc.) to complete any abandoned uploads. It currently waits a full 24 hours before attempting to complete an upload, but once we can deprecate that grace period in v11, the completer will instead use the SessionTrackerService to determine if an upload is abandoned - #11551

api/types/types.proto (resolved)
lib/events/auditwriter.go (resolved)
lib/events/stream.go (resolved)
lib/services/role.go (resolved)
@@ -1532,6 +1532,44 @@ func (set RoleSet) CertificateExtensions() []*types.CertExtension {
return exts
}

// SessionRecordingMode returns the recording mode for a specific service.
func (set RoleSet) SessionRecordingMode(service constants.SessionRecordingService) constants.SessionRecordingMode {
defaultValue := constants.SessionRecordingModeBestEffort
Contributor:

Should this default to Strict so that BestEffort is an opt-in feature?

@smallinsky (Contributor) May 31, 2022:

In the RFD we decided to use best-effort:

In addition, there will be a default entry for setting a default value used when sessions kind doesn't have one set. Currently, Teleport doesn't prevent sessions from happening if there is a failure on the recording. To keep this behavior after introducing the recording modes, the default value will be set to best_effort.

to preserve current behaviour. I agree that we should consider setting strict mode as the default, to ensure that audit logs from the session will be emitted by default.

@r0mant @gabrielcorado WDYT ?

Collaborator:

We discussed it on the product meeting before and decided to default to best-effort to keep existing behavior for users.

@Joerger (Contributor) May 31, 2022:

I believe that Strict is the current behavior though -

teleport/lib/srv/sess.go

Lines 529 to 535 in e8bfe2c

sess.io.OnWriteError = func(idString string, err error) {
if idString == sessionRecorderID {
sess.log.Error("Failed to write to session recorder, stopping session.")
// stop in goroutine to avoid deadlock
go sess.Stop()
}
}

Previously a session recording error would freeze the session, and with my recent changes it terminates the session.

Collaborator:

If we default to strict, then people will still get locked out of their nodes when they run out of disk space (which was the primary motivator for these changes) by default. IIRC this was the reason we chose best-effort as default.

@Joerger (Contributor) May 31, 2022:

Hmm, I agree that best-effort would be a more reasonable default, but the issue remains that this changes the default behavior, which is strict; it seems this decision was made under the presumption that the current behavior is best-effort.

Can we update the RFD and make sure to express this default behavior change in the release notes? Like Marek said, the current assumption by users would be that an audit write failure, due to disk issues or otherwise, would result in the session being frozen/terminated.

Alternatively we could make old roles default to strict and new roles default to best-effort.

lib/srv/sess.go (resolved)
// Open a BPF recording session. If BPF was not configured, not available,
// or running in a recording proxy, OpenSession is a NOP.
s.bpfContext = &bpf.SessionContext{
Context: ctx.srv.Context(),
PID: s.term.PID(),
Emitter: s.recorder,
Emitter: s.Recorder(),
Contributor:

How will enhanced session recording respond to the recorder being closed? Should it be included in the best-effort approach?

Contributor (author):

For best-effort scenarios, it is not affected. The only thing is that it will still try to emit events even after the session has entered "no recording mode", producing warning messages for the rest of the session.

The session will still be closed in strict mode when emitting data events (which still happen even when the session has enhanced recording).

@smallinsky (Contributor) left a comment:

First pass. I have left some comments.
Could you also take a look at the unit tests? They currently fail with:

lib/services/role.go:1552:18: undefined: constants.SessionRecordingServiceKubernetes
lib/services/role.go:1553:29: recordSession.Kubernetes undefined (type *"github.com/gravitational/teleport/api/types".RecordSession has no field or method Kubernetes)

api/constants/constants.go (resolved)
Comment on lines 274 to 277
_, err := rand.Read(buf)
if err != nil {
return trace.Wrap(err)
}
Contributor:

Do we need to fill the buf with random data?

Contributor (author):

I wasn't sure if writing an "empty" buffer would work, since I had tested with Truncate and it doesn't allocate space. After some tests, it is working as expected. I'll remove this part; we can discuss this further if anything comes up in the integration tests.

lib/events/s3sessions/s3stream.go (resolved)
lib/events/auditwriter_test.go (resolved)

err := w.proto.cfg.Uploader.ReserveUploadPart(w.proto.cancelCtx, w.proto.cfg.Upload, w.lastPartNumber)
if err != nil {
return nil, trace.ConnectionProblem(err, uploaderReservePartErrorMessage)
Contributor:

Is ReserveUploadPart really a ConnectionProblem-related error?

Contributor (author):

I couldn't find a better trace error for this. Do you have any suggestions?

In addition, some uploader implementations, for example, the S3 uploader, will probably raise a connection problem on the ReserveUploadPart calls.

return eventOnlyRec, nil
}

return nil, trace.ConnectionProblem(err, sessionRecordingErrorMessage)
Contributor:

Same here. Is an error from events.NewAuditWriter really a ConnectionProblem kind of error?

lib/srv/sess.go (resolved)
Comment on lines 1747 to 1748
if idString == sessionRecorderID {
switch s.scx.Identity.RoleSet.SessionRecordingMode(constants.SessionRecordingServiceSSH) {
Contributor:

idString == sessionRecorderID means the error originated from the session recorder, but in s.BroadcastSystemMessage we will retry the write; consequently, the broadcast's Write call for sessionRecorderID will also fail and invoke onWriteError a second time, creating an infinite callback loop: BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> ...

@gabrielcorado (Contributor, author) May 30, 2022:

The TermManager Write/BroadcastMessage functions take a lock to access the writers list, meaning that if we directly called a writer inside this callback, it would deadlock; that's why both BroadcastSystemMessage and Close are called inside a goroutine.

After that, when a writer errors, it is removed from the TermManager writers list; since we have only one sessionRecorderID, this callback can only be called once.

lib/srv/sess.go (resolved)
if err != nil {
return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
}
s.setRecorder(newRecorder)
Contributor:

Don't we need to close the old recorder (rec.Close())?

Contributor (author):

It is not needed, since the replacement only happens when the recorder is already canceled (closed).

Comment on lines +1732 to +1736
case <-rec.Done():
newRecorder, err := newEventOnlyRecorder(s, s.scx)
if err != nil {
return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
}
Contributor:

How do we know that the recorder was closed due to the uploaderReservePart error?

Contributor (author):

Here we don't need to know why it was closed. This function will just try to emit the events even if the current recorder is closed. We could rename it to tryEmitAuditEvent to make that clearer.

@gabrielcorado (Contributor, author) commented:

This should be handled already by the UploadCompleter, which runs on each recording service (node, proxy, etc.) to complete any abandoned uploads. It currently waits a full 24 hours before attempting to complete an upload, but once we can deprecate that grace period in v11, the completer will instead use the SessionTrackerService to determine if an upload is abandoned - #11551

@Joerger Right. Do you think it is worth changing the UploadCompleter to deal with disk issues as well?

@Joerger (Contributor) commented May 31, 2022

Do you think that it is worth changing UploadCompleter to deal with disk issues as well?

Sure, though I'm not sure it's necessary. Right now it just keeps attempting to upload abandoned recordings (every 10 minutes) regardless of any errors it runs into. Would you instead delete the recording from disk when encountering certain errors from CompleteUpload, and which errors would those be?

@espadolini espadolini removed their request for review June 1, 2022 11:19
@smallinsky (Contributor) left a comment:

Left some nit comments, but it mostly looks good to me once the remaining comments are addressed. @Joerger @jakule Could you also take a look?

lib/events/filesessions/filestream.go (resolved)
lib/events/filesessions/filestream.go (resolved)
lib/events/stream.go (resolved)
lib/utils/fs.go (resolved)
lib/events/stream.go (resolved)
lib/events/stream.go (resolved)
@Joerger (Contributor) left a comment:

LGTM so long as we update the RFD to reflect that defaulting to best_effort is an intentional behavior change.

@gabrielcorado gabrielcorado enabled auto-merge (squash) June 6, 2022 17:20