SSH Session recording modes #12916
Conversation
This should be handled already by the
@@ -1532,6 +1532,44 @@ func (set RoleSet) CertificateExtensions() []*types.CertExtension {
	return exts
}

// SessionRecordingMode returns the recording mode for a specific service.
func (set RoleSet) SessionRecordingMode(service constants.SessionRecordingService) constants.SessionRecordingMode {
	defaultValue := constants.SessionRecordingModeBestEffort
Should this default to Strict so that BestEffort is an opt-in feature?
In the RFD we decided to use best-effort to preserve current behaviour:

In addition, there will be a default entry for setting a default value used when a session kind doesn't have one set. Currently, Teleport doesn't prevent sessions from happening if there is a failure on the recording. To keep this behavior after introducing the recording modes, the default value will be set to best_effort.

That said, I agree that we should consider making strict mode the default, to ensure that by default audit logs from the session will be emitted.
@r0mant @gabrielcorado WDYT?
We discussed it at the product meeting before and decided to default to best-effort to keep existing behavior for users.
I believe that Strict is the current behavior though:
Lines 529 to 535 in e8bfe2c
sess.io.OnWriteError = func(idString string, err error) {
	if idString == sessionRecorderID {
		sess.log.Error("Failed to write to session recorder, stopping session.")
		// stop in goroutine to avoid deadlock
		go sess.Stop()
	}
}
Previously a session recording error would freeze the session, and with my recent changes it terminates the session.
If we default to strict, then by default people will still get locked out of their nodes when they run out of disk space (which was the primary motivator for these changes). IIRC this was the reason we chose best-effort as the default.
Hmmm, I agree that best-effort would be a more reasonable default, but the issue remains that this changes the default behavior, which is currently strict, and it seems this decision was made under the presumption that the current behavior is best-effort.
Can we update the RFD and make sure to call out this default behavior change in the release notes? Like Marek said, users currently assume that an audit write failure, due to disk issues or otherwise, results in the session being frozen/terminated.
Alternatively, we could make old roles default to strict and new roles default to best-effort.
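As a side note on the mechanics, here is a minimal, self-contained sketch of a strict-wins resolution over a role set, with the default passed in so either policy could be wired up; all names below are stand-ins, not Teleport's actual code:

```go
package main

import "fmt"

// All types and values below are stand-ins for illustration; Teleport's real
// definitions live in api/constants and lib/services and may differ.
type SessionRecordingMode string

const (
	ModeBestEffort SessionRecordingMode = "best_effort"
	ModeStrict     SessionRecordingMode = "strict"
)

// Role carries an optional recording mode; the empty string means "unset".
type Role struct {
	SSHRecordingMode SessionRecordingMode
}

type RoleSet []Role

// SessionRecordingMode resolves the mode across all roles: any role that
// demands strict wins, otherwise the supplied default applies, so either
// best_effort or strict could be wired in as the fallback.
func (set RoleSet) SessionRecordingMode(def SessionRecordingMode) SessionRecordingMode {
	for _, role := range set {
		if role.SSHRecordingMode == ModeStrict {
			return ModeStrict
		}
	}
	return def
}

func main() {
	set := RoleSet{{SSHRecordingMode: ModeStrict}, {}}
	fmt.Println(set.SessionRecordingMode(ModeBestEffort)) // strict
}
```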
// Open a BPF recording session. If BPF was not configured, not available,
// or running in a recording proxy, OpenSession is a NOP.
s.bpfContext = &bpf.SessionContext{
	Context: ctx.srv.Context(),
	PID:     s.term.PID(),
-	Emitter: s.recorder,
+	Emitter: s.Recorder(),
How will enhanced session recording respond to the recorder being closed? Should it be included in the best-effort approach?
For best-effort scenarios, it is not affected. The only thing is that it will still try to emit events even after the session has entered "no recording mode", producing warning messages for the rest of the session.
In strict mode, the session will still be closed when emitting data events (which still happen even when the session has enhanced recording).
First pass. I have left some comments.
Could you also take a look at the UT?
lib/services/role.go:1552:18: undefined: constants.SessionRecordingServiceKubernetes
lib/services/role.go:1553:29: recordSession.Kubernetes undefined (type *"github.com/gravitational/teleport/api/types".RecordSession has no field or method Kubernetes)
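The first error points at a missing service constant, and the second at a missing Kubernetes field on the RecordSession type (which is generated from protobuf, so it would need a .proto change plus regeneration). A hypothetical sketch of the constant side only, with invented string values:

```go
// Hypothetical sketch only: the real definitions belong in api/constants,
// and the string values here are invented.
package constants

// SessionRecordingService identifies which protocol a recording mode applies to.
type SessionRecordingService string

const (
	SessionRecordingServiceSSH SessionRecordingService = "ssh"
	// Referenced by lib/services/role.go:1552 but not yet defined.
	SessionRecordingServiceKubernetes SessionRecordingService = "k8s"
)
```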
_, err := rand.Read(buf)
if err != nil {
	return trace.Wrap(err)
}
Do we need to fill the buf with random data?
I wasn't sure if writing an "empty" buffer would work, since I had tested with Truncate and it doesn't allocate space. After some tests, writing a zero-filled buffer works as expected. I'll remove this part; we can discuss it further if anything appears in the integration tests.
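For illustration, a minimal sketch of the write-based reservation under discussion (reservePart and the path are invented, not the PR's implementation):

```go
package main

import "os"

// reservePart and the file name are invented for this sketch; this is not
// the PR's ReserveUploadPart. On most filesystems Truncate only sets the
// file size (a sparse file, no blocks allocated), while an explicit Write
// forces allocation, so a full disk fails here instead of mid-session.
func reservePart(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	// A zero-filled buffer claims the space just as well as rand.Read.
	if _, err := f.Write(make([]byte, size)); err != nil {
		return err
	}
	// Sync so ENOSPC surfaces now rather than on a later flush.
	return f.Sync()
}

func main() {
	if err := reservePart("part-0.reservation", 1<<20); err != nil {
		panic(err)
	}
}
```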
err := w.proto.cfg.Uploader.ReserveUploadPart(w.proto.cancelCtx, w.proto.cfg.Upload, w.lastPartNumber)
if err != nil {
	return nil, trace.ConnectionProblem(err, uploaderReservePartErrorMessage)
Is an error from ReserveUploadPart really a ConnectionProblem-related error?
I couldn't find a better trace error type for this. Do you have any suggestions?
In addition, some uploader implementations, for example the S3 uploader, will probably raise an actual connection problem on ReserveUploadPart calls.
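To show why the classification matters to callers, a small self-contained sketch; the reserve stub is invented, and only trace.ConnectionProblem and trace.IsConnectionProblem are assumed from the gravitational/trace library:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/gravitational/trace"
)

// reserve is an invented stand-in for an uploader's ReserveUploadPart call.
func reserve() error { return errors.New("no space left on device") }

func main() {
	// Wrapping the failure as a ConnectionProblem lets callers branch on the
	// error class, e.g. to decide whether the failure is worth retrying.
	err := trace.ConnectionProblem(reserve(), "failed to reserve upload part")
	fmt.Println(trace.IsConnectionProblem(err)) // true
}
```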
	return eventOnlyRec, nil
}

return nil, trace.ConnectionProblem(err, sessionRecordingErrorMessage)
Same here. Is an error from events.NewAuditWriter really a ConnectionProblem kind of error?
lib/srv/sess.go
Outdated
if idString == sessionRecorderID {
	switch s.scx.Identity.RoleSet.SessionRecordingMode(constants.SessionRecordingServiceSSH) {
idString == sessionRecorderID means the error originated from the session recorder, but s.BroadcastSystemMessage will retry the write; consequently the Write (broadcast) call for sessionRecorderID will also fail, and onWriteError will be called a second time, creating an infinite callback loop: BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> ...
The TermManager Write/BroadcastMessage functions use a lock to access the writers list, meaning that directly calling a writer inside this callback would deadlock; that's why both BroadcastSystemMessage and Close are called inside a goroutine.
After that, when a writer returns an error, it is removed from the TermManager writers list; since there is only one sessionRecorderID writer, this callback can only be called once.
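A minimal, runnable model of this callback pattern, showing why the goroutine is needed and why removing the failed writer bounds the recursion; the manager type is invented and is not TermManager:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager and its methods are invented to model the pattern; this is not
// the TermManager implementation.
type manager struct {
	mu      sync.Mutex
	writers map[string]func([]byte) error
}

// Broadcast holds mu while walking the writers list, so the error path must
// not re-enter Broadcast on the same goroutine.
func (m *manager) Broadcast(msg []byte) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for id, w := range m.writers {
		if err := w(msg); err != nil {
			// Remove the failed writer first: with a single recorder writer,
			// the callback can then fire at most once.
			delete(m.writers, id)
			// Calling onWriteError inline would deadlock on mu; the goroutine
			// breaks the cycle, mirroring the PR's use of goroutines for
			// BroadcastSystemMessage and Close.
			go m.onWriteError(id, err)
		}
	}
}

func (m *manager) onWriteError(id string, err error) {
	m.Broadcast([]byte(fmt.Sprintf("writer %q failed: %v", id, err)))
}

func main() {
	m := &manager{writers: map[string]func([]byte) error{
		"recorder": func([]byte) error { return fmt.Errorf("disk full") },
	}}
	m.Broadcast([]byte("hello"))
	time.Sleep(100 * time.Millisecond) // let the async callback run in this demo
}
```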
if err != nil {
	return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
}
s.setRecorder(newRecorder)
Don't we need to close the old recorder? rec.Close()
It is not needed since the replacement will only happen when the recorder is already canceled (closed).
case <-rec.Done():
	newRecorder, err := newEventOnlyRecorder(s, s.scx)
	if err != nil {
		return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
	}
How do we know that the recorder was closed due to an uploaderReservePart error?
Here we don't need to know the reason it was closed. This function will simply try to emit the events even if the current recorder is closed. We can rename it to tryEmitAuditEvent to make that clearer.
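Putting the pieces of this thread together, a self-contained sketch of such a tryEmitAuditEvent with the recorder behind its own RWMutex (as the PR description mentions); every name here is a stand-in and the real Teleport interfaces differ:

```go
package main

import (
	"fmt"
	"sync"
)

// Recorder and everything below are stand-ins; the real Teleport interfaces
// and constructors differ.
type Recorder interface {
	EmitAuditEvent(event string) error
	Done() <-chan struct{}
}

type session struct {
	mu       sync.RWMutex // dedicated to the recorder, so emitting events
	recorder Recorder     // never takes a session-wide lock
}

func (s *session) Recorder() Recorder {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.recorder
}

func (s *session) setRecorder(rec Recorder) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.recorder = rec
}

// tryEmitAuditEvent emits on the current recorder; if that recorder is
// already done (closed), it swaps in an events-only replacement and retries.
// The old recorder needs no Close here because it is only replaced once it
// has already been canceled, matching the answer above.
func (s *session) tryEmitAuditEvent(event string, newEventOnlyRecorder func() (Recorder, error)) error {
	rec := s.Recorder()
	select {
	case <-rec.Done():
		newRec, err := newEventOnlyRecorder()
		if err != nil {
			return fmt.Errorf("failed to recreate audit events recorder: %w", err)
		}
		s.setRecorder(newRec)
		return newRec.EmitAuditEvent(event)
	default:
		return rec.EmitAuditEvent(event)
	}
}

type eventOnlyRecorder struct{ done chan struct{} }

func (r *eventOnlyRecorder) EmitAuditEvent(event string) error {
	fmt.Println("emitted:", event)
	return nil
}

func (r *eventOnlyRecorder) Done() <-chan struct{} { return r.done }

func main() {
	closed := &eventOnlyRecorder{done: make(chan struct{})}
	close(closed.done) // simulate a recorder that failed and was canceled
	s := &session{recorder: closed}
	_ = s.tryEmitAuditEvent("session.data", func() (Recorder, error) {
		return &eventOnlyRecorder{done: make(chan struct{})}, nil
	})
}
```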
@Joerger Right. Do you think that it is worth changing
Sure, though I'm not sure if it's necessary. Right now it will just keep attempting to upload abandoned recordings (every 10 minutes) regardless of any errors it runs into. Would you instead just delete the recording from disk when encountering certain errors with
LGTM so long as we update the RFD to reflect that defaulting to best_effort is an intentional behavior change.
Implements the Session recording modes RFD for SSH sessions.
The overall change consists of reserving disk space for audit events and deciding how to proceed if this reservation fails.
Note: This PR is missing an integration test, I'm still working on it, and when it's ready, it will be pushed to this PR.
Here is an overview of what was changed in each file:

- lib/events/stream.go:
  - ProtoStream now calls ReserveUploadPart when initializing its internal slice. With this change, it is able to identify disk issues before emitting events;
  - ProtoStream initializes the first slice during its own initialization instead of when receiving the first event. With this change, resuming or creating new streams will fail immediately; failures in the newSlice function will cause the stream to be closed;
- lib/events/auditwriter.go: Depending on which error the AuditWriter receives from the stream, it will not try to recover it.
- lib/srv/sess.go:
  - Session recording failures are now handled according to the role's recording mode, defaulting to best_effort;
  - Events are emitted using the emitAuditEvent function. This function is capable of replacing the recorder (similar to what is done at the session start);
  - recorder has its own sync.RWMutex. This mutex is used to replace the session recorder. It is dedicated to the recorder to avoid locking the session struct when emitting events;
- lib/events/filesessions/filestream.go: Adds ReserveUploadPart, which will create a file and fill it with the minimum bytes required by the stream (this size is defined using the stream constants).

Next steps: