
SSH Session recording modes #12916
Merged (27 commits) Jun 6, 2022

Conversation

@gabrielcorado (Contributor) commented May 25, 2022

Implements the Session recording modes RFD for SSH sessions.

The overall change consists of reserving disk space for audit events and deciding how to proceed if the reservation fails:

  • For strict mode, stop the session immediately;
  • For best-effort mode, keep the session and discard session data events, avoiding disk I/O.

Note: This PR is missing an integration test. I'm still working on it, and it will be pushed to this PR when ready.

Here is an overview of what was changed in each file:

  • lib/events/stream.go:
    • ProtoStream now calls ReserveUploadPart when initializing its internal slice. With this change, it can identify disk issues before emitting events;
    • ProtoStream initializes the first slice during its own initialization instead of when receiving the first event. With this change, resuming or creating new streams will fail immediately;
    • Failures on the newSlice function will cause the stream to be closed;
  • lib/events/auditwriter.go: Depending on which error the AuditWriter receives from the stream, it will not try to recover it.
  • lib/srv/sess.go
    • Changed the recorder initialization. Now there is a fallback that is used when the initial audit writer fails during initialization, and the session is set to best_effort;
    • All events emitted on the session now use the emitAuditEvent function. This function is capable of replacing the recorder (similar to what is done at the session start);
    • The recorder has its own sync.RWMutex. This mutex is used to replace the session recorder. It is dedicated to the recorder to avoid locking the session struct when emitting events;
  • lib/events/filesessions/filestream.go
    • Implements the ReserveUploadPart, which will create a file and fill it with the minimum bytes required by the stream (this size is defined using the stream constants);

Next steps:

  • When the session is terminated, some upload parts may be left in the folder. The initial idea is to create a routine that tries to upload those files. This will be done in a separate PR;
  • Implement session recording mode in Kubernetes sessions.

@gabrielcorado gabrielcorado self-assigned this May 25, 2022
@github-actions github-actions bot requested review from espadolini and tcsc May 25, 2022 18:15
@github-actions github-actions bot added the audit-log (Issues related to Teleport's Audit Log) label May 25, 2022
@Joerger (Contributor) commented May 25, 2022

Next steps:

When the session is terminated, some upload parts may be left in the folder. The initial idea is to create a routine that tries to upload those files. This will be done in a separate PR;

This should be handled already by the UploadCompleter, which runs on each recording service (node, proxy, etc.) to complete any abandoned uploads. It currently waits a full 24 hours before attempting to complete an upload, but once we can deprecate that grace period in v11, the completer will instead use the SessionTrackerService to determine if an upload is abandoned - #11551

api/types/types.proto (resolved)
lib/events/auditwriter.go (resolved)
lib/events/stream.go (resolved)
lib/services/role.go (resolved)
@@ -1532,6 +1532,44 @@ func (set RoleSet) CertificateExtensions() []*types.CertExtension {
return exts
}

// SessionRecordingMode returns the recording mode for a specific service.
func (set RoleSet) SessionRecordingMode(service constants.SessionRecordingService) constants.SessionRecordingMode {
defaultValue := constants.SessionRecordingModeBestEffort
Contributor:

Should this default to Strict so that BestEffort is an opt-in feature?

@smallinsky (Contributor) May 31, 2022:

In the RFD we decided to use best-effort:

In addition, there will be a default entry for setting a default value used when sessions kind doesn't have one set. Currently, Teleport doesn't prevent sessions from happening if there is a failure on the recording. To keep this behavior after introducing the recording modes, the default value will be set to best_effort.

to preserve current behaviour. I agree that we should consider setting strict mode as the default, to ensure that audit logs from the session will be emitted by default.

@r0mant @gabrielcorado WDYT ?

Collaborator:

We discussed it on the product meeting before and decided to default to best-effort to keep existing behavior for users.

@Joerger (Contributor) May 31, 2022:

I believe that Strict is the current behavior though -

teleport/lib/srv/sess.go

Lines 529 to 535 in e8bfe2c

sess.io.OnWriteError = func(idString string, err error) {
if idString == sessionRecorderID {
sess.log.Error("Failed to write to session recorder, stopping session.")
// stop in goroutine to avoid deadlock
go sess.Stop()
}
}

Previously a session recording error would freeze the session, and with my recent changes it terminates the session.

Collaborator:

If we default to strict, then people will still get locked out of their nodes when they run out of disk space (which was the primary motivator for these changes) by default. IIRC this was the reason we chose best-effort as default.

@Joerger (Contributor) May 31, 2022:

Hmm, I agree that best-effort would be a more reasonable default, but the issue remains that this changes the default behavior, which is strict; it seems this decision was made under the presumption that the current behavior is best-effort.

Can we update the RFD and make sure to express this default behavior change in the release notes? Like Marek said, the current assumption by users would be that an audit write failure, due to disk issues or otherwise, would result in the session being frozen/terminated.

Alternatively we could make old roles default to strict and new roles default to best-effort.

lib/srv/sess.go (resolved)
// Open a BPF recording session. If BPF was not configured, not available,
// or running in a recording proxy, OpenSession is a NOP.
s.bpfContext = &bpf.SessionContext{
Context: ctx.srv.Context(),
PID: s.term.PID(),
Emitter: s.recorder,
Emitter: s.Recorder(),
Contributor:

How will enhanced session recording respond to the recorder being closed? Should it be included in the best-effort approach?

Contributor (author):

For best-effort scenarios, it is not affected. The only thing is that it will still try to emit events even after the session has entered "no recording mode", producing warning messages for the rest of the session.

The session will still be closed in strict mode when emitting data events (which still happen even when the session has enhanced recording).

@smallinsky (Contributor) left a comment:

First pass. I have left some comments.
Could you also take a look at the unit tests? They currently fail with:

lib/services/role.go:1552:18: undefined: constants.SessionRecordingServiceKubernetes
lib/services/role.go:1553:29: recordSession.Kubernetes undefined (type *"github.com/gravitational/teleport/api/types".RecordSession has no field or method Kubernetes)

api/constants/constants.go (resolved)
Comment on lines 274 to 277
_, err := rand.Read(buf)
if err != nil {
return trace.Wrap(err)
}
Contributor:

Do we need to fill the buf with random data?

Contributor (author):

I wasn't sure if writing an "empty" buffer would work, since I had tested with Truncate and it doesn't allocate space. After some tests, it is working as expected. I'll remove this part; we can discuss this further if anything comes up in the integration tests.

lib/events/s3sessions/s3stream.go (resolved)
lib/events/auditwriter_test.go (resolved)

err := w.proto.cfg.Uploader.ReserveUploadPart(w.proto.cancelCtx, w.proto.cfg.Upload, w.lastPartNumber)
if err != nil {
return nil, trace.ConnectionProblem(err, uploaderReservePartErrorMessage)
Contributor:

Is ReserveUploadPart really a ConnectionProblem-related error?

Contributor (author):

I couldn't find a better trace error for this. Do you have any suggestions?

In addition, some uploader implementations, for example, the S3 uploader, will probably raise a connection problem on the ReserveUploadPart calls.

return eventOnlyRec, nil
}

return nil, trace.ConnectionProblem(err, sessionRecordingErrorMessage)
Contributor:

Same here. Is an error from events.NewAuditWriter really a ConnectionProblem kind of error?

lib/srv/sess.go (resolved)
Comment on lines 1747 to 1748
if idString == sessionRecorderID {
switch s.scx.Identity.RoleSet.SessionRecordingMode(constants.SessionRecordingServiceSSH) {
Contributor:

idString == sessionRecorderID means the error originated from the session recorder, but in s.BroadcastSystemMessage we will retry the write; consequently, the broadcast's Write call for sessionRecorderID will also fail and invoke onWriteError a second time, creating an infinite callback loop: BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> BroadcastSystemMessage -> onWriteError(sessionRecorderID) -> ...

@gabrielcorado (Contributor, author) May 30, 2022:

The TermManager Write/BroadcastMessage functions take a lock to access the writers list, meaning that if we directly called a writer inside this callback, it would deadlock; that's why both BroadcastSystemMessage and Close are called inside a goroutine.

After that, when a writer errors, it is removed from the TermManager writers list; since we have only one sessionRecorderID, this callback can only be called once.

lib/srv/sess.go (resolved)
if err != nil {
return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
}
s.setRecorder(newRecorder)
Contributor:

Don't we need to close the old recorder (rec.Close())?

Contributor (author):

It is not needed, since the replacement only happens when the recorder is already canceled (closed).

Comment on lines +1732 to +1736
case <-rec.Done():
newRecorder, err := newEventOnlyRecorder(s, s.scx)
if err != nil {
return trace.ConnectionProblem(err, "failed to recreate audit events recorder")
}
Contributor:

How do we know that the recorder was closed due to the uploaderReservePart error?

Contributor (author):

Here we don't need to know why it was closed. This function will just try to emit the events even if the current recorder is closed. We could rename it to tryEmitAuditEvent to make that clearer.

@gabrielcorado (Contributor, author) commented:

This should be handled already by the UploadCompleter, which runs on each recording service (node, proxy, etc.) to complete any abandoned uploads. It currently waits a full 24 hours before attempting to complete an upload, but once we can deprecate that grace period in v11, the completer will instead use the SessionTrackerService to determine if an upload is abandoned - #11551

@Joerger Right. Do you think it is worth changing the UploadCompleter to deal with disk issues as well?

@Joerger (Contributor) commented May 31, 2022

Do you think that it is worth changing UploadCompleter to deal with disk issues as well?

Sure, though I'm not sure it's necessary. Right now it just keeps attempting to upload abandoned recordings (every 10 minutes) regardless of any errors it runs into. Would you instead delete the recording from disk when encountering certain errors from CompleteUpload, and which errors would those be?

@espadolini espadolini removed their request for review June 1, 2022 11:19
@smallinsky (Contributor) left a comment:

Left some nit comments, but it mostly looks good to me once the remaining comments are addressed. @Joerger @jakule Could you also take a look?

lib/events/filesessions/filestream.go (resolved)
lib/events/filesessions/filestream.go (resolved)
lib/events/stream.go (resolved)
lib/utils/fs.go (resolved)
lib/events/stream.go (resolved)
lib/events/stream.go (resolved)
@Joerger (Contributor) left a comment:

LGTM so long as we update the RFD to reflect that defaulting to best_effort is an intentional behavior change.

@gabrielcorado gabrielcorado enabled auto-merge (squash) June 6, 2022 17:20