Add session cleanup audit logging #122419
Conversation
Pinging @elastic/kibana-security (Team:Security)
Some comments below.
Nice job on this so far, and I like the new audit-related TSdocs!
```ts
const { body: searchResponse } =
  await this.options.elasticsearchClient.search<SessionIndexValue>(
    {
      index: this.indexName,
      body: { query: { bool: { should: deleteQueries } } },
      _source_includes: 'usernameHash,provider',
    },
    { ignore: [409, 404] }
  );
}
```
When we spoke about this last week we talked a bit about scale:
It's not very probable, but if an operator has Kibana configured to clean up sessions very infrequently and/or there are tons of users in a multi-Kibana setup, we could have lots of sessions to clean up, potentially hundreds of thousands.
I believe I suggested generating some test sessions in ES and, in a single task, attempting to delete up to 100k sessions (paging through 10k at a time, which is the max for any given search). This sounded like a reasonable starting point and I thought we should see how long it takes Kibana to do that -- we don't want to "clog up" the task manager, so we might need to dial that down depending on how long it takes (could ask folks in the Alerting team for input regarding this). Anything over that limit would simply be left for the next cleanup run to deal with.
Here are the docs for pagination:
> By default, searches return the top 10 matching hits. To page through a larger set of results, you can use the search API's `from` and `size` parameters. The `from` parameter defines the number of hits to skip, defaulting to `0`. The `size` parameter is the maximum number of hits to return. Together, these two parameters define a page of results.
>
> ...
>
> By default, you cannot use `from` and `size` to page through more than 10,000 hits. This limit is a safeguard set by the `index.max_result_window` index setting. If you need to page through more than 10,000 hits, use the `search_after` parameter instead.
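For illustration, here is a minimal sketch of what `from`/`size` paging would look like in this code path (reusing `this.indexName`, `deleteQueries`, and the client from the snippet above; the page size is a placeholder). Note that this style can never reach past the 10k `index.max_result_window` cap, which is why `search_after` is discussed below:

```ts
// Hypothetical from/size paging: hard-capped at index.max_result_window (10k by default).
const PAGE_SIZE = 1_000; // placeholder page size
for (let from = 0; from + PAGE_SIZE <= 10_000; from += PAGE_SIZE) {
  const { body: searchResponse } =
    await this.options.elasticsearchClient.search<SessionIndexValue>({
      index: this.indexName,
      body: { query: { bool: { should: deleteQueries } } },
      _source_includes: 'usernameHash,provider',
      from,
      size: PAGE_SIZE,
    });
  if (searchResponse.hits.hits.length === 0) {
    break; // no more expired sessions
  }
  // ...collect this page of hits for deletion...
}
```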
The way this code is currently written, Kibana will only find/delete at most 10 sessions during the cleanup phase.
This might be overkill, we could consider just finding/deleting 10k sessions per cleanup (240k per day by default) and leaving it at that. It would probably be fine for 99.99% of our users today, but I wonder if that will still be true 1, 2, or 5 years from now. Also, we can never predict exactly how users will (ab)use our configuration options 😄 we might have some folks who change their session cleanup interval to 30 days!
The pagination approach is more work, but it should allow us to delete up to an order of magnitude more expired sessions during each cleanup, and my gut feeling is that is the better thing to do.
Also, it might not be a bad idea to collect usage data on how many sessions are getting created / deleted every day.
Ah, good point, thanks! I mixed up the default with the maximum pagination size.
I've added pagination now so we're removing sessions in batches.
Sorry for the confusion on this...
Unfortunately you haven't actually added pagination since you're just using the `size` option. If there are 10,001 hits you've got an infinite loop because 1. you'll keep fetching the first 10k hits each time it loops, and 2. you are using `hits.total`, which is the grand total for all hits (it's not the total for what was returned in this page).

You have to use either of these combinations:

- `size` and `from` (for < 10k hits)
- `size` and `search_after` (for any number of hits)

In either case you should really use a point-in-time (PIT), as specified in the search docs for pagination using `search_after`. If you don't do that, the index state could change between pages and your results will be incorrect.

In this situation we are concerned with there potentially being hundreds of thousands of results, so you should use `size` and `search_after`.
Also, I don't think you should use a page size of 1k, the size can be up to 10k. Since you aren't fetching a lot of fields, and this isn't a hot code path, I'd suggest using a size of 10k.
So you'll need to change this to:
- Create a PIT
- Page over 10k results at a time (up to 100k)
- Log a warning if there are over 100k total hits
- Clean up the PIT
- Delete the expired sessions
I'd also like to see data on how this actually performs; we assume that 100k is a reasonable limit (I'd say that shouldn't take more than ~20s total) but we aren't certain. If you can generate some fake session data and run it locally to see how long the cleanup would take, that would be great. You could use `console.time` and `console.timeEnd` for that; a rough sketch of the whole flow follows.
Hold on, I'm deleting the items that are being returned by the first batch. So next time I run the search I'm getting the next 1,000 sessions. I'm repeating that until there are no more sessions left in the index. Am I misunderstanding this?
Oh, I skimmed your changes and misread -- you're searching and deleting in each loop.
I had it in my head to page through all the objects then delete. Sorry about that. That's what I was implying with `refresh: false`, because if you don't refresh the index then your subsequent searches won't return fresh results.

You don't want to loop it like it's currently structured (search + delete in a single iteration) because then you'll be forced to use `refresh: 'wait_for'` each time you delete, which will really hurt performance.
EDIT: for that matter, it would probably be a good idea to add a manual index refresh after the bulk-delete.
EDIT 2: the docs state:
> By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Given that, I'm actually not sure if there will be any benefit to adding a manual index refresh, since we're the only ones searching the session index, and we only do this during cleanup. The index appears to automatically refresh shortly after the cleanup process completes.
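For reference, a sketch of what one batch's bulk delete might look like with `refresh: false` (the `hits` array, the ignored error codes, and the `BulkOperationContainer` type — imported in the diff later in this thread — are assumptions based on the snippets here):

```ts
// Bulk-delete one batch of expired sessions without forcing an index refresh.
const deleteOperations: BulkOperationContainer[] = hits.map((hit) => ({
  delete: { _index: hit._index, _id: hit._id },
}));

await this.options.elasticsearchClient.bulk(
  { body: deleteOperations, refresh: false },
  { ignore: [409, 404] }
);
```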
I mentioned that this needs to be tested with a large number of expired sessions to make sure it works. I put together a sample archive of 100,005 expired sessions:
expired-sessions-archive.zip
First, I added some console logs to the cleanup code.
Then, I created this `kibana.yml` file:
```yaml
elasticsearch.hosts: ["http://localhost:9200"]
elasticsearch.username: kibana
elasticsearch.password: changeme
xpack.security.session.cleanupInterval: "2m"
```
Then I started ES with `yarn es snapshot` and started Kibana with `node scripts/kibana -c /path/to/kibana.yml` (to run this in a production-like mode, without the base path proxy server).
Kibana immediately cleans up sessions when it starts, but I had to wait for it to finish doing fleet setup (it indexes a large amount of data into ES when it does that).
I waited about 2m and the cleanup happened again; at that point I loaded in the test archive:

```sh
node scripts/es_archiver.js --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601 load /path/to/expired-sessions-archive --docs-only
```
It takes a while (~40-50 seconds) for that to complete.
Finally after another 2m had passed, the session cleanup happened again and 100k expired sessions got deleted. Another 2m later and the remaining 5 expired sessions got deleted:
```
[2022-01-12T11:47:59.367-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.012-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.361-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.694-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.956-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.234-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.502-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.764-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:02.071-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:02.654-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
SESSION CLEANUP: elapsed time 3.671s
[2022-01-12T11:50:02.284-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 5 sessions...
SESSION CLEANUP: elapsed time 71.2ms
```
I had to make some tweaks to get this working (see comments below). Overall nice job on this, it works really well and it's even faster than I expected 🎉
```ts
// Create point in time snapshot to paginate through sessions
const { body: pitResponse } = await this.options.elasticsearchClient.openPointInTime({
  index: this.indexName,
  keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE,
});
```
You need to close the PIT after you are finished searching (`elasticsearchClient.closePointInTime`); a minimal sketch follows.
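For example (a sketch, matching the body-key client style used in this PR), wrapping the paging in `try`/`finally` guarantees the PIT is released even when a search fails:

```ts
const { body: pitResponse } = await this.options.elasticsearchClient.openPointInTime({
  index: this.indexName,
  keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE,
});
try {
  // ...page through the expired sessions using pitResponse.id...
} finally {
  await this.options.elasticsearchClient.closePointInTime({
    body: { id: pitResponse.id },
  });
}
```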
```ts
// We don't know the total number of sessions until we fetched the first batch so assume we have at least one session to clean.
let total = 1;
let searchAfter: SortResults | undefined;
for (let i = 0; i < total / SESSION_INDEX_CLEANUP_BATCH_SIZE; i++) {
  const { body: searchResponse } =
    await this.options.elasticsearchClient.search<SessionIndexValue>({
      pit: { id: pitResponse.id, keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE },
      _source_includes: 'usernameHash,provider',
      query: { bool: { should: deleteQueries } },
      search_after: searchAfter,
      size: SESSION_INDEX_CLEANUP_BATCH_SIZE,
      sort: '_shard_doc',
    });
  if (searchResponse.hits.hits.length > 0) {
    yield searchResponse.hits.hits;
    total = (searchResponse.hits.total as SearchTotalHits).value;
    searchAfter = searchResponse.hits.hits[searchResponse.hits.hits.length - 1].sort;
  }
```
I tested this and if you have > 10k sessions, this loop terminates after one iteration, because the `total` is > 10k.

I don't think you need to use the `total` here at all; in fact, after looking into this a bit more, the total hits is not accurate for amounts > 10k anyway, and we get a performance improvement if we disable it.

Furthermore, this loop has no limit; as I mentioned above, we need some sort of limit.
I suggest changing the loop like so, I tested this and it worked well:
```diff
diff --git a/x-pack/plugins/security/server/session_management/session_index.ts b/x-pack/plugins/security/server/session_management/session_index.ts
index 037c47ca791..554d917b834 100644
--- a/x-pack/plugins/security/server/session_management/session_index.ts
+++ b/x-pack/plugins/security/server/session_management/session_index.ts
@@ -7,7 +7,6 @@
 import type {
   BulkOperationContainer,
-  SearchTotalHits,
   SortResults,
 } from '@elastic/elasticsearch/lib/api/typesWithBodyKey';
@@ -48,6 +47,12 @@ const SESSION_INDEX_TEMPLATE_VERSION = 1;
  */
 const SESSION_INDEX_CLEANUP_BATCH_SIZE = 10_000;
 
+/**
+ * Maximum number of batches per cleanup.
+ * If the batch size is 10,000 and this limit is 10, then Kibana will remove up to 100k sessions per cleanup.
+ */
+const SESSION_INDEX_CLEANUP_BATCH_LIMIT = 10;
+
 /**
  * Number of sessions to remove per batch during cleanup.
  */
@@ -559,10 +564,8 @@ export class SessionIndex {
       keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE,
     });
 
-    // We don't know the total number of sessions until we fetched the first batch so assume we have at least one session to clean.
-    let total = 1;
     let searchAfter: SortResults | undefined;
-    for (let i = 0; i < total / SESSION_INDEX_CLEANUP_BATCH_SIZE; i++) {
+    for (let i = 0; i < SESSION_INDEX_CLEANUP_BATCH_LIMIT; i++) {
       const { body: searchResponse } =
         await this.options.elasticsearchClient.search<SessionIndexValue>({
           pit: { id: pitResponse.id, keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE },
@@ -571,11 +574,15 @@
           search_after: searchAfter,
           size: SESSION_INDEX_CLEANUP_BATCH_SIZE,
           sort: '_shard_doc',
+          track_total_hits: false, // for performance
         });
-      if (searchResponse.hits.hits.length > 0) {
-        yield searchResponse.hits.hits;
-        total = (searchResponse.hits.total as SearchTotalHits).value;
-        searchAfter = searchResponse.hits.hits[searchResponse.hits.hits.length - 1].sort;
+      const { hits } = searchResponse.hits;
+      if (hits.length > 0) {
+        yield hits;
+        searchAfter = hits[hits.length - 1].sort;
+        if (hits.length < SESSION_INDEX_CLEANUP_BATCH_SIZE) {
+          break;
+        }
       }
     }
   }
```
In addition: can you please add two unit tests -- one for 10,001 sessions to demonstrate that the loop terminates early, and one for 100,001 sessions to demonstrate that the loop doesn't exceed the configured limit. Something like the sketch below could work.
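A rough jest sketch of those two tests. The `mockClient`, the `makeHits` helper, and the `sessionIndex.cleanUp()` call are assumptions about the surrounding test setup (the mock would also need `openPointInTime`, `closePointInTime`, and `bulk` stubbed):

```ts
// Builds n fake search hits with sort values so search_after can advance.
const makeHits = (n: number) =>
  Array.from({ length: n }, (_, i) => ({
    _index: '.kibana_security_session_1',
    _id: `session-${i}`,
    sort: [i],
    _source: { usernameHash: `hash-${i}`, provider: { type: 'basic', name: 'basic1' } },
  }));

it('stops paging after a partial batch (10,001 sessions)', async () => {
  const batches = [10_000, 1]; // one full batch, then a partial one
  mockClient.search.mockImplementation(async () => ({
    body: { hits: { hits: makeHits(batches.shift() ?? 0) } },
  }));

  await sessionIndex.cleanUp();

  expect(mockClient.search).toHaveBeenCalledTimes(2);
});

it('never exceeds the batch limit (100,001 sessions)', async () => {
  // Every page returns a full batch, so only the limit stops the loop.
  mockClient.search.mockResolvedValue({
    body: { hits: { hits: makeHits(10_000) } },
  });

  await sessionIndex.cleanUp();

  expect(mockClient.search).toHaveBeenCalledTimes(10); // SESSION_INDEX_CLEANUP_BATCH_LIMIT
});
```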
```ts
/**
 * Number of sessions to remove per batch during cleanup.
 */
const SESSION_INDEX_CLEANUP_KEEP_ALIVE = '5m';
```
```suggestion
/**
 * How long the session cleanup search point-in-time should be kept alive.
 */
const SESSION_INDEX_CLEANUP_KEEP_ALIVE = '5m';
```
I ran it again with the existing debug logs enabled. This time I also enabled audit logs. It was almost as fast, took about 5s to delete 100k sessions.
```
[2022-01-12T15:13:27.195-05:00][DEBUG][plugins.security.session.index] Running cleanup routine.
[2022-01-12T15:13:28.214-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:28.768-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:29.197-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:29.614-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.004-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.453-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.857-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:31.301-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:31.714-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:32.167-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:15:30.237-05:00][DEBUG][plugins.security.session.index] Running cleanup routine.
[2022-01-12T15:15:30.302-05:00][DEBUG][plugins.security.session.index] Cleaned up 5 invalid or expired sessions.
```
Again, nice job 😄
💚 Build Succeeded
* Add session cleanup audit logging
* Update snapshots
* Added suggestions from code review
* Clean up sessions in batches
* Added suggestions from code review

(cherry picked from commit 39cef8b)
💔 Some backports could not be created

How to fix: re-run the backport manually:

Note: successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
We are going to keep hitting issues attempting to backport any audit log changes to the 7.x branch due to:

If it's a business priority to backport to 7.x it can be done, but I'd suggest focusing on 8.x for new audit logging capabilities.
Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.
* Add session cleanup audit logging (#122419)
  * Add session cleanup audit logging
  * Update snapshots
  * Added suggestions from code review
  * Clean up sessions in batches
  * Added suggestions from code review
  (cherry picked from commit 39cef8b)
* Fix types

Co-authored-by: Kibana Machine <[email protected]>
Resolves: #119490

Summary

Add audit logging events for session cleanup.

Testing

Check `logs/audit.log` after 40s.