perf(ecs): Narrowing the cache search for the ECS provider on views #6256

Merged on Sep 25, 2024 (2 commits)

Conversation

Member

@christosarvanitis christosarvanitis commented Aug 6, 2024

Attempt to address some of the issues described in spinnaker/spinnaker#6084

Improves response times on the following endpoints when ECS is enabled and a substantial number of accounts/services exist in cache:

  • /clusters
  • /applications
  • /serverGroups

The perf issue with the alarms still exists and will be addressed in a future PR.

Adding some results from a performance test of clouddriver response times:

  • GET {CLOUDDRIVER_URL}/applications
    • Average: 104ms → 92.1ms (11% improvement)
    • 95th Percentile: 130ms → 126ms
  • GET {CLOUDDRIVER_URL}/applications/{application_name}
    • Average: 7.48s → 4.2s (43% improvement)
    • 95th Percentile: 8.55s → 5.86s
  • GET {CLOUDDRIVER_URL}/applications/{application_name}/serverGroups
    • Average: 2.72s → 2.16s (20% improvement)
    • 95th Percentile: 3.17s → 3.11s
  • GET {CLOUDDRIVER_URL}/applications/{application_name}/clusters
    • Average: 107ms → 43.3ms (59% improvement)
    • 95th Percentile: 135ms → 88.4ms
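
For reference, the percentages quoted above are consistent with the relative reduction (before - after) / before, truncated to a whole percent. A minimal sketch of that arithmetic, using only the averages listed above (the class and method names are illustrative and not part of this PR):

```java
// Illustrative helper, not part of the PR: relative improvement between two timings.
final class ImprovementCalc {
  /** Returns the relative reduction from before to after, as a percentage. */
  static double improvementPercent(double before, double after) {
    return (before - after) / before * 100.0;
  }

  public static void main(String[] args) {
    // Averages copied from the results above; units cancel out within each pair.
    double[][] samples = {{104, 92.1}, {7.48, 4.2}, {2.72, 2.16}, {107, 43.3}};
    for (double[] s : samples) {
      System.out.printf("%.2f -> %.2f: %.1f%%%n", s[0], s[1], improvementPercent(s[0], s[1]));
    }
  }
}
```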

@christosarvanitis
Member Author

@dbyron0 @deverton would appreciate your feedback on this change. There are still improvements to be made: the current ECS implementation goes through every region per account to retrieve the necessary data from cache, which is far from ideal when there are hundreds of accounts.
The main idea here is to limit the retrieval by application name when we can.
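
To illustrate the idea, here is a rough sketch of filtering cache identifiers by application name up front, so that only the matching identifiers are then used to load the (much larger) cached values. The key layout (`ecs;services;account;region;serviceName`), the naming assumption (a service name is the application name or starts with `<application>-`), and every class and method name below are assumptions made for the example, not clouddriver's actual `Keys` format or cache API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the key layout and names are assumptions, not clouddriver's API.
final class NarrowedLookup {
  /** Common prefix of all service keys for one account/region (assumed layout). */
  static String servicePrefix(String account, String region) {
    return String.join(";", "ecs", "services", account, region) + ";";
  }

  /**
   * Keeps only the identifiers whose service name belongs to the application,
   * assuming service names equal the application name or start with "<application>-".
   */
  static List<String> filterByApplication(
      List<String> identifiers, String account, String region, String application) {
    String prefix = servicePrefix(account, region);
    return identifiers.stream()
        .filter(id -> id.startsWith(prefix))
        .filter(
            id -> {
              String service = id.substring(prefix.length());
              return service.equals(application) || service.startsWith(application + "-");
            })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> identifiers =
        List.of(
            "ecs;services;acct-1;us-west-2;shop-prod-v001",
            "ecs;services;acct-1;us-west-2;billing-prod-v003",
            "ecs;services;acct-2;us-west-2;shop-prod-v001");
    // Only identifiers matching account, region, and application are kept.
    System.out.println(filterByApplication(identifiers, "acct-1", "us-west-2", "shop"));
  }
}
```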

The perf of alarms is still a problem: right now it goes through all the alarms and tries to match each one with a service. This will be addressed in a future PR.

@christosarvanitis
Member Author

@dbyron-sf @jasonmcintosh Added some results from internal testing related to this change. Would appreciate any feedback!

@jasonmcintosh
Member

Few minor things but overall looks good.

@christosarvanitis
Member Author

@jasonmcintosh planning to push the Alarm caching/lookup perf improvements as well tomorrow.


Collection<EcsMetricAlarm> allMetricAlarms = getAll(accountName, region);
Member Author

Before: all the alarms for an ECS account/region were fetched and iterated through to match the service, which is extremely costly.

After: the ECS cluster name is added to the alarm's cache key ID for the ECS provider during the caching cycle. We retrieve only the IDs matching ECS account/region/EcsClusterName and then try to match the service.
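
A rough sketch of the narrowed lookup described above: first keep only the alarm IDs for the given account/region/cluster, then match the service against the alarm actions of that smaller set. The ID layout (`account;region;alarm;cluster`, with the cluster as the last segment), the made-up action strings, and all names below are assumptions for illustration, not the actual clouddriver code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: ID layout and names are assumptions, not clouddriver's API.
final class AlarmLookupSketch {
  /** Returns the alarm IDs for one account/region whose last segment is the cluster name. */
  static List<String> idsForCluster(
      List<String> alarmIds, String account, String region, String cluster) {
    String prefix = account + ";" + region + ";";
    String suffix = ";" + cluster;
    return alarmIds.stream()
        // The length check ensures a non-empty alarm segment sits between prefix and suffix.
        .filter(id -> id.length() > prefix.length() + suffix.length())
        .filter(id -> id.startsWith(prefix) && id.endsWith(suffix))
        .collect(Collectors.toList());
  }

  /** Among the narrowed IDs, keeps alarms with at least one action mentioning the service. */
  static List<String> alarmsForService(
      Map<String, List<String>> actionsById,
      String account,
      String region,
      String cluster,
      String serviceName) {
    List<String> candidates =
        idsForCluster(new ArrayList<>(actionsById.keySet()), account, region, cluster);
    return candidates.stream()
        .filter(id -> actionsById.get(id).stream().anyMatch(a -> a.contains(serviceName)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, List<String>> actionsById = new LinkedHashMap<>();
    actionsById.put("acct;us-west-2;alarm-a;cluster-1", List.of("scale:service/cluster-1/shop-v001"));
    actionsById.put("acct;us-west-2;alarm-b;cluster-2", List.of("scale:service/cluster-2/shop-v001"));
    actionsById.put("acct;us-west-2;alarm-c;cluster-1", List.of("scale:service/cluster-1/billing-v001"));
    System.out.println(alarmsForService(actionsById, "acct", "us-west-2", "cluster-1", "shop-v001"));
  }
}
```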

metricAlarms.add(metricAlarm);
continue outLoop;
}
if (metricAlarm.getAlarmActions().stream().anyMatch(action -> action.contains(serviceName))
Member Author

Small refactoring here to make it more readable.

@@ -118,7 +118,13 @@ Map<String, Collection<CacheData>> generateFreshData(Set<MetricAlarm> cacheableM
Map<String, Collection<CacheData>> newDataMap = new HashMap<>();

for (MetricAlarm metricAlarm : cacheableMetricAlarm) {
String key = Keys.getAlarmKey(accountName, region, metricAlarm.getAlarmArn());
String cluster =
metricAlarm.getDimensions().stream()
Member Author

According to the AWS SDK, a CloudWatch alarm for ECS carries two dimensions, which depend on the alarm type:

  • A service alarm contains the ECSCluster and ServiceName dimensions.
  • An Auto Scaling group alarm for an ECS cluster contains the ECSCluster and capacity provider dimensions.

This change includes the ECSClusterName in the cached key ID to make the search less costly.
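
As a sketch of that key construction (not the code in this diff): look up the cluster dimension on the alarm and append its value to the key. The `Dim` class is a minimal stand-in for the SDK's dimension type, the key layout is assumed, and the dimension name is passed in as a parameter because the exact name should be checked against the real alarm data; all names here are hypothetical.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: types, key layout, and names are assumptions for illustration.
final class AlarmKeySketch {
  /** Minimal stand-in for a CloudWatch dimension (a name/value pair). */
  static final class Dim {
    final String name;
    final String value;

    Dim(String name, String value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Returns the value of the first dimension with the given name, if any. */
  static Optional<String> dimensionValue(List<Dim> dimensions, String name) {
    return dimensions.stream().filter(d -> d.name.equals(name)).map(d -> d.value).findFirst();
  }

  /** Builds account;region;arn and appends the cluster name when the dimension is present. */
  static String alarmKey(
      String account, String region, String alarmArn, List<Dim> dimensions, String clusterDim) {
    String base = String.join(";", account, region, alarmArn);
    return dimensionValue(dimensions, clusterDim).map(c -> base + ";" + c).orElse(base);
  }

  public static void main(String[] args) {
    List<Dim> dims = List.of(new Dim("ECSCluster", "cluster-a"), new Dim("ServiceName", "shop-v001"));
    String arn = "arn:aws:cloudwatch:us-west-2:123456789012:alarm:shop-cpu-high"; // example ARN
    System.out.println(alarmKey("acct-1", "us-west-2", arn, dims, "ECSCluster"));
  }
}
```

With the cluster name in the key, a lookup for one cluster can filter on the identifier alone instead of loading and inspecting every alarm in the account/region.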

.setMoniker(moniker);

EcsServerGroup serverGroup = new EcsServerGroup();
if (includeDetails) {
Member Author

includeDetails is false only for getSummaries. The rest of the logic remains the same.
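
A minimal sketch of the pattern, assuming the flag guards the more expensive per-server-group lookups (an assumption, since the diff context above does not show the body of the `if`). The class, field, and loader names are hypothetical and only illustrate skipping the costly work for summaries.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: not the real EcsServerGroup; names are made up for illustration.
final class ServerGroupSketch {
  final String name;
  final List<String> instances;

  private ServerGroupSketch(String name, List<String> instances) {
    this.name = name;
    this.instances = instances;
  }

  /** Invokes the (potentially expensive) loader only when details are requested. */
  static ServerGroupSketch build(
      String name, boolean includeDetails, Supplier<List<String>> instanceLoader) {
    List<String> instances =
        includeDetails ? instanceLoader.get() : Collections.<String>emptyList();
    return new ServerGroupSketch(name, instances);
  }

  public static void main(String[] args) {
    Supplier<List<String>> loader = () -> List.of("task-1", "task-2");
    // Summary: the loader is never invoked. Detailed view: the loader runs once.
    System.out.println(build("shop-prod-v001", false, loader).instances.size());
    System.out.println(build("shop-prod-v001", true, loader).instances.size());
  }
}
```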

@christosarvanitis
Member Author

Following up with the perf improvements from the Alarm changes:
image

The GET serverGroups call for an application dropped from 7+ seconds to 3.5 seconds.

And timings on a single request before the change:
image
And after the change:
image

@jasonmcintosh
Member

Overall I think this looks good ;) Would like one more set of eyes given the changes to the way ECS operates. One concern is how the change to the cache IDs will be cleaned up, since this changes the storage IDs, but that MAY get taken care of by one of the cleanup jobs (need to confirm). Talked in Slack: the IDs should still fit the max column length, so adding that shouldn't impact any database stuff.

@christosarvanitis
Member Author

Thanks @jasonmcintosh! 🚀

but that MAY get taken care of by one of the cleanup jobs (need to confirm)

I have added a test that validates this. The previously cached ECS keys will be evicted and recached with the appended ID for the alarms table.
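
As a rough illustration of why the old keys go away (this is not the actual test or caching agent code): if each caching cycle evicts every existing key that is absent from the freshly generated set, old-format keys without the appended cluster name are never regenerated and so fall into the eviction set. The key strings and names below are assumptions.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: key strings and names are assumptions for illustration.
final class EvictionSketch {
  /** Keys currently in the cache that the latest caching cycle did not regenerate. */
  static Set<String> keysToEvict(Set<String> existingKeys, Set<String> freshKeys) {
    Set<String> stale = new TreeSet<>(existingKeys);
    stale.removeAll(freshKeys);
    return stale;
  }

  public static void main(String[] args) {
    Set<String> existing = Set.of("acct;us-west-2;alarm-a"); // old format, no cluster
    Set<String> fresh = Set.of("acct;us-west-2;alarm-a;cluster-1"); // new format
    // The old-format key is not in the fresh set, so it is scheduled for eviction.
    System.out.println(keysToEvict(existing, fresh));
  }
}
```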

@jasonmcintosh
Member

OK will merge shortly. I'd LIKE to get the release notes updated for this, please!

@jasonmcintosh jasonmcintosh added the ready to merge Approved and ready for a merge label Sep 25, 2024
@mergify mergify bot added the auto merged Merged automatically by a bot label Sep 25, 2024
@jasonmcintosh jasonmcintosh merged commit 3cdf32e into spinnaker:master Sep 25, 2024
21 of 23 checks passed
Labels: auto merged, ready to merge, target-release/1.36
3 participants