
add relative standard deviation to aggregated test execution metrics #681

Merged
4 commits merged on Oct 25, 2024

Conversation

OVI3D0
Member

@OVI3D0 OVI3D0 commented Oct 14, 2024

Description

Adds relative standard deviation (RSD) to aggregated test result metrics, so users can see the spread of their OSB test results across executions. (RSD is currently calculated only for the mean.)

Example:

   {
    "task": "wait-until-merges-finish",
    "operation": "wait-until-merges-finish",
    "throughput": {
     "min": 74.41136845553248,
     "mean": 74.41136845553248,
     "median": 74.41136845553248,
     "max": 74.41136845553248,
     "unit": "ops/s",
     "mean_rsd": 5.7260883182218825
    },
    "latency": {
     "100_0": 12.996128527447581,
     "mean": 12.996128527447581,
     "unit": "ms",
     "mean_rsd": 5.875897232448762
    },
    "service_time": {
     "100_0": 12.996128527447581,
     "mean": 12.996128527447581,
     "unit": "ms",
     "mean_rsd": 5.875897232448762
    },
    "client_processing_time": {
     "100_0": 0.6900834268890321,
     "mean": 0.6900834268890321,
     "unit": "ms",
     "mean_rsd": 1.172341871616217
    },
    "processing_time": {
     "100_0": 13.791589008178562,
     "mean": 13.791589008178562,
     "unit": "ms",
     "mean_rsd": 5.595395107897675
    },
    "error_rate": 0.0,
    "error_rate_rsd": 0,
    "duration": 3.3562859753146768,
    "duration_rsd": 1.6848600926145232
   },
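The output above suggests RSD here is the coefficient of variation, i.e. the standard deviation divided by the mean, expressed as a percentage, computed across test executions. A minimal standalone sketch (not the PR's actual implementation; the input values are hypothetical):

```python
import statistics

def relative_standard_deviation(values):
    # RSD (coefficient of variation) as a percentage:
    # sample standard deviation / mean * 100
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0
    return (statistics.stdev(values) / mean) * 100

# Hypothetical throughput means from three test executions:
print(relative_standard_deviation([70.1, 74.4, 78.8]))
```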

Issues Resolved

#662

Testing

  • New functionality includes testing

make test


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

osbenchmark/aggregator.py — two outdated review threads (resolved)
Comment on lines 128 to 137
for metric in self.metrics:
    op_metric[metric] = aggregated_task_metrics[metric]
    if isinstance(aggregated_task_metrics[metric], dict):
        mean_values = [v['mean'] for v in task_metrics[metric]]
        rsd = self.calculate_rsd(mean_values)
        op_metric[metric]['mean_rsd'] = rsd
    else:
        rsd = self.calculate_rsd(task_metrics[metric])
        op_metric[f"{metric}_rsd"] = rsd

Collaborator
Some comments will be helpful here -- the first clause is for the standard metrics, the second for derived ones like error rate and duration, etc.

The documentation should indicate the rationale as to why RSD is computed for the mean, instead of the median.
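To illustrate the distinction the reviewer is drawing, here is a standalone, commented sketch of the loop (the class context and the PR's actual `calculate_rsd` are simplified away; the RSD helper below is an assumption inferred from the example output, not the merged code):

```python
import statistics

def calculate_rsd(values):
    # Assumed helper: coefficient of variation as a percentage.
    mean = statistics.mean(values)
    return (statistics.stdev(values) / mean) * 100 if mean else 0.0

def aggregate(metrics, aggregated_task_metrics, task_metrics):
    op_metric = {}
    for metric in metrics:
        op_metric[metric] = aggregated_task_metrics[metric]
        if isinstance(aggregated_task_metrics[metric], dict):
            # Standard metrics (throughput, latency, ...): each test
            # execution contributes a dict of statistics, so RSD is
            # computed over the per-execution means and attached
            # inside the dict as "mean_rsd".
            mean_values = [v["mean"] for v in task_metrics[metric]]
            op_metric[metric]["mean_rsd"] = calculate_rsd(mean_values)
        else:
            # Derived scalar metrics (error_rate, duration): each
            # execution contributes a single number, so RSD becomes a
            # sibling key named "<metric>_rsd".
            op_metric[f"{metric}_rsd"] = calculate_rsd(task_metrics[metric])
    return op_metric
```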

Member Author

I'll also speak to Nate about these changes 👍

Comment on lines 206 to 212
 if iterations > 1:
     weighted_sum = sum(value * iterations for value in values)
     total_iterations = iterations * len(values)
-    weighted_metrics[metric] = weighted_sum / total_iterations
+    weighted_avg = weighted_sum / total_iterations
 else:
-    weighted_metrics[metric] = sum(values) / len(values)
+    weighted_avg = sum(values) / len(values)
+weighted_metrics[metric] = weighted_avg
Collaborator

Why the special case for iterations == 1? Doesn't this reduce to the same computation? And if the iteration number is the same for every element in values, the special-casing might not be needed.
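The reviewer's observation can be checked directly: when every value shares the same iteration count, the weight cancels out and the weighted average reduces to the plain mean. A standalone sketch of that arithmetic (not the PR's code):

```python
def weighted_average(values, iterations):
    # sum(v * i) / (i * n) == i * sum(v) / (i * n) == sum(v) / n,
    # so a shared iteration count cancels out of the weighted average.
    weighted_sum = sum(value * iterations for value in values)
    total_iterations = iterations * len(values)
    return weighted_sum / total_iterations

# Identical to the plain mean for any shared iteration count:
print(weighted_average([10.0, 20.0, 40.0], 5))
```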

Member Author

@OVI3D0 OVI3D0 Oct 22, 2024

The special case is fixed, but I want to include support for custom iteration values in the future, so I think that should stay for now.

osbenchmark/aggregator.py — outdated review thread (resolved)
Signed-off-by: Michael Oviedo <[email protected]>
    return weighted_metrics

def calculate_rsd(self, values):
    if not values:
        raise ValueError("Cannot calculate RSD for an empty list of values")
Collaborator

@IanHoang IanHoang Oct 23, 2024

It'd be nice for error messages to also include which metric OSB was unable to calculate RSD for. Otherwise, users will have to add logging statements to find out, which can be a nuisance.

We can easily do this by adding the metric name as a parameter and including it in the ValueError message.

Member Author

Good idea. I added this 👍

@@ -214,9 +214,9 @@ def calculate_weighted_average(self, task_metrics: Dict[str, List[Any]], iterati

return weighted_metrics

-def calculate_rsd(self, values):
+def calculate_rsd(self, values, metric_name: str):
Collaborator

Nit: We should also include a type hint for values if we are providing one for metric_name.
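Taken together, the two review comments suggest a signature along these lines. This is a hypothetical sketch, not the merged code: the exact error-message wording and the use of sample standard deviation are assumptions.

```python
import statistics
from typing import List

def calculate_rsd(values: List[float], metric_name: str) -> float:
    # Relative standard deviation of `values` as a percentage, naming
    # the metric in the error so users can tell which one failed.
    # Sample stdev assumes at least two test executions.
    if not values:
        raise ValueError(
            f"Cannot calculate RSD for metric '{metric_name}': "
            "empty list of values"
        )
    mean = statistics.mean(values)
    return (statistics.stdev(values) / mean) * 100 if mean != 0 else 0.0
```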

@OVI3D0 OVI3D0 requested a review from IanHoang October 24, 2024 16:07
Collaborator

@gkamat gkamat left a comment

You may want to change the 100_0 to max in the results summary, either in this or a future check-in.

@OVI3D0
Member Author

OVI3D0 commented Oct 25, 2024

You may want to change the 100_0 to max in the results summary, either in this or a future check-in.

Should this be done for regular test executions as well? 100_0 is used there too. Example:

    "latency": {
     "50_0": 206.59688568115234,
     "100_0": 235.18186950683594,
     "mean": 203.5783863067627,
     "unit": "ms"
    },

@IanHoang IanHoang merged commit d69ff25 into opensearch-project:main Oct 25, 2024
10 checks passed
@OVI3D0 OVI3D0 deleted the add-rsd branch October 28, 2024 16:43