Cost per job #93

cmelone · 2024-08-20T20:28:29Z

Closes #75

Computes and stores the following metrics:

- `cpu_cost`: cost of using CPU resources on the node, based on the CPU request of the job
- `mem_cost`
- `cpu_penalty`: penalty factor that represents the over or under allocation of CPU resources
- `mem_penalty`

To normalize the cost of resources within instance types, we'll define cost per resource metrics.

$$\text{Cost per CPU}_i = \frac{C_i \times 0.5}{\text{CPU}_i}$$

$$\text{Cost per RAM}_i = \frac{C_i \times 0.5}{\text{RAM}_i}$$

- $C_i$ is the cost of node $i$ over the life of the job
- $\text{CPU}_i$ is the number of CPUs available on node $i$
- $\text{RAM}_i$ is the amount of RAM available on node $i$

$C_i$ is halved when computing the per-resource costs because we assume that CPU and RAM each account for half of the cost of the node.

$$\text{Job Cost} = (\text{CPU}_{\text{usage}} \times \text{Cost per CPU}_i + \text{RAM}_{\text{usage}} \times \text{Cost per RAM}_i)$$
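The formulas above can be sketched in a few lines of Python. This is only an illustration, not the code in this PR: the `NodeCost` class and `job_cost` function are hypothetical names, and the only assumption beyond the formulas is that RAM usage and node RAM are given in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeCost:
    """Hypothetical container for C_i, CPU_i and RAM_i of one node."""

    cost: float  # C_i: cost of the node over the life of the job
    cpus: float  # CPU_i: CPUs available on the node
    ram: float   # RAM_i: RAM available on the node (same unit as usage)

    @property
    def cost_per_cpu(self) -> float:
        # Half of the node cost is attributed to CPU.
        return self.cost * 0.5 / self.cpus

    @property
    def cost_per_ram(self) -> float:
        # The other half is attributed to RAM.
        return self.cost * 0.5 / self.ram


def job_cost(node: NodeCost, cpu_usage: float, ram_usage: float) -> float:
    """Base job cost: usage of each resource times its per-unit cost."""
    return cpu_usage * node.cost_per_cpu + ram_usage * node.cost_per_ram
```

Exposing the per-resource rates separately keeps them available as a basis for later changes to the cost formula.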

Using this base cost per job metric, jobs are rewarded for minimizing usage and wall time. However, it does not penalize them for disruptions to the cluster caused by misallocation.

Underallocation can potentially slow down other jobs on the same node, and overallocation delays scheduling of other jobs. A penalty factor would be useful for quantifying negative impacts to the CI system and encouraging better resource requests.

$$\text{P}_{\text{CPU}} = |\text{CPU}_{usage}-\text{CPU}_{request}|$$

$$\text{P}_{\text{RAM}} = |\text{RAM}_{usage}-\text{RAM}_{request}|$$

With the penalty, cost per job would be:

$$((\text{CPU}_{\text{usage}} + \text{P}_{\text{CPU}}) \times \text{Cost per CPU}_i + (\text{RAM}_{\text{usage}} + \text{P}_{\text{RAM}}) \times \text{Cost per RAM}_i )$$
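A minimal sketch of the penalized cost, assuming the per-resource costs have already been computed for the node. The function names are hypothetical and chosen for illustration only.

```python
def allocation_penalty(usage: float, request: float) -> float:
    """P = |usage - request|: over- and under-allocation count equally."""
    return abs(usage - request)


def penalized_job_cost(
    cpu_usage: float,
    cpu_request: float,
    ram_usage: float,
    ram_request: float,
    cost_per_cpu: float,
    cost_per_ram: float,
) -> float:
    """Job cost with the penalty added to usage before pricing each resource."""
    p_cpu = allocation_penalty(cpu_usage, cpu_request)
    p_ram = allocation_penalty(ram_usage, ram_request)
    return (cpu_usage + p_cpu) * cost_per_cpu + (ram_usage + p_ram) * cost_per_ram
```

Because the penalty is an absolute difference, a job whose usage matches its request exactly pays only the base cost.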

Job cost and $P$ are stored separately as the former represents "true" cost, while the latter can be used to measure the efficiency of a job's resource requests via an artificial penalty. When analyzing costs, node instance type should be controlled for because cost per job is influenced by $\text{Cost per CPU}_i$ and $\text{Cost per RAM}_i$, which will vary among instance types.

For example:

- instance costs 100 cents per hour
- instance has 100GB memory and 100 cores
- job duration was 30 minutes
- resource requests: 2GB memory, 2 cores
- mean usage: 1GB memory, 5 cores

$C_i$ = 50 cents (cost of the instance while the job ran)

$$\text{Cost per CPU}_i = \frac{50 \times 0.5}{100} = 0.25 \text{ cents}$$

$$\text{Cost per RAM}_i = \frac{50 \times 0.5}{100} = 0.25 \text{ cents}$$

therefore,

$$\text{Cost for RAM}_i = 0.25 \times 1 = 0.25 \text{ cents}$$

$$\text{Cost for CPUs}_i = 0.25 \times 5 = 1.25 \text{ cents}$$

computing the penalties:

$$\text{P}_{\text{CPU}} = |5-2|= 3$$

$$\text{P}_{\text{RAM}} = |1-2| = 1$$

In this case, we penalize the job for using more CPU than it requested, which could have crowded out other jobs. We also penalize the job for using less RAM than requested because when k8s scheduled the job, it blocked those resources from being scheduled for other work.

"total" cost:

$$((5 + 3) \times 0.25 + (1 + 1) \times 0.25) = 2.5 \text{ cents}$$
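Since job cost and $P$ are stored separately, it may help to split this total into its two parts. Using the example inputs (mean usage of 5 cores and 1GB, penalties of 3 and 1, and 0.25 cents per unit of either resource):

$$\text{Job Cost} = 5 \times 0.25 + 1 \times 0.25 = 1.5 \text{ cents}$$

$$\text{Penalty} = 3 \times 0.25 + 1 \times 0.25 = 1 \text{ cent}$$

$$\text{Job Cost} + \text{Penalty} = 1.5 + 1 = 2.5 \text{ cents}$$

In this example the penalty accounts for 40% of the "total" cost.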

cmelone · 2024-10-10T20:40:24Z

requesting @tgamblin review of cost formula not code

Pulled from #93 to collect data before deciding on final cost formula.

Adds the following columns:

- nodes: AWS zone (`zone`)
- nodes: Instance capacity type (`capacity_type`)
- jobs: Cost of the instance during the lifetime of the job (`job_cost_instance`)

The `job_cost_instance` calculation is made by averaging the value of `karpenter_cloudprovider_instance_type_offering_price_estimate` during the lifetime of the node and multiplying by the duration of the build job. **This is not a cost per job metric.** Use information like cpu_mean, mem_mean, etc to calculate the cost of the job in combination with `job_cost_instance`.

Tested with `dev/bulk_collect.py` and verified that large migrations work correctly on the prod and staging db.
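The `job_cost_instance` calculation described above could look roughly like the sketch below. It is not the collector's actual code: the function signature is hypothetical, and it assumes the price estimate samples are hourly prices and that the job duration is available in seconds.

```python
from statistics import fmean
from typing import Sequence


def job_cost_instance(price_samples: Sequence[float], job_duration_seconds: float) -> float:
    """Mean sampled instance price (assumed per hour) times job duration in hours."""
    if not price_samples:
        raise ValueError("need at least one price sample")
    hours = job_duration_seconds / 3600
    return fmean(price_samples) * hours
```

The result is the cost of the whole instance while the job ran, so it still has to be combined with usage figures such as cpu_mean and mem_mean to arrive at a cost for the job itself.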

cmelone self-assigned this Aug 20, 2024

cmelone requested a review from alecbcs August 20, 2024 20:30

cmelone added the feature New feature or request label Aug 20, 2024

cmelone marked this pull request as draft August 27, 2024 15:20

cmelone force-pushed the add/collect-cost branch from 0c534d6 to c8b4484 Compare October 10, 2024 18:44

cmelone changed the title ~~Collect node spot instance costs~~ Cost per job Oct 10, 2024

cmelone force-pushed the add/collect-cost branch 2 times, most recently from 4016e90 to b3c5b7b Compare October 10, 2024 20:19

cmelone marked this pull request as ready for review October 10, 2024 20:37

cmelone requested a review from tgamblin October 10, 2024 20:39

cmelone mentioned this pull request Oct 23, 2024

Add resource limits #106

Open

cmelone added 3 commits October 23, 2024 18:50

Add cost per job metrics

3dd0ec3

Closes #75

Computes and stores the following metrics:

- cpu_cost: cost of using CPU resources on the node, based on the CPU request of the job
- mem_cost
- cpu_penalty: penalty factor that represents the over or under allocation of CPU resources
- mem_penalty

simplify job cost and penalty calculation

3cbbd83

Collect cost_per_cpu and cost_per_mem

9672ac8

if in the future we'd like to make modifications with this cost analysis it'd be useful to have these metrics as the basis of the calculation

cmelone force-pushed the add/collect-cost branch from 076aab6 to 9672ac8 Compare October 23, 2024 21:51

cmelone mentioned this pull request Nov 12, 2024

Collect baseline job costs #132

Open
