Predictive model to find optimal reduction parameters #459

Open · balanz24 opened this issue May 14, 2024 · 4 comments
@balanz24 (Contributor)

This is a possible solution to #418

Our model aims to predict the optimal split_every value that makes the reduction as fast as possible.
This parameter affects the input data size of each function, the total number of stages, and the number of functions per stage.
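
As a rough illustration of that relationship, here is a sketch of the tree-reduction arithmetic (not Cubed's actual planner):

```python
import math

def reduction_shape(n_chunks: int, split_every: int):
    """Number of stages and tasks per stage in a tree reduction where
    each function combines `split_every` chunks into one."""
    tasks_per_stage = []
    remaining = n_chunks
    while remaining > 1:
        remaining = math.ceil(remaining / split_every)
        tasks_per_stage.append(remaining)
    return len(tasks_per_stage), tasks_per_stage

print(reduction_shape(256, 4))  # (4, [64, 16, 4, 1])
```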

Evaluation has only been done with Lithops so far, but it should be extended to other backends.

The model predicts 3 components of the execution time separately:

  • Invocation: the time it takes for the functions to start executing Cubed code after the parallel map job is submitted.
  • I/O: the time that functions spend reading and writing Zarr files from/to object storage.
  • CPU: the CPU time that functions spend performing the reduction computations.

Invocation and CPU times are easy to predict using linear regression, since they grow linearly with the size of the dataset being reduced. The I/O time is predicted using the primula-plug presented in the Primula paper.
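
For the linear components, the fit could look roughly like this (a minimal sketch; the sample numbers are made up, real ones would come from profiled Lithops runs):

```python
import numpy as np

# Hypothetical measurements: input size per function (MB) vs. CPU time (s).
sizes_mb = np.array([64, 128, 256, 512, 1024])
cpu_seconds = np.array([0.05, 0.11, 0.21, 0.42, 0.85])

# Ordinary least-squares fit of a line: time = slope * size + intercept.
slope, intercept = np.polyfit(sizes_mb, cpu_seconds, deg=1)

def predict_cpu_seconds(size_mb: float) -> float:
    return slope * size_mb + intercept
```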

Here we see a comparison of the real vs. predicted times in a 15 GB quadratic-means test, measured using Lithops on AWS Lambda and S3.

As we can see, the model is able to predict the optimal value, split_every=4, which gives the lowest execution time.
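
Selecting the parameter then amounts to minimizing the predicted total over a set of candidates (a sketch; the predictor functions stand in for the three fitted components described above):

```python
def best_split_every(n_chunks, predictors, candidates=(2, 3, 4, 6, 8, 16)):
    """Pick the candidate with the lowest predicted total execution time.

    predictors: iterable of functions (n_chunks, split_every) -> seconds,
    one per component (invocation, I/O, CPU).
    """
    def predicted_total(split_every):
        return sum(p(n_chunks, split_every) for p in predictors)
    return min(candidates, key=predicted_total)
```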

Some observations on the results:

  • Invocation overheads have a very significant weight in the total time; other backends remain to be evaluated to see whether their overheads are lower.
  • Since the CPU time seems to be insignificant, the model could be integrated into Cubed considering only the I/O and invocation overheads.
@tomwhite (Member)

Thanks for doing this work @balanz24!

It would be interesting to see if the results changed with larger datasets on the same quadratic means. In particular, does the optimal value of split_every increase once the number of tasks exceeds the number of workers (1000 on AWS Lambda)?

Making this easy for Cubed users to use, or integrating it as a plugin, would be a great addition.

@balanz24 (Contributor, Author)

Over the past week I've been testing the model with larger datasets, and the results look promising.

In particular, I've used a >300 GB dataset, setting optimize_graph=False to avoid fusing operations so that stages have more than 1000 workers, as you suggested. The predictions are farther from the real values than for smaller datasets, but the trend remains the same: the model still finds the optimal split_every, which is indeed higher (around 6 to 8 in this case).
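
For reference, the kind of run described above might look roughly like this in Cubed (a sketch: the bucket, array sizes and chunking are made up, and the keyword names should be checked against the Cubed version in use):

```python
import cubed
import cubed.array_api as xp
import cubed.random

spec = cubed.Spec(work_dir="s3://my-bucket/tmp", allowed_mem="2GB")  # hypothetical bucket

# ~320 GB of float64 input with 200 MB chunks -> 1600 chunks in the first stage.
a = cubed.random.random((400_000, 100_000), chunks=(5_000, 5_000), spec=spec)

m = xp.mean(a * a)  # mean of squares, the core of the quadratic-means test
m.compute(optimize_graph=False)  # disable fusion so every stage stays visible
```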

The next steps would be:

  • Adapting the model to work with the Modal backend.
  • Checking whether the model can also predict cost, not only execution time (see the rough sketch below).
  • Simplifying the model in order to integrate it into Cubed (we can discuss it in the next meeting).
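
On the cost point, a first cut could reuse the same predicted quantities with a pricing model (a sketch; the rates below are illustrative AWS Lambda/S3-style prices, not authoritative):

```python
LAMBDA_PER_GB_SECOND = 0.0000166667  # $ per GB-second (illustrative rate)
S3_PER_PUT = 0.000005                # $ per PUT request (illustrative rate)
S3_PER_GET = 0.0000004               # $ per GET request (illustrative rate)

def predict_cost(n_tasks, mem_gb, seconds_per_task, n_puts, n_gets):
    """Rough cost estimate: compute cost plus object-storage request cost."""
    compute = n_tasks * mem_gb * seconds_per_task * LAMBDA_PER_GB_SECOND
    requests = n_puts * S3_PER_PUT + n_gets * S3_PER_GET
    return compute + requests
```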

@TomNicholas (Member)

> we can discuss it in the next meeting

FYI we're gonna skip the meeting this coming Monday - see https://discourse.pangeo.io/t/new-working-group-for-distributed-array-computing/2734/56?u=tomnicholas

@tomwhite (Member)

tomwhite commented Jun 3, 2024

> In particular, I've used a >300 GB dataset, setting optimize_graph=False to avoid fusing operations so that stages have more than 1000 workers, as you suggested.

I wouldn't set optimize_graph=False, as this disables all optimization. What I was suggesting was to scale up so that the number of chunks in the input is over 1000, so that all workers are used. Even with fusion there would still be over 1000 tasks at the first stage of the computation.
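
Back-of-the-envelope for that threshold (assuming 100 MB chunks, an assumption for illustration):

```python
chunk_mb = 100       # assumed chunk size
max_workers = 1000   # AWS Lambda concurrency mentioned above
min_input_gb = chunk_mb * max_workers / 1024
print(f"input must exceed ~{min_input_gb:.0f} GB")  # ~98 GB
```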

tomwhite mentioned this issue on Aug 1, 2024