
[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035

Open
Tom-Newton opened this issue Feb 11, 2024 · 9 comments

Comments

@Tom-Newton
Contributor

Describe the enhancement requested

Optimisation to #37511
Child of #18014

When reading from Azure Blob Storage, the bandwidth we get per connection is very dependent on the latency to the filesystem. To achieve good bandwidth when latency is high, far greater concurrency is needed. For example, this is relevant when reading from blob storage in a different region from your compute.

As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:

  1. Arrow will make concurrent calls to ReadAt, one per column and row group combination. At most we can have one concurrent connection per combination, so for small Parquet files this may be less concurrency than we would like.
  2. Within ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. The purpose of this issue is to make the config options for this parallelism configurable by the user (see the sketch below).
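
For reference, a minimal sketch of the knobs involved, assuming the SDK's DownloadBlobToOptions::TransferOptions members (InitialChunkSize, ChunkSize, Concurrency) at the version linked above; the values shown are illustrative, not the SDK defaults:

    // Sketch: forwarding transfer tuning to the SDK's ranged download.
    // DownloadBlobToOptions::TransferOptions is what DownloadTo consults when
    // deciding whether and how to split a download into parallel chunks.
    #include <cstddef>
    #include <cstdint>
    #include <azure/storage/blobs.hpp>

    void DownloadRange(Azure::Storage::Blobs::BlobClient& client, uint8_t* buffer,
                       int64_t offset, int64_t length) {
      Azure::Storage::Blobs::DownloadBlobToOptions options;
      Azure::Core::Http::HttpRange range;
      range.Offset = offset;
      range.Length = length;
      options.Range = range;
      // Illustrative values; the snake_case names used later in this thread
      // (initial_chunk_size, chunk_size, concurrency) map onto these members.
      options.TransferOptions.InitialChunkSize = 2 * 1024 * 1024;
      options.TransferOptions.ChunkSize = 2 * 1024 * 1024;
      options.TransferOptions.Concurrency = 10;
      client.DownloadTo(buffer, static_cast<size_t>(length), options);
    }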

Component(s)

C++

@Tom-Newton
Contributor Author

I'm going to run some performance tests for my use case.

@Tom-Newton
Contributor Author

Tom-Newton commented Feb 12, 2024

I can measure a clear performance advantage from tuning these parameters a bit.
Defaults: median=9.0 seconds P95=12.5 seconds

Setting

        initial_chunk_size=2*1024*1024,
        chunk_size=2*1024*1024,
        concurrency=10

median=5.6 seconds P95=6.7 seconds

I did not try especially hard to tune this, but I think this is enough evidence to justify exposing these config options. Exposing these config options is easy.

@Tom-Newton
Contributor Author

take

@Tom-Newton
Contributor Author

Given how small this change is, I think I will make one PR covering both C++ and Python. Therefore I will wait for #40021 to merge first.

@felipecrv
Contributor

I think exposing all these settings in AzureOptions may be premature. They are per-request settings, so allowing a config in AzureOptions would force the internal implementation to stick to one set of values for every DownloadTo request.

My suggestion: we keep statistics about the ReadAt calls in a file handle and adjust the options as the calls come in (a rough sketch of such per-handle statistics is below). After this exercise we might expose settings in AzureOptions that express which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs throughput, sequential vs random.

An alternative to the named policies could be: we expose only read_file_max_concurrency [1], and we keep learning a simple model that sets the best initial_chunk_size and chunk_size parameters adaptively. The goal becomes minimizing the latency of each individual ReadAt call; the user can set a low max concurrency factor for latency-optimized workloads and a high one for throughput-optimized workloads. A low concurrency factor would also be useful when the user of the FileSystem interface is managing multiple threads themselves.

[1] set it to some multiple of CPU cores by default
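
As a rough illustration (names are hypothetical, not existing Arrow API), the per-handle statistics mentioned above could be as simple as:

    // Hypothetical per-file-handle statistics that an adaptive policy could
    // consult when choosing TransferOptions for the next DownloadTo call.
    #include <chrono>
    #include <cstdint>

    struct ReadAtStats {
      int64_t calls = 0;
      int64_t bytes = 0;
      std::chrono::nanoseconds total_time{0};

      void Record(int64_t nbytes, std::chrono::nanoseconds elapsed) {
        ++calls;
        bytes += nbytes;
        total_time += elapsed;
      }

      // Observed throughput so far; a policy could keep raising concurrency or
      // shrinking chunk sizes while this keeps improving.
      double BytesPerSecond() const {
        if (total_time.count() == 0) return 0.0;
        return static_cast<double>(bytes) * 1e9 /
               static_cast<double>(total_time.count());
      }
    };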

@Tom-Newton
Contributor Author

These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

@felipecrv
Contributor

These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

Start exposing read_file_max_concurrency with a generous default value and hardcode values for initial_chunk_size and chunk_size inside azurefs.cc to your liking (roughly as sketched below), then open a separate issue that I can work on.

My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) would prevent future optimizations on our side based on real-world workloads.

The default settings of the SDK seem to be very conservative regarding parallelization, because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains their huge threshold for parallelization (>256 MB). We know Arrow workloads are probably small-to-medium ReadAt calls to the same file on a single thread (a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: lower initial_chunk_size and chunk_size with a high concurrency.
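
Concretely, that split might look something like this (read_file_max_concurrency is a hypothetical option name that does not exist in AzureOptions today, and the hardcoded chunk sizes are only illustrative):

    // Hypothetical: expose only a concurrency knob on AzureOptions and
    // hardcode the chunk sizes inside azurefs.cc.
    #include <cstdint>
    #include <azure/storage/blobs.hpp>

    struct AzureOptions {
      // ... existing options ...
      int32_t read_file_max_concurrency = 10;  // e.g. a multiple of CPU cores
    };

    Azure::Storage::Blobs::DownloadBlobToOptions MakeDownloadOptions(
        const AzureOptions& options) {
      Azure::Storage::Blobs::DownloadBlobToOptions download_options;
      // Hardcoded to values that suit typical Arrow ReadAt patterns; revisit
      // once real-world benchmarks exist.
      download_options.TransferOptions.InitialChunkSize = 2 * 1024 * 1024;
      download_options.TransferOptions.ChunkSize = 2 * 1024 * 1024;
      // The only user-visible knob.
      download_options.TransferOptions.Concurrency = options.read_file_max_concurrency;
      return download_options;
    }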

Tom-Newton removed their assignment Mar 20, 2024
@Tom-Newton
Contributor Author

I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date.

@Tom-Newton
Contributor Author

Tom-Newton commented May 19, 2024

I've been thinking a bit about a policy we can use to set these parameters automatically (as @felipecrv suggested). So far my only idea is to make each call to ReadAt one iteration of an optimisation algorithm, e.g. gradient descent optimising for ReadAt duration.

If we imagine varying chunk_size from very small to very large, the optimal performance is going to be somewhere in the middle, and I expect the optimum will depend on latency and bandwidth between blob storage and the client.
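
A very rough sketch of that idea (purely illustrative, not a worked-out algorithm): treat each ReadAt as one step of a hill-climbing search on chunk_size, reversing direction whenever the measured throughput gets worse.

    // Hypothetical hill-climbing tuner: after each ReadAt, keep moving the
    // chunk size in the same direction if throughput improved, otherwise
    // reverse direction. Bounds and step size are arbitrary placeholders.
    #include <algorithm>
    #include <cstdint>

    class ChunkSizeTuner {
     public:
      int64_t chunk_size() const { return chunk_size_; }

      void Observe(double bytes_per_second) {
        if (bytes_per_second < last_throughput_) {
          step_ = -step_;  // the last adjustment made things worse
        }
        last_throughput_ = bytes_per_second;
        chunk_size_ = std::clamp<int64_t>(chunk_size_ + step_, kMin, kMax);
      }

     private:
      static constexpr int64_t kMin = 1 * 1024 * 1024;
      static constexpr int64_t kMax = 64 * 1024 * 1024;
      int64_t chunk_size_ = 4 * 1024 * 1024;
      int64_t step_ = 1 * 1024 * 1024;
      double last_throughput_ = 0.0;
    };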

I've also taken a bit of a look at how azcopy implements this, since my experience is that azcopy is very fast in a wide variety of situations without needing any configuration.
I thought it was interesting that it deliberately tries to read out of order: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21
I think the downloading actually happens here: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332
There is also mention of applying some auto-correct policy to the block size if the user did not set it.

I'm definitely going to dig a bit deeper into what azcopy does before I commit to a particular strategy.
