[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035
Comments
I'm going to run some performance tests for my use case.
I can measure a clear performance advantage from tuning these parameters a bit. Setting
median=5.6 seconds P95=6.7 seconds I did not try especially hard to tune this, but I think this is enough evidence to justify exposing these config options. Exposing them is easy.
take
Given how small this change is, I think I will make one PR for C++ and Python. Therefore I will wait for #40021 to merge first.
I think exposing all these settings in My suggestion: we keep statistics about the An alternative to the named policies can be: we expose only [1] set it to some multiple of CPU cores by default
These are all good suggestions, but they are a lot more complex. Personally, I would not be comfortable committing to implement something like that.
Start exposing My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) prevent future optimizations on our side based on real-world workloads. The default settings of the SDK seem to be very conservative regarding parallelization because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads, which explains their huge threshold for parallelization (
I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date.
I've been thinking a bit about a policy we can use to set these parameters automatically (as @felipecrv suggested). So far my only idea is to make each call to If we imagine varying I've also taken a bit of a look at how I'm definitely going to dig a bit deeper into what
Describe the enhancement requested
Optimisation to #37511
Child of #18014
When reading from Azure Blob Storage, the bandwidth we get per connection is very dependent on the latency to the filesystem. To achieve good bandwidth with high latency, far greater concurrency is needed. For example, this is relevant when reading from blob storage in a different region from your compute.
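A rough back-of-the-envelope model makes the latency effect concrete. The sketch below assumes each range request pays one round trip before data flows, that the connection then streams at a fixed rate, and that connections do not contend with each other. The function names and all numbers are illustrative assumptions, not measurements from Azure or anything in Arrow.

```cpp
#include <cassert>
#include <cmath>

// Effective throughput (bytes/s) of ONE connection that issues sequential
// range requests of `chunk_bytes`, paying `latency_s` per request before
// streaming at `link_bytes_per_s`. Simplified, hypothetical model.
double PerConnectionThroughput(double chunk_bytes, double latency_s,
                               double link_bytes_per_s) {
  assert(chunk_bytes > 0 && latency_s >= 0 && link_bytes_per_s > 0);
  const double transfer_s = chunk_bytes / link_bytes_per_s;
  return chunk_bytes / (latency_s + transfer_s);
}

// Number of concurrent connections needed to reach `target_bytes_per_s`,
// assuming throughput scales linearly with the number of connections.
int ConnectionsNeeded(double target_bytes_per_s, double chunk_bytes,
                      double latency_s, double link_bytes_per_s) {
  const double per_connection =
      PerConnectionThroughput(chunk_bytes, latency_s, link_bytes_per_s);
  return static_cast<int>(std::ceil(target_bytes_per_s / per_connection));
}
```

Under this model, with 4 MiB requests on a 32 MiB/s connection, a 125 ms round trip halves the per-connection throughput to 16 MiB/s, so reaching 64 MiB/s takes 4 connections. Tripling the latency to 375 ms doubles that to 8 connections.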
As an example, let's consider reading a Parquet file. There are 2 levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:
1. ReadAt is called for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small Parquet files this may be less than we would like.
2. Inside ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516.
The purpose of this issue is to make the options for this parallelism configurable by the user.
Component(s)
C++
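To illustrate why the transfer settings discussed in this issue matter for small reads, here is a minimal sketch of how a single read could be split into requests. It assumes the download is issued as one initial request followed by fixed-size chunks fetched by a bounded number of workers. `TransferConfig`, `ByteRange`, and `PlanChunks` are hypothetical names for illustration only; they are not the Azure SDK's or Arrow's API, and the values are not claimed to be any library's defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical knobs mirroring the kind of settings the issue wants exposed.
struct TransferConfig {
  int64_t initial_chunk_size;  // size of the first (non-parallel) request
  int64_t chunk_size;          // size of each subsequent request
  int32_t concurrency;         // max requests in flight after the first
};

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Split [offset, offset + length) into one initial range followed by
// chunk_size-sized ranges. Only the ranges after the first are candidates
// for parallel download.
std::vector<ByteRange> PlanChunks(int64_t offset, int64_t length,
                                  const TransferConfig& cfg) {
  assert(cfg.initial_chunk_size > 0 && cfg.chunk_size > 0);
  std::vector<ByteRange> ranges;
  if (length <= 0) return ranges;
  const int64_t first = std::min(length, cfg.initial_chunk_size);
  ranges.push_back({offset, first});
  const int64_t end = offset + length;
  for (int64_t pos = offset + first; pos < end;) {
    const int64_t len = std::min(cfg.chunk_size, end - pos);
    ranges.push_back({pos, len});
    pos += len;
  }
  return ranges;
}

// How many requests can actually overlap for a given plan.
int64_t ParallelRequests(const std::vector<ByteRange>& ranges,
                         const TransferConfig& cfg) {
  if (ranges.size() <= 1) return ranges.size();
  return std::min<int64_t>(cfg.concurrency,
                           static_cast<int64_t>(ranges.size()) - 1);
}
```

With a very large initial chunk (for example 256 MiB), a 10 MiB column read becomes a single request and gets no parallelism at all. Lowering the initial chunk and chunk sizes to 1 MiB turns the same read into 10 requests, up to `concurrency` of which can overlap.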