Improving file eviction performance #696

Draft: wants to merge 4 commits into base: master

Commits on Sep 1, 2023

  1. Reject high-cost requests instead of creating more OS threads when overloaded

    We have been using a file removal semaphore with weight 5,000 (half of Go's
    default limit of 10,000 OS threads, beyond which the Go runtime crashes), in
    an attempt to avoid crashing when the filesystem/storage layer can't keep up
    with our requirements.
    
    This change renames that semaphore to `diskWaitSem` and also uses it for
    disk-write operations. When the semaphore cannot be acquired for disk-writes,
    we return HTTP 503 (service unavailable) or gRPC RESOURCE_EXHAUSTED error codes
    to the client.
    
    Relates to buchgr#638
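
The rejection path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: a buffered channel stands in for the weighted semaphore, and `putHandler`, `putStatus`, `tryAcquire`, and the `/cas/abc` path are hypothetical names. Only the HTTP side is shown; a gRPC handler would return RESOURCE_EXHAUSTED at the same point.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// diskWaitSem bounds how many requests may wait on disk I/O at once.
// Assumption: a buffered channel is used here in place of a weighted semaphore.
type diskWaitSem chan struct{}

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (s diskWaitSem) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously taken with tryAcquire.
func (s diskWaitSem) release() { <-s }

// putHandler (hypothetical) rejects the request with HTTP 503 when no slot
// is free, instead of letting another goroutine block on the disk.
func putHandler(sem diskWaitSem, write func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !sem.tryAcquire() {
			// A gRPC service would return RESOURCE_EXHAUSTED here.
			http.Error(w, "too many pending disk writes", http.StatusServiceUnavailable)
			return
		}
		defer sem.release()
		if err := write(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// putStatus sends one PUT through the handler and returns the status code.
func putStatus(h http.Handler) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPut, "/cas/abc", strings.NewReader("data"))
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	sem := make(diskWaitSem, 1) // the commit message mentions a weight of 5,000
	h := putHandler(sem, func() error { return nil })
	fmt.Println(putStatus(h)) // a slot is free, so the write is accepted
	sem.tryAcquire()          // simulate a stuck disk write holding the only slot
	fmt.Println(putStatus(h)) // no slot is free, so the request is rejected
}
```

The point of the non-blocking acquire is that an overloaded server answers quickly with an error the client can retry on, rather than parking yet another OS thread in a blocked syscall.
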
    mostynb authored and ulrfa committed Sep 1, 2023
    1bc4fc9
  2. Optimize file removals (part 1)

    This commit:
    
     - Performs evictions from a single background goroutine that receives
       files to be removed via a channel.
    
     - Throttles the number of concurrent Put requests with a semaphore
       (without rejecting them).
    
    In order to:
    
     - Avoid crashing on high load.
    
     - Achieve up to 3 times faster cache eviction.
    
     - Achieve up to 70% higher write throughput in scenarios with many
       cache evictions.
    
    The cache can grow above max_size when asynchronous file removals do
    not keep up with new file writes. This is addressed in the following
    part 2 commit. Previous bazel-remote versions masked this issue by
    running out of operating system threads and crashing instead.
    
    Change-Id: Ifa2ed6c5a093adbb407750a0d38a4181a07f227f
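
The two mechanisms in this commit, a single background goroutine fed by a channel and a blocking semaphore around Put, might look something like the sketch below. All names (`evictor`, `throttledPuts`, `demo`), the queue size, and the limit of 4 are assumptions for illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// evictor removes files from a single background goroutine.
type evictor struct {
	queue chan string   // paths waiting to be removed
	done  chan struct{} // closed once the worker has drained the queue
}

func newEvictor(buffer int) *evictor {
	e := &evictor{queue: make(chan string, buffer), done: make(chan struct{})}
	go func() {
		defer close(e.done)
		for path := range e.queue {
			// Errors are ignored in this sketch; a real cache would log them.
			_ = os.Remove(path)
		}
	}()
	return e
}

// evict hands a path to the worker; it blocks only when the queue is full.
func (e *evictor) evict(path string) { e.queue <- path }

// stop closes the queue and waits for pending removals to finish.
func (e *evictor) stop() {
	close(e.queue)
	<-e.done
}

// throttledPuts runs n writes with at most limit in flight. Callers wait
// for a slot; nothing is rejected.
func throttledPuts(n, limit int, write func(i int)) {
	slots := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slots <- struct{}{}        // blocks while limit writes are running
			defer func() { <-slots }() // free the slot when the write is done
			write(i)
		}(i)
	}
	wg.Wait()
}

// demo writes n files under throttling, evicts them all through the single
// worker, and returns how many files are left in the directory.
func demo(n int) (int, error) {
	dir, err := os.MkdirTemp("", "evict-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	name := func(i int) string { return filepath.Join(dir, fmt.Sprintf("blob-%d", i)) }
	throttledPuts(n, 4, func(i int) {
		_ = os.WriteFile(name(i), []byte("x"), 0o644)
	})

	e := newEvictor(16)
	for i := 0; i < n; i++ {
		e.evict(name(i))
	}
	e.stop()

	left, err := os.ReadDir(dir)
	return len(left), err
}

func main() {
	left, err := demo(100)
	fmt.Println(left, err) // no files should remain after stop() returns
}
```

Because removals are queued rather than performed inline, writers can get ahead of the worker, which is the max_size overshoot the commit message describes.
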
    ulrfa committed Sep 1, 2023
    62b1b16
  3. Introduce disk_size_limit (part 2)

    Introduce a disk_size_limit for the total disk space of:
    
     - Files currently in the cache.
     - Reserved space for files currently being uploaded.
     - Evicted files not yet removed.
    
    Setting this limit is optional (at least for now).
    
    Reservations for Put requests are rejected when
    disk_size_limit is exceeded.
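
One way to picture the accounting behind this limit is sketched below. It is only an illustration under assumptions: the type and method names are hypothetical, and the exact comparison (rejecting when the new total would exceed the limit) is a guess at the semantics, not something stated in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errOverLimit = errors.New("disk_size_limit exceeded")

// diskAccounting tracks the three quantities that count against the limit.
type diskAccounting struct {
	mu             sync.Mutex
	cached         int64 // files currently in the cache
	reserved       int64 // space reserved for uploads in progress
	pendingRemoval int64 // evicted files not yet removed from disk
	limit          int64 // 0 means no limit is configured (it is optional)
}

func (d *diskAccounting) total() int64 {
	return d.cached + d.reserved + d.pendingRemoval
}

// reserve claims size bytes for an upload, or fails if that would push the
// total above the limit (assumed semantics).
func (d *diskAccounting) reserve(size int64) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.limit > 0 && d.total()+size > d.limit {
		return errOverLimit
	}
	d.reserved += size
	return nil
}

// commit moves a finished upload from reserved to cached.
func (d *diskAccounting) commit(size int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.reserved -= size
	d.cached += size
}

// evict marks size bytes as evicted; they still occupy disk until removed.
func (d *diskAccounting) evict(size int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.cached -= size
	d.pendingRemoval += size
}

// removed records that the background removal of size bytes has finished.
func (d *diskAccounting) removed(size int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pendingRemoval -= size
}

func main() {
	d := &diskAccounting{limit: 100}
	fmt.Println(d.reserve(60)) // accepted: 60 of 100 bytes in use
	d.commit(60)
	d.evict(60)                // evicted, but the file is still on disk
	fmt.Println(d.reserve(50)) // rejected: 60 pending removal + 50 > 100
	d.removed(60)
	fmt.Println(d.reserve(50)) // accepted once the removal has completed
}
```

Counting evicted-but-not-yet-removed files is what keeps a slow removal backlog from silently turning into disk exhaustion.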
    
    The Prometheus gauge bazel_remote_disk_cache_size_bytes now
    reports the maximum value over the previous 30 seconds, so
    that short spikes are visible when tuning the
    disk_size_limit configuration.
    
    There is also a new Prometheus gauge,
    bazel_remote_disk_cache_size_bytes_limit, showing the currently
    configured limits. It helps visualize whether the current size
    is getting close to the limit and helps with tuning
    disk_size_limit.
    
    Change-Id: Iaec29af9a2e02796c29f294b993989783d575c4b
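
The "max value for the previous 30 seconds" behaviour of the size gauge could be computed with a sliding-window maximum along these lines. This is a sketch under assumptions: `windowMax`, `observe`, `max`, and `atSecond` are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and the PR may compute the value differently.

```go
package main

import (
	"fmt"
	"time"
)

type sample struct {
	at   time.Time
	size int64
}

// windowMax reports the largest value observed within the last window.
// samples is kept with increasing times and strictly decreasing sizes,
// so the first element is always the current maximum.
type windowMax struct {
	window  time.Duration
	samples []sample
}

// observe records the cache size at time now.
func (w *windowMax) observe(now time.Time, size int64) {
	// Older samples that are not larger can never be the maximum again.
	for len(w.samples) > 0 && w.samples[len(w.samples)-1].size <= size {
		w.samples = w.samples[:len(w.samples)-1]
	}
	w.samples = append(w.samples, sample{at: now, size: size})
}

// max returns the value a gauge would export at time now.
func (w *windowMax) max(now time.Time) int64 {
	cutoff := now.Add(-w.window)
	// Expire samples older than the window, but always keep the newest one:
	// if nothing changed recently, the last known size is still current.
	for len(w.samples) > 1 && w.samples[0].at.Before(cutoff) {
		w.samples = w.samples[1:]
	}
	if len(w.samples) == 0 {
		return 0
	}
	return w.samples[0].size
}

// atSecond returns a fixed base time plus sec seconds, for the example.
func atSecond(sec int) time.Time {
	base := time.Date(2023, time.September, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

func main() {
	w := &windowMax{window: 30 * time.Second}
	w.observe(atSecond(0), 100)
	w.observe(atSecond(10), 500) // short spike
	w.observe(atSecond(20), 200)
	fmt.Println(w.max(atSecond(25))) // spike is still inside the window
	fmt.Println(w.max(atSecond(45))) // spike has aged out
}
```

Exporting the windowed maximum instead of the instantaneous size means a spike between two scrapes still shows up in the metric.
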
    ulrfa committed Sep 1, 2023
    8196189
  4. Prevent verbose error log on overload (part 3)

    Use the access logger instead of the error logger when
    requests are rejected due to overload, in order
    to avoid an overly verbose error log when many requests
    are rejected.
    
    The number of rejected requests can also be monitored via the
    status codes in the Prometheus metrics
    http_request_duration_seconds_count and
    grpc_server_handled_total.
    
    Change-Id: I5d5999360b3e49b153fd6f122e2244d4789cf2ff
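
The logging change can be illustrated with a small routing function. This is a sketch only: the `loggers` type, `logPutResult`, `demo`, and the message formats are hypothetical and do not reflect the server's actual log output.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

// loggers holds the two destinations the commit message distinguishes.
type loggers struct {
	accessLog *log.Logger
	errLog    *log.Logger
}

// logPutResult sends overload rejections to the access log and keeps the
// error log for failures that are not caused by overload.
func (l loggers) logPutResult(path string, status int, err error) {
	switch {
	case status == http.StatusServiceUnavailable:
		l.accessLog.Printf("PUT %d %s (rejected: overloaded)", status, path)
	case err != nil:
		l.errLog.Printf("PUT %s: %v", path, err)
	default:
		l.accessLog.Printf("PUT %d %s", status, path)
	}
}

// demo logs the given number of overload rejections plus one genuine
// failure, and returns how many lines each log received.
func demo(rejects int) (accessLines, errorLines int) {
	var accessBuf, errBuf bytes.Buffer
	l := loggers{
		accessLog: log.New(&accessBuf, "", 0),
		errLog:    log.New(&errBuf, "", 0),
	}
	for i := 0; i < rejects; i++ {
		l.logPutResult("/cas/abc", http.StatusServiceUnavailable, nil)
	}
	l.logPutResult("/cas/def", http.StatusInternalServerError, fmt.Errorf("disk failure"))

	newline := []byte("\n")
	return bytes.Count(accessBuf.Bytes(), newline), bytes.Count(errBuf.Bytes(), newline)
}

func main() {
	a, e := demo(1000)
	fmt.Println(a, e) // the error log stays short even with many rejections
}
```
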
    ulrfa committed Sep 1, 2023
    25527a8