update #4

Merged
merged 49 commits into from
Jan 25, 2023

Commits on Jan 22, 2023

  1. [Core][run_function_on_all_workers] deflake run_function_on_all_worke…

    …rs and reenable test (#31838)
    
    The import step of run_function_on_all_workers requires a job_id to run properly. After #30883, the worker might not have a job_id at startup, which caused run_function_on_all_workers to fail to execute on startup. To fix this, we defer starting the import_thread until job_config is initialized.
    scv119 authored Jan 22, 2023
    e9689ed
  2. [docs] Update reference K8s configuration in large cluster deployment…

    … guide. (#31307)
    
    Fixes an outdated K8s configuration reference in the large cluster deployment guide.
    
    Signed-off-by: Dmitri Gekhtman <[email protected]>
    DmitriGekhtman authored Jan 22, 2023
    c8c1da7

Commits on Jan 23, 2023

  1. Polish the new IA for dashboard (#31770)

    Signed-off-by: Alan Guo <[email protected]>
    
    Update to more closely match the design spec.
    alanwguo authored Jan 23, 2023
    cdb3780
  2. [Core][Bug Fix] Do not start the http server if include_dashboard is …

    …False. (#31666)
    
    Signed-off-by: SangBin Cho <[email protected]>
    
    Currently, when include_dashboard is False, there are 2 issues:

    - The output of ray.init and ray start still prints the dashboard URL.
    - Although we properly disable all the modules, we still start the HTTP server (which means users can still access the UI even though it doesn't work).

    This PR fixes both issues.
    rkooo567 authored Jan 23, 2023
    c694dae
  3. [Core][Worker] opportunistically start import_thread if job_id is set (

    …#31846)
    
    Why are these changes needed?
    Previously, the import thread started as soon as the worker started and imported the task/actor definitions. After #31838, it is deferred until the first task is sent, which means a longer delay on the first execution.
    To address the problem, we can opportunistically start the import thread right after it is created, if the job_id already exists.
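The combination of deferred and opportunistic start described above can be sketched roughly as follows. This is a minimal illustration, not Ray's worker code: `ImportThread`, `Worker`, and `on_job_config_initialized` are hypothetical names chosen for the example.

```python
import threading
from typing import Optional


class ImportThread:
    """Hypothetical stand-in for the worker's import thread."""

    def __init__(self) -> None:
        self.started = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # A real import thread would fetch task/actor definitions here.
        pass

    def start(self) -> None:
        # Guard against a second start, which threading would reject.
        if not self.started:
            self.started = True
            self._thread.start()


class Worker:
    """Hypothetical worker that may or may not know its job at startup."""

    def __init__(self, job_id: Optional[str] = None) -> None:
        self.job_id = job_id
        self.import_thread = ImportThread()
        # Opportunistic start: the job is already known, so do not wait.
        if self.job_id is not None:
            self.import_thread.start()

    def on_job_config_initialized(self, job_id: str) -> None:
        # Deferred start: the job only becomes known later.
        self.job_id = job_id
        self.import_thread.start()
```

A worker created with a job_id starts importing immediately; one created without it waits until the job is known.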
    scv119 authored Jan 23, 2023
    25d3d52
  4. [1/N][Advanced timeline] Include events/profiling events to the ray l…

    …ist tasks (#31776)
    
    This PR:

    - Changes scheduling_state -> state, which allows us to render the state count from the frontend (it is also consistent with other schemas).
    - Adds duration to the frontend.
    - Adds profile / regular events to the task state API.
    - Supports source-side filtering for job id.

    Remaining work for follow-ups:

    - Replace the timeline implementation to use the task API.
    - Implement the timeline frontend.
    - Display events / profile events from the dashboard/state API in a better format.
    rkooo567 authored Jan 23, 2023
    228b87f
  5. [Core][Enable gcs scheduler 7/n] Prefer actor owner's node (#30789)

    Signed-off-by: Chong-Li <[email protected]>
    
    This PR tries to finalize gcs actor scheduler, with the following changes:
    
    Similar to the legacy scheduler, gcs now prefers the actor owner's node. It's usually required by RL cases for better colocation.
    The normal task workers report zero CPU resources (instead of the allocated one) when they are blocked.
    Similar to the legacy scheduler, gcs now schedules empty-resource actors randomly.
    A new release test is added: multiple masters/drivers are creating (slave) actors concurrently. The case is able to expose the difference between centralized (gcs-based, less scheduling conflicts) and distributed schedulers.
    The feature flag is temporarily turned on when going through the CI pipeline. We still need another dedicated PR to turn it on by default.
    Chong-Li authored Jan 23, 2023
    ec3243d
  6. [Core] Raise deprecation warning when passing non-iterables to imap a…

    …nd imap_unordered methods of ray.util.multiprocessing.Pool (#31845)
    
    Context: #24237
    
    We will raise this warning in Ray 2.3. For Ray 2.4, we will merge #31799 which causes a TypeError to be raised instead.
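A warn-now, fail-later check of this kind could look roughly like the sketch below. It is an illustration only, not the code in ray.util.multiprocessing.Pool; `check_iterable` is a hypothetical helper name.

```python
import warnings
from collections.abc import Iterable


def check_iterable(arg):
    """Warn (rather than fail) when arg is not iterable; return it unchanged."""
    if not isinstance(arg, Iterable):
        warnings.warn(
            "Passing a non-iterable argument is deprecated and will raise "
            "a TypeError in a future release.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
    return arg
```

Note that `isinstance(arg, Iterable)` only detects objects that define `__iter__`; the follow-up step would replace the warning with `raise TypeError(...)`.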
    cadedaniel authored Jan 23, 2023
    96854d5
  7. [core] Delete old internal kv gcs (#31841)

    The code has been migrated to the StoreClientKV in PR and the old one is no longer useful. This PR deletes the old code.
    fishbone authored Jan 23, 2023
    9ab6421
  8. 7c58114
  9. Make sure all-to-all operator return num outputs so progress bar can …

    …work (#31825)
    
    To enable the new bulk execution backend: #30903
    
    Based on the most recent test (https://buildkite.com/ray-project/oss-ci-build-pr/builds/9947#_), this should be the last issue to fix.
    (Note: the Dataset test failure is not real, as all tests pass; it appears to be an issue with bazel test.)
    jianoaix authored Jan 23, 2023
    40c4571
  10. Revert "[AIR] Deprecations for 2.3 (#31763)" (#31866)

    This reverts commit 91b632b.
    Alex Wu authored Jan 23, 2023
    58386d0
  11. [Dataset] Use job-based file manager for SDK runner in iter_tensor_ba…

    …tches_benchmark_single_node test (#31864)
    
    The release test iter_tensor_batches_benchmark_single_node has failed the most recent run, due to the same issue discussed/addressed in #31752 and #31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates one remaining test to match this convention.
    
    Signed-off-by: Scott Lee <[email protected]>
    scottjlee authored Jan 23, 2023
    54d87bc
  12. Revert "Revert "[AIR] Deprecations for 2.3 (#31763)" (#31866)" (#31867)

    This reverts commit 58386d0.
    Alex Wu authored Jan 23, 2023
    46b3bef
  13. [ci][core] Fix state api large scale test log prefix mismath (#31865)

    Signed-off-by: rickyyx <[email protected]>
    
    We added the job id to the log prefix in #31772, which broke the test.
    
    This fixes the test to reflect the change.
    rickyyx authored Jan 23, 2023
    3b09a54
  14. Documentation about using existing Grafana instance. (#31667)

    Signed-off-by: Bukic Tomislav, AVL <[email protected]>
    
    Added some knowledge about using existing Grafana instances to the monitoring documentation as suggested on Slack channel.
    
    Fixed version of PR #31633
    tbukic authored Jan 23, 2023
    ee23cc8
  15. [Tune] Nevergrad optimizer with extra parameters (#31015)

    Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but it is not possible with the NevergradSearch class to pass these parameters down to the search algorithm. I would propose adding something like optimizer_kwargs to the NevergradSearch that get passed to the optimizer when instantiating it.
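The proposed pass-through can be illustrated with a small, self-contained sketch. Neither class below is part of Nevergrad or Ray Tune: `BudgetedOptimizer` and `SearchWrapper` are hypothetical stand-ins that only show how extra keyword arguments would be forwarded at instantiation time.

```python
from typing import Any, Dict, Optional, Type


class BudgetedOptimizer:
    """Hypothetical optimizer whose constructor requires `budget`."""

    def __init__(self, parametrization: int, budget: int) -> None:
        self.parametrization = parametrization
        self.budget = budget


class SearchWrapper:
    """Hypothetical search wrapper that forwards extra keyword arguments."""

    def __init__(
        self,
        optimizer_cls: Type[Any],
        optimizer_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._optimizer_cls = optimizer_cls
        # Copy so later mutation of the caller's dict has no effect here.
        self._optimizer_kwargs = dict(optimizer_kwargs or {})

    def build(self, parametrization: int) -> Any:
        # Required inputs such as `budget` reach the optimizer this way.
        return self._optimizer_cls(parametrization, **self._optimizer_kwargs)
```

Usage under these assumptions: `SearchWrapper(BudgetedOptimizer, {"budget": 100}).build(2)` returns an optimizer whose `budget` is 100.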
    
    Signed-off-by: yhna <[email protected]>
    Signed-off-by: YH <[email protected]>
    Signed-off-by: Younghwan Na <[email protected]>
    Signed-off-by: yhna940 <[email protected]>
    Co-authored-by: Justin Yu <[email protected]>
    yhna940 and justinvyu authored Jan 23, 2023
    33d4b14
  16. Revert "Simplify logging configuration. (#30863)" (#31858)

    This reverts commit 608276b.
    
    It looks like this breaks the backward compatibility of RLlib (it is supposed to print a warning first, but it prints the info log instead).
    rkooo567 authored Jan 23, 2023
    0c69020
  17. [core][state] Proper report of failure when job finishes and for fini…

    …shed tasks (#31761)
    
    This PR handles 2 edge cases when marking tasks as failed:

    1. When a job finishes, tasks that are still running should be marked as failed.
    2. Don't override a task's finished or failed timestamp when an ancestor failed.

    For 1:

    - It adds a handler function OnJobFinished as a job finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished is called to mark any non-terminated tasks as failed.

    For 2:

    - It adds an ancestor_failed_ts to keep track of the ancestor failure time in the task tree.
    - This extra bit of info is necessary because we should not override the timestamps of child tasks that have already failed or finished, yet we also need to know whether a task subtree has already been traversed (and all non-terminated children marked as failed) without traversing the task tree again.
    - When adding a new task event, if the task fails or its ancestor failed, its failed_ts and ancestor_failed_ts will be set, and we will traverse into the child task tree.
    - During the tree traversal, when a task already has its failed_ts or ancestor_failed_ts set, its children were already traversed when that timestamp was set.
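The traversal rule for the second edge case can be sketched in a few lines. This is a simplified model under stated assumptions, not the GCS task manager: the `Task` dataclass and `mark_failed` function are hypothetical, timestamps are plain integers, and the failing task itself only gets `failed_ts` set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    task_id: str
    children: List[str] = field(default_factory=list)
    finished_ts: Optional[int] = None
    failed_ts: Optional[int] = None
    ancestor_failed_ts: Optional[int] = None


def mark_failed(tasks: Dict[str, Task], task_id: str, ts: int) -> None:
    """Mark one task as failed at `ts` and propagate to its descendants."""
    tasks[task_id].failed_ts = ts
    stack = list(tasks[task_id].children)
    while stack:
        child = tasks[stack.pop()]
        if child.failed_ts is not None or child.ancestor_failed_ts is not None:
            # This subtree was already traversed when the timestamp was set.
            continue
        child.ancestor_failed_ts = ts
        if child.finished_ts is None:
            # Only non-terminated children become failed; finished ones
            # keep their original timestamp.
            child.failed_ts = ts
        stack.extend(child.children)
```

A finished child keeps its `finished_ts` but still records `ancestor_failed_ts`, so a later traversal can skip its subtree.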
    rickyyx authored Jan 23, 2023
    86bd6c6

Commits on Jan 24, 2023

  1. [Train] Change default NCCL_SOCKET_IFNAME to blacklist veth (#31824)

    Signed-off-by: amogkam [email protected]
    
    Closes #30333.
    
    Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: anyscale/product#8310.
    
    However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: #30333.
    
    Instead, we change to a blacklist so that NCCL does not use veth interface which resolves both issues (thanks @cadedaniel for identifying this!)
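As a rough illustration of the exclusion approach: NCCL treats a `NCCL_SOCKET_IFNAME` value that starts with `^` as a list of interface prefixes to avoid. The exact default string Ray Train uses is not stated in this description, so the value below is an assumption for the example, and `apply_default_nccl_ifname` is a hypothetical helper.

```python
import os
from typing import MutableMapping, Optional

# Assumed value for illustration: the leading "^" asks NCCL to exclude
# interfaces whose names start with any of the listed prefixes.
EXCLUDED_INTERFACES = "^lo,docker,veth"


def apply_default_nccl_ifname(
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Set the exclusion list only if the user has not chosen a value."""
    env = os.environ if env is None else env
    # setdefault keeps any interface setting the user already exported.
    env.setdefault("NCCL_SOCKET_IFNAME", EXCLUDED_INTERFACES)
    return env["NCCL_SOCKET_IFNAME"]
```

Because the list names what to avoid rather than what to allow, a wireless interface is not ruled out by the default.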
    
    Signed-off-by: amogkam <[email protected]>
    amogkam authored Jan 24, 2023
    9bebf57
  2. [Core] temporary stop testing python grpcio prerelease package (#31873)

    We test the grpcio prerelease to catch regressions early. We found that the latest prerelease is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing the grpcio prerelease until it's fixed.
    scv119 authored Jan 24, 2023
    d28007d
  3. [spark] ray on spark creates spark job using stage scheduling (#31397)

    - ray on spark creates spark job using stage scheduling, so that ray cluster spark job can use different task resources config ( spark.task.cpus / spark.task.resource.gpu.amount ), otherwise it has to use spark application level config, which is inconvenient on Databricks. 2 new arguments are added: num_cpus_per_node and num_gpus_per_node
    
    - improve ray worker memory allocation computation.
    
    - refactor _init_ray_cluster interface, make it fit better for instrumentation logging patching (make arguments key value only, and adjust some arguments, make all arguments to be validated values)
    
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored Jan 24, 2023
    aa7d5d9
  4. Fix setup-dev.py stdlib import shadowing (#31829)

    Python adds the current working directory as the first entry to `sys.path` . In some python versions, this will lead stdlib libraries to import modules in the working directory instead of other stdlib modules (python/cpython#101210). 
    
    In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI we don't want to just rename it without proper deprecation. Instead we fix the `sys.path` in `setup-dev.py` as this is the only location where the problem predictably comes up.
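One way to keep a directory from shadowing stdlib modules is to drop it from `sys.path` before importing anything else. The sketch below shows that idea only; it is not necessarily what `setup-dev.py` does, and `drop_dir_from_sys_path` is a hypothetical helper.

```python
import os
import sys
from typing import List, Optional


def drop_dir_from_sys_path(
    directory: str, path: Optional[List[str]] = None
) -> List[str]:
    """Remove every entry of `path` that resolves to `directory`."""
    path = sys.path if path is None else path
    target = os.path.realpath(directory)
    # An empty string in sys.path stands for the current working directory.
    kept = [p for p in path if os.path.realpath(p or os.getcwd()) != target]
    path[:] = kept  # mutate in place so sys.path itself is updated
    return path
```

Calling `drop_dir_from_sys_path(os.getcwd())` at the top of a script removes both the `""` entry and any explicit entry for the working directory.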
    
    Signed-off-by: Kai Fricke <[email protected]>
    krfricke authored Jan 24, 2023
    3e1cdeb
  5. [serve][docs] Document Gradio visualization (#28310)

    Documents the new visualization for Serve deployment graphs using Gradio.
    zcin authored Jan 24, 2023
    e7aabe8
  6. [autoscaler] Add flag to disable periodic cluster status log. (#31869)

    This PR adds an env flag to disable periodic autoscaler cluster status logging.
    
    Users may wish to disable these logs because:

    - The log entry is rather long and communicates cluster status rather than events in the cluster.
    - The log is multi-line, which does not interact well with some users' logging setups.
    - You can get the same info by running ray status.
    
    Signed-off-by: Dmitri Gekhtman <[email protected]>
    DmitriGekhtman authored Jan 24, 2023
    0cf060d
  7. 5682ef9
  8. [AIR] Remove Checkpoint._object_ref (#31777)

    #31763 removes Checkpoint.from_object_ref, but doesn't remove the now-unused Checkpoint._object_ref. This PR cleans up the dead code.
    
    Signed-off-by: amogkam <[email protected]>
    Signed-off-by: Balaji Veeramani <[email protected]>
    Co-authored-by: amogkam <[email protected]>
    bveeramani and amogkam authored Jan 24, 2023
    e69fad4
  9. [RLlib] Document RLlib fault tolerance and elastic training behaviors. (

    #31662)
    
    Signed-off-by: Jun Gong <[email protected]>
    Co-authored-by: Richard Liaw <[email protected]>
    Jun Gong and richardliaw authored Jan 24, 2023
    0a379af
  10. [State API] Fix a broken Window test (#31889)

    These unit tests do not run in the Python CI because some testing APIs are not available in Python < 3.8, so the Windows failure was an actual unit test failure. However, it was only a test issue, not a real code issue. I fixed the test failure here.
    Signed-off-by: SangBin Cho <[email protected]>
    rkooo567 authored Jan 24, 2023
    51f2d93
  11. [serve][python3.11] Use tasks/futures for asyncio.wait (#31608)

    In Python 3.11, coroutines are no longer allowed in `asyncio.wait`. Need to pass in a task instead.
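A minimal sketch of the pattern, independent of Serve's code: each coroutine is wrapped with `asyncio.create_task` before being handed to `asyncio.wait`. The `work` coroutine is a made-up example.

```python
import asyncio
from typing import List


async def work(n: int) -> int:
    await asyncio.sleep(0)  # yield to the event loop once
    return n * 2


async def main() -> List[int]:
    # Wrap each coroutine in a Task; asyncio.wait rejects bare
    # coroutine objects on Python 3.11 and later.
    tasks = [asyncio.create_task(work(n)) for n in range(3)]
    done, pending = await asyncio.wait(tasks)
    assert not pending
    return sorted(task.result() for task in done)


results = asyncio.run(main())
```

The same code also runs on older Python versions, so the change does not need a version check.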
    zcin authored Jan 24, 2023
    8c88bdc
  12. [wandb] Have train_loop_config logged as a config. (#31901)

    Have train_loop_config logged as a config in wandb UI.
    
    This is done by surfacing Trainer._param_dict into Tuner.param_space. Currently, only done for train_loop_config.
    
    Signed-off-by: xwjiang2010 <[email protected]>
    xwjiang2010 authored Jan 24, 2023
    50c0395
  13. 73c60fd
  14. [Tune] Enable experiment restore from moved cloud uri (#31669)

    This PR fixes experiment restoration from a different cloud URI so that future results and checkpoints are saved to the new URI instead of continuing to be written to the old location. The workflow of starting a local experiment, uploading the experiment dir to the cloud, and then restoring from that URI on a different cluster is also possible now.
    
    Signed-off-by: Justin Yu <[email protected]>
    justinvyu authored Jan 24, 2023
    47f4da3
  15. [release] fix pytorch pbt failure test. (#31791)

    The regression is introduced by #30705.
    
    Also added some documentation into TorchTrainer so users know there is quite some magic happening :)
    
    Tested manually in workspace.
    Follow-up PR to add more strict assertions to the test.
    
    Signed-off-by: xwjiang2010 <[email protected]>
    xwjiang2010 authored Jan 24, 2023
    b1c261c
  16. [State API] Improve task api (#31847)

    Signed-off-by: SangBin Cho <[email protected]>
    
    Add worker id & pg id to the task state
    Add pg id to the actor state
    Add start / end time for worker state
    Add start / end time for node state
    rkooo567 authored Jan 24, 2023
    c464207

Commits on Jan 25, 2023

  1. [WIP] Bulk executor initial implementation (#30903)

    Initial implementation of ray-project/enhancements#18
    
    Original prototype: https://github.com/ray-project/ray/pull/30222/files
    
    Co-authored-by: Clark Zinzow <[email protected]>
    Co-authored-by: jianoaix <[email protected]>
    3 people authored Jan 25, 2023
    877770e
  2. [RLlib] [Doc ]fix the broken Tune reference (#31918)

    Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
    kouroshHakha authored Jan 25, 2023
    dbb3a5f
  3. 4cb0081
  4. c85e453
  5. 7018e1e
  6. 220b7cc
  7. [RLlib] Clean up some algorithm.py config dict uses (replace by confi…

    …g objects) AND erroneous log messages. (#31854)
    sven1977 authored Jan 25, 2023
    477910b
  8. [core] Add the function to cleanup Redis backend. (#31782)

    This feature is a helper tool to clean up the Redis storage. It's meant to be a solution for cleaning up the old data stored in Redis by Ray until the GCS storage backend work is done.

    This feature supports Redis in both cluster and non-cluster mode and is for Redis cleanup only. The feature is built with Cython, so no external libraries are needed.

    Since _raylet.so depends on redis_client implicitly, the size of the ray package changes.
    fishbone authored Jan 25, 2023
    41e1685
  9. 6aec692
  10. [core] Deflakey gcs server rpc test (#31919)

    The test failed because the current Ray Redis client has a global variable that uses the io context. The io context is freed during destruction, so destructing the global variable afterwards causes issues.

    The current Redis client is awful and ugly. Since we'll move to redis-plus-plus, and this client doesn't cause other issues, I'll just fix the test case.
    fishbone authored Jan 25, 2023
    b7d6f2f
  11. [core] Migrate many_nodes_actor_tests to new cloud. (#31863)

    This PR makes the test run with the new cloud to prevent regression.
    fishbone authored Jan 25, 2023
    d9dd326
  12. [Serve] Upgrade deprecated calls (#31839)

    Serve uses `ray.get_runtime_context().job_id` and `ray.get_runtime_context().node_id`, which are deprecated. This raises long `RayDeprecationWarning` in the Serve CLI:
    
    ```console
    $ serve run example:graph
    2023-01-21 17:19:08,723	INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
    ...
    2023-01-21 17:19:15,149	INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
    (ServeController pid=49525) INFO 2023-01-21 17:19:15,949 controller 49525 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' on node '79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' listening on '127.0.0.1:8000'
    (HTTPProxyActor pid=49530) INFO:     Started server process [49530]
    /Users/shrekris/Desktop/ray/python/ray/serve/_private/client.py:487: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    Use get_job_id() instead
      "deployer_job_id": ray.get_runtime_context().job_id,
    (ServeController pid=49525) INFO 2023-01-21 17:19:16,789 controller 49525 deployment_state.py:1311 - Adding 1 replica to deployment 'Pinger'.
    (HTTPProxyActor pid=49530) /Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    (HTTPProxyActor pid=49530) Use get_job_id() instead
    (HTTPProxyActor pid=49530)   "deployer_job_id": ray.get_runtime_context().job_id,
    (ServeReplica:Pinger pid=49534) Changing target URL from "" to "localhost:8000"
    (ServeReplica:Pinger pid=49534) /Users/shrekris/Desktop/ray/python/ray/serve/_private/replica.py:215: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    (ServeReplica:Pinger pid=49534) Use get_node_id() instead
    (ServeReplica:Pinger pid=49534)   return ray.get_runtime_context().node_id
    /Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
    Use get_job_id() instead
      "deployer_job_id": ray.get_runtime_context().job_id,
    2023-01-21 17:19:17,769	SUCC <string>:93 -- Deployed Serve app successfully.
    ```
    
    This change makes Serve use `get_job_id()` and `get_node_id()` as recommended. It also updates Serve internals to always treat the `job_id` and `node_id` as a string. This removes the warnings:
    
    ```console
    $ serve run example:graph
    2023-01-21 17:35:04,901	INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
    2023-01-21 17:35:11,193	INFO <string>:62 -- Deploying from import path: example:graph.
    2023-01-21 17:35:11,207	INFO worker.py:1244 -- Using address 127.0.0.1:63563 set in the environment variable RAY_ADDRESS
    2023-01-21 17:35:11,208	INFO worker.py:1366 -- Connecting to existing Ray cluster at address: 127.0.0.1:63563...
    2023-01-21 17:35:11,212	INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
    (ServeController pid=54348) INFO 2023-01-21 17:35:12,016 controller 54348 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' on node '4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' listening on '127.0.0.1:8000'
    (HTTPProxyActor pid=54352) INFO:     Started server process [54352]
    (ServeController pid=54348) INFO 2023-01-21 17:35:12,853 controller 54348 deployment_state.py:1311 - Adding 1 replica to deployment 'Example'.
    2023-01-21 17:35:13,834	SUCC <string>:93 -- Deployed Serve app successfully.
    ```
    shrekris-anyscale authored Jan 25, 2023
    455100b
  13. [data] New executor [9/n]--- enforce resource limits during streaming…

    … execution (#31722)
    
    This implements resource limits for the new streaming executor backend. Resource limits are required to enable true streaming execution; without them, it degrades to memory-inefficient bulk execution.

    Resource limits are implemented as follows:

    - Each operator has methods to report base, current, and incremental resource usage. The current and incremental resource usage can be dynamic depending on the state of the operator (e.g., actor pool state).
    - The streaming executor queries the current resource usage and, based on that, determines which operators it is safe to dispatch new tasks for.
    - By default, resource limits are autodetected based on the current cluster size (and updated as the cluster potentially autoscales).

    The edge cases here are around liveness and avoiding starvation:

    - To ensure liveness, the streaming executor allows at least one task to run, regardless of the current resource usage.
    - To avoid starvation, the streaming executor only allows tasks to require CPU or GPU, not both. It ignores the scale of resource requests, i.e., it treats them as either 1 or 0. This ensures operators don't get starved due to the shape of their resource requests.

    Note that AllToAllOperators are currently out of scope. They return zero base/current/incremental resource usage, and hence are unmanaged.
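The dispatch decision with the liveness rule can be sketched as below. This is a toy model, not the streaming executor's implementation: `Usage` and `can_dispatch` are hypothetical names, only CPU and GPU are tracked, and the starvation rule is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    cpu: float = 0.0
    gpu: float = 0.0


def can_dispatch(
    current: Usage, incremental: Usage, limit: Usage, running_tasks: int
) -> bool:
    """Decide whether one more task may be dispatched."""
    # Liveness: always allow at least one task, whatever the usage is.
    if running_tasks == 0:
        return True
    # Otherwise the task's incremental usage must fit under the limit.
    return (
        current.cpu + incremental.cpu <= limit.cpu
        and current.gpu + incremental.gpu <= limit.gpu
    )
```

With nothing running, a task is dispatched even if the reported usage already exceeds the limit, which prevents the pipeline from stalling.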
    ericl authored Jan 25, 2023
    3e310f3
  14. [Data] Set the num actor to 10 for xgboost batch prediction benchmark (

    …#31914)
    
    Autoscaling actor pool is not supported in new execution backend yet: #31723
    We temporarily set the actor pool size to 10 (same as the num workers) to unbreak the tests.
    
    Signed-off-by: jianoaix <[email protected]>
    jianoaix authored Jan 25, 2023
    fe65c3e