forked from ray-project/ray
update #4
Merged
Conversation
…rs and reenable test (#31838) run_function_on_all_workers requires a job_id to run properly. After #30883, the worker might not have a job_id at startup, which caused run_function_on_all_workers to fail to execute on startup. To fix this, we defer the import_thread startup until the job_config is initialized.
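A minimal sketch of this deferral pattern (illustrative only, not Ray's actual worker internals; the class and method names below are hypothetical):

```python
import threading


class Worker:
    def __init__(self):
        self.job_config = None
        self._import_thread = None

    def set_job_config(self, job_config):
        # Start the background import thread only once the job_config (and
        # therefore the job_id) is known, so imports that depend on job_id
        # can never run too early.
        self.job_config = job_config
        if self._import_thread is None:
            self._import_thread = threading.Thread(
                target=self._import_task_and_actor_definitions, daemon=True
            )
            self._import_thread.start()

    def _import_task_and_actor_definitions(self):
        # Would deserialize and import exported task/actor definitions here.
        pass
```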
… guide. (#31307) Fixes an outdated K8s configuration reference in the large cluster deployment guide. Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Alan Guo <[email protected]> Update to more closely match the design spec.
…False. (#31666) Signed-off-by: SangBin Cho <[email protected]> Currently, when include_dashboard is False, there are two issues: the output of ray.init and ray start still prints the dashboard URL, and although all dashboard modules are properly disabled, the HTTP server is still started (meaning users can still reach a UI that doesn't work). This PR fixes both issues.
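A short usage example of the affected flag (the exact console output depends on the Ray version):

```python
import ray

# With include_dashboard=False, the fix ensures that ray.init() / `ray start`
# no longer print a dashboard URL and no longer start the dashboard HTTP
# server, instead of merely disabling the dashboard modules.
ray.init(include_dashboard=False)
```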
…#31846) Why are these changes needed? Previously the import thread started as soon as the worker started and imported the task/actor definitions. After #31838, it is deferred until the first task is sent, which means a longer delay before the first execution. To address this, we can opportunistically start the import thread as soon as it is created, if the job_id already exists.
…ist tasks (#31776) This PR renames scheduling_state -> state, which allows us to render state counts from the frontend (and is also consistent with other schemas). It also adds duration to the frontend, adds profile / regular events to the task state API, and supports source-side filtering by job id. Remaining work for follow-ups: replace the timeline implementation to use the task API, implement the timeline frontend, and display events / profile events from the dashboard/state API in a better format.
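A hedged sketch of the state API surface this touches (the import path and filter syntax belong to Ray's experimental state API and may differ across versions; the job id below is a placeholder):

```python
from ray.experimental.state.api import list_tasks

# "state" (renamed from "scheduling_state") and source-side job_id filtering
# are the pieces this change touches.
tasks = list_tasks(filters=[("job_id", "=", "01000000")])
for task in tasks:
    print(task["task_id"], task["state"])
```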
Signed-off-by: Chong-Li <[email protected]> This PR tries to finalize the gcs actor scheduler, with the following changes: Similar to the legacy scheduler, gcs now prefers the actor owner's node; this is usually required by RL cases for better colocation. Normal task workers report zero CPU resources (instead of the allocated amount) when they are blocked. Similar to the legacy scheduler, gcs now schedules empty-resource actors randomly. A new release test is added: multiple masters/drivers creating (slave) actors concurrently. This case exposes the difference between centralized (gcs-based, fewer scheduling conflicts) and distributed schedulers. The feature flag is temporarily turned on when going through the CI pipeline; we still need another dedicated PR to turn it on by default.
The code has been migrated to StoreClientKV in an earlier PR and the old code is no longer useful. This PR deletes the old code.
…work (#31825) Part of enabling the new bulk execution backend: #30903. Based on the most recent test run (https://buildkite.com/ray-project/oss-ci-build-pr/builds/9947#_), this should be the last issue to fix. (Note that the Dataset test failure is not real: all tests pass, and the failure is an issue with bazel test.)
…tches_benchmark_single_node test (#31864) The release test iter_tensor_batches_benchmark_single_node failed its most recent run due to the same issue discussed/addressed in #31752 and #31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates the one remaining test to match this convention. Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: rickyyx <[email protected]> We added the job id to the log prefix in #31772, breaking this test. This fixes the test to reflect the change.
Signed-off-by: Bukic Tomislav, AVL <[email protected]> Added some guidance about using existing Grafana instances to the monitoring documentation, as suggested on the Slack channel. This is a fixed version of PR #31633.
Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but the NevergradSearch class currently provides no way to pass these parameters down to the search algorithm. This adds an optimizer_kwargs argument to NevergradSearch whose contents are passed to the optimizer when it is instantiated. Signed-off-by: yhna <[email protected]> Signed-off-by: YH <[email protected]> Signed-off-by: Younghwan Na <[email protected]> Signed-off-by: yhna940 <[email protected]> Co-authored-by: Justin Yu <[email protected]>
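A hedged sketch of the proposed usage (the `optimizer_kwargs` argument is the one added by this change; the exact signature may differ by Ray version):

```python
import nevergrad as ng
from ray.tune.search.nevergrad import NevergradSearch

# optimizer_kwargs is forwarded to the Nevergrad optimizer class when
# NevergradSearch instantiates it, which lets required inputs like NgOpt's
# `budget` be supplied.
searcher = NevergradSearch(
    optimizer=ng.optimizers.NgOpt,
    optimizer_kwargs={"budget": 100},
    metric="loss",
    mode="min",
)
```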
…shed tasks (#31761) This PR handles two edge cases when marking tasks as failed: (1) when a job finishes, tasks that are still running should be marked as failed; (2) a task's finished or failed timestamp should not be overridden when an ancestor failed. For (1), it adds a handler function OnJobFinished as a job-finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished marks any non-terminated tasks as failed. For (2), it adds an ancestor_failed_ts field to track the ancestor failure time in the task tree. This extra bit of information is necessary because we should not override any already failed or finished child task's timestamps, but we still need to know whether a task subtree has already been traversed (with all non-terminated children marked as failed) without re-traversing it. When a new task event is added and the task fails or its ancestor failed, its failed_ts or ancestor_failed_ts is set and we traverse into the child task tree. During the traversal, if a task already has failed_ts or ancestor_failed_ts set, its children must have been traversed when that timestamp was set, so the traversal can stop there.
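An illustrative Python sketch of the traversal rule described above (the real logic lives in the GCS task manager in C++; field and function names here are simplified):

```python
def mark_subtree_failed(task, failed_ts):
    """Propagate a job/ancestor failure down a task tree without overriding
    timestamps of children that already finished or failed."""
    if task.failed_ts is not None or task.ancestor_failed_ts is not None:
        # A set timestamp means this subtree was already traversed when the
        # timestamp was recorded, so we can stop here.
        return
    if task.finished_ts is None:
        # Only non-terminated tasks get the ancestor failure timestamp.
        task.ancestor_failed_ts = failed_ts
    for child in task.children:
        mark_subtree_failed(child, failed_ts)
```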
Signed-off-by: amogkam [email protected] Closes #30333. Previously, we set a default NCCL interface whitelist in Ray Train to prioritize ethernet, to avoid this issue: anyscale/product#8310. However, this default whitelist is not fully exhaustive and prevents users from doing distributed GPU training over wireless: #30333. Instead, we change to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!). Signed-off-by: amogkam <[email protected]>
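For illustration, the difference between the two approaches in terms of NCCL's interface selection variable (the exact value Ray Train sets is not shown here and may differ; the whitelist value is a placeholder):

```python
import os

# Old approach (whitelist): only consider ethernet-style interfaces.
# os.environ["NCCL_SOCKET_IFNAME"] = "ens,eth,ib"

# New approach (blacklist): the leading "^" tells NCCL to use any interface
# NOT matching the list, so virtual veth interfaces are excluded while
# wireless interfaces remain usable.
os.environ["NCCL_SOCKET_IFNAME"] = "^veth"
```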
We test grpcio pre-releases to catch regressions early. We found that the latest pre-release is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing grpcio pre-releases until it's fixed.
- Ray on Spark creates the Spark job using stage-level scheduling, so the Ray cluster Spark job can use different task resource configs (spark.task.cpus / spark.task.resource.gpu.amount); otherwise it would have to use application-level Spark config, which is inconvenient on Databricks. Two new arguments are added: num_cpus_per_node and num_gpus_per_node.
- Improve the Ray worker memory allocation computation.
- Refactor the _init_ray_cluster interface to better fit instrumentation logging patching (make arguments keyword-only, adjust some arguments, and validate all argument values).
Signed-off-by: Weichen Xu <[email protected]>
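A hedged usage sketch of the new arguments (the public entry point and its other parameters may differ across Ray versions):

```python
from ray.util.spark import setup_ray_cluster

# num_cpus_per_node / num_gpus_per_node are the two new arguments; they let
# the Ray-on-Spark job request per-task resources via stage-level scheduling
# instead of relying on application-level spark.task.* settings.
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_per_node=4,
    num_gpus_per_node=1,
)
```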
Python adds the current working directory as the first entry to `sys.path`. In some Python versions, this can lead stdlib modules to import modules from the working directory instead of other stdlib modules (python/cpython#101210). In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI, we don't want to just rename it without proper deprecation. Instead, we fix `sys.path` in `setup-dev.py`, as this is the only location where the problem predictably comes up. Signed-off-by: Kai Fricke <[email protected]>
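A minimal sketch of the kind of sys.path fix applied in setup-dev.py (assumed, not the exact code from that script):

```python
import os
import sys

# Drop the current working directory from sys.path so that a local `types.py`
# cannot shadow the stdlib `types` module during imports.
cwd = os.getcwd()
sys.path = [p for p in sys.path if os.path.abspath(p) != cwd]
```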
Documents the new visualization for Serve deployment graphs using Gradio.
This PR adds an env flag to disable periodic autoscaler cluster status logging. Users may wish to disable these logs because the log entry is rather long and communicates cluster status rather than events in the cluster, and because it is multi-line, which does not interact well with some users' logging setups. The same info is available by running ray status. Signed-off-by: Dmitri Gekhtman <[email protected]>
#31763 removes Checkpoint.from_object_ref, but doesn't remove the now-unused Checkpoint._object_ref. This PR cleans up the dead code. Signed-off-by: amogkam <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: amogkam <[email protected]>
#31662) Signed-off-by: Jun Gong <[email protected]> Co-authored-by: Richard Liaw <[email protected]>
Those unit tests are not run in Python CI because some testing APIs are not available in Python < 3.8, so the Windows failure was an actual failure from the unit tests. However, it was just a testing issue, not a real code issue. This fixes the test failure. Signed-off-by: SangBin Cho <[email protected]>
In Python 3.11, coroutines are no longer allowed in `asyncio.wait`; a task must be passed instead.
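A minimal illustration of the required change:

```python
import asyncio


async def work():
    await asyncio.sleep(0.1)


async def main():
    # In Python 3.11 this raises a TypeError:
    #     done, pending = await asyncio.wait([work()])
    # Wrap the coroutine in a task instead:
    done, pending = await asyncio.wait([asyncio.create_task(work())])


asyncio.run(main())
```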
Log train_loop_config as a config in the wandb UI. This is done by surfacing Trainer._param_dict into Tuner.param_space. Currently this is only done for train_loop_config. Signed-off-by: xwjiang2010 <[email protected]>
#31891) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
This PR fixes experiment restoration from a different cloud URI so that future results and checkpoints are saved to the new URI instead of continuing to write to the old location. The workflow of starting a local experiment, uploading the experiment dir to cloud storage, and then restoring from that URI on a different cluster is now also possible. Signed-off-by: Justin Yu <[email protected]>
The regression was introduced by #30705. Also added some documentation to TorchTrainer so users know there is quite some magic happening :) Tested manually in a workspace. A follow-up PR will add stricter assertions to the test. Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: SangBin Cho <[email protected]> Add worker id & pg id to the task state; add pg id to the actor state; add start / end time to the worker state; add start / end time to the node state.
Initial implementation of ray-project/enhancements#18 Original prototype: https://github.com/ray-project/ray/pull/30222/files Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: jianoaix <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…g objects) AND erroneous log messages. (#31854)
This feature is a helper tool to clean up Redis storage. It's meant as a solution for cleaning up old data stored in Redis by Ray until the work on the GCS storage backend is done. It supports both Redis cluster and non-cluster modes and is for Redis cleanup only. The feature is built with Cython, so no external libraries are needed. Since _raylet.so depends on redis_client implicitly, there is a size change in the Ray package.
The test failed because the current Ray Redis client has a global variable that uses an io context; the io context is freed during destruction, so destructing the global variable causes issues. The current Redis client is awful and ugly. Since we'll move to redis-plus-plus and this client doesn't cause other issues, I'll just fix the test case.
This PR makes the test run with the new cloud to prevent regressions.
Serve uses `ray.get_runtime_context().job_id` and `ray.get_runtime_context().node_id`, which are deprecated. This raises long `RayDeprecationWarning` messages in the Serve CLI:

```console
$ serve run example:graph
2023-01-21 17:19:08,723 INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
...
2023-01-21 17:19:15,149 INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(ServeController pid=49525) INFO 2023-01-21 17:19:15,949 controller 49525 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' on node '79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=49530) INFO: Started server process [49530]
/Users/shrekris/Desktop/ray/python/ray/serve/_private/client.py:487: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
Use get_job_id() instead
"deployer_job_id": ray.get_runtime_context().job_id,
(ServeController pid=49525) INFO 2023-01-21 17:19:16,789 controller 49525 deployment_state.py:1311 - Adding 1 replica to deployment 'Pinger'.
(HTTPProxyActor pid=49530) /Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(HTTPProxyActor pid=49530) Use get_job_id() instead
(HTTPProxyActor pid=49530) "deployer_job_id": ray.get_runtime_context().job_id,
(ServeReplica:Pinger pid=49534) Changing target URL from "" to "localhost:8000"
(ServeReplica:Pinger pid=49534) /Users/shrekris/Desktop/ray/python/ray/serve/_private/replica.py:215: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(ServeReplica:Pinger pid=49534) Use get_node_id() instead
(ServeReplica:Pinger pid=49534) return ray.get_runtime_context().node_id
/Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
Use get_job_id() instead
"deployer_job_id": ray.get_runtime_context().job_id,
2023-01-21 17:19:17,769 SUCC <string>:93 -- Deployed Serve app successfully.
```

This change makes Serve use `get_job_id()` and `get_node_id()` as recommended. It also updates Serve internals to always treat `job_id` and `node_id` as strings. This removes the warnings:

```console
$ serve run example:graph
2023-01-21 17:35:04,901 INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-01-21 17:35:11,193 INFO <string>:62 -- Deploying from import path: example:graph.
2023-01-21 17:35:11,207 INFO worker.py:1244 -- Using address 127.0.0.1:63563 set in the environment variable RAY_ADDRESS
2023-01-21 17:35:11,208 INFO worker.py:1366 -- Connecting to existing Ray cluster at address: 127.0.0.1:63563...
2023-01-21 17:35:11,212 INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(ServeController pid=54348) INFO 2023-01-21 17:35:12,016 controller 54348 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' on node '4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=54352) INFO: Started server process [54352]
(ServeController pid=54348) INFO 2023-01-21 17:35:12,853 controller 54348 deployment_state.py:1311 - Adding 1 replica to deployment 'Example'.
2023-01-21 17:35:13,834 SUCC <string>:93 -- Deployed Serve app successfully.
```
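The non-deprecated accessors Serve now uses, both of which return strings:

```python
import ray

ray.init()
ctx = ray.get_runtime_context()
job_id = ctx.get_job_id()    # replaces the deprecated ctx.job_id
node_id = ctx.get_node_id()  # replaces the deprecated ctx.node_id
```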
… execution (#31722) This implements resource limits for the new streaming executor backend. Resource limits are required to enable true streaming execution, i.e., otherwise it degrades to memory-inefficient bulk execution. Resource limits are implemented as follows: Each operator has methods to report a base, current, and incremental resource usage. The current and incremental resource usage can be dynamic depending on the state of the operator (e.g., actor pool state). The streaming executor queries the current resource usage and determines based on that which operators it is safe to dispatch new tasks for. By default, resource limits are autodetected based on the current cluster size (and updated as the cluster potentially autoscales). The edge cases here are around liveness and avoiding starvation. To ensure liveness, the streaming executor allows at least one task to run, regardless of the current resource usage. To avoid starvation, the streaming executor only allows tasks to require CPU or GPU, not both. It ignores the scale of resource requests, i.e., treating them as either 1 or 0. This ensures operators don't get starved due to the shape of their resource requests. Note that AllToAllOperators are currently out of scope. They return zero base/current/incremental resource usage, and hence are unmanaged.
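An illustrative sketch of the dispatch rule described above (not the actual executor code; the operator interface and names are simplified):

```python
def can_dispatch(op, current_usage, limits, num_active_tasks):
    """Decide whether the streaming executor may launch another task for op."""
    if num_active_tasks == 0:
        # Liveness: at least one task is always allowed to run.
        return True
    inc = op.incremental_resource_usage()
    # The scale of requests is ignored: each request counts as 0 or 1 per
    # resource kind, and a task may require CPU or GPU but not both, so
    # operators are not starved by the shape of their requests.
    cpu_req = 1 if inc.cpu > 0 else 0
    gpu_req = 1 if inc.gpu > 0 else 0
    return (current_usage.cpu + cpu_req <= limits.cpu
            and current_usage.gpu + gpu_req <= limits.gpu)
```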
…#31914) The autoscaling actor pool is not supported in the new execution backend yet: #31723. We temporarily set the actor pool size to 10 (the same as the number of workers) to unbreak the tests. Signed-off-by: jianoaix <[email protected]>
jcoffi pushed a commit that referenced this pull request on Mar 19, 2023.
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.