forked from ray-project/ray
update #4
Merged
Conversation
…rs and reenable test (#31838) run_function_on_all_workers requires a job_id to run properly. After #30883, the worker might not have a job_id at startup, which caused run_function_on_all_workers to fail to execute on startup. To fix this, we defer the import_thread startup until the job_config is initialized.
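A minimal sketch of this deferral pattern (illustrative only, not Ray's actual worker internals; the class and method names below are hypothetical):

```python
import threading


class Worker:
    def __init__(self):
        self.job_config = None
        self._import_thread = None

    def set_job_config(self, job_config):
        # Start the background import thread only once the job_config (and
        # therefore the job_id) is known, so imports that depend on job_id
        # can never run too early.
        self.job_config = job_config
        if self._import_thread is None:
            self._import_thread = threading.Thread(
                target=self._import_task_and_actor_definitions, daemon=True
            )
            self._import_thread.start()

    def _import_task_and_actor_definitions(self):
        # Would deserialize and import exported task/actor definitions here.
        pass
```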
… guide. (#31307) Fixes an outdated K8s configuration reference in the large cluster deployment guide. Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Alan Guo <[email protected]> Update to more closely match the design spec.
…False. (#31666) Signed-off-by: SangBin Cho <[email protected]> Currently, when include_dashboard is False, there are two issues: the output of ray.init and ray start still prints the dashboard URL, and although all dashboard modules are properly disabled, the HTTP server is still started (meaning users can still reach a UI that doesn't work). This PR fixes both issues.
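A short usage example of the affected flag (the exact console output depends on the Ray version):

```python
import ray

# With include_dashboard=False, the fix ensures that ray.init() / `ray start`
# no longer print a dashboard URL and no longer start the dashboard HTTP
# server, instead of merely disabling the dashboard modules.
ray.init(include_dashboard=False)
```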
…#31846) Why are these changes needed? Previously the import thread started as soon as the worker started and imported the task/actor definitions. After #31838, it is deferred until the first task is sent, which means a longer delay before the first execution. To address this, we can opportunistically start the import thread as soon as it is created, if the job_id already exists.
…ist tasks (#31776) This PR renames scheduling_state -> state, which allows us to render state counts from the frontend (and is also consistent with other schemas). It also adds duration to the frontend, adds profile / regular events to the task state API, and supports source-side filtering by job id. Remaining work for follow-ups: replace the timeline implementation to use the task API, implement the timeline frontend, and display events / profile events from the dashboard/state API in a better format.
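A hedged sketch of the state API surface this touches (the import path and filter syntax belong to Ray's experimental state API and may differ across versions; the job id below is a placeholder):

```python
from ray.experimental.state.api import list_tasks

# "state" (renamed from "scheduling_state") and source-side job_id filtering
# are the pieces this change touches.
tasks = list_tasks(filters=[("job_id", "=", "01000000")])
for task in tasks:
    print(task["task_id"], task["state"])
```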
Signed-off-by: Chong-Li <[email protected]> This PR tries to finalize the gcs actor scheduler, with the following changes: Similar to the legacy scheduler, gcs now prefers the actor owner's node; this is usually required by RL cases for better colocation. Normal task workers report zero CPU resources (instead of the allocated amount) when they are blocked. Similar to the legacy scheduler, gcs now schedules empty-resource actors randomly. A new release test is added: multiple masters/drivers creating (slave) actors concurrently. This case exposes the difference between centralized (gcs-based, fewer scheduling conflicts) and distributed schedulers. The feature flag is temporarily turned on when going through the CI pipeline; we still need another dedicated PR to turn it on by default.
The code has been migrated to StoreClientKV in an earlier PR and the old code is no longer useful. This PR deletes the old code.
…work (#31825) Part of enabling the new bulk execution backend: #30903. Based on the most recent test run (https://buildkite.com/ray-project/oss-ci-build-pr/builds/9947#_), this should be the last issue to fix. (Note that the Dataset test failure is not real: all tests pass, and the failure is an issue with bazel test.)
…tches_benchmark_single_node test (#31864) The release test iter_tensor_batches_benchmark_single_node failed its most recent run due to the same issue discussed/addressed in #31752 and #31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates the one remaining test to match this convention. Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: rickyyx <[email protected]> We added the job id to the log prefix in #31772, breaking this test. This fixes the test to reflect the change.
Signed-off-by: Bukic Tomislav, AVL <[email protected]> Added some guidance about using existing Grafana instances to the monitoring documentation, as suggested on the Slack channel. This is a fixed version of PR #31633.
Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but the NevergradSearch class currently provides no way to pass these parameters down to the search algorithm. This adds an optimizer_kwargs argument to NevergradSearch whose contents are passed to the optimizer when it is instantiated. Signed-off-by: yhna <[email protected]> Signed-off-by: YH <[email protected]> Signed-off-by: Younghwan Na <[email protected]> Signed-off-by: yhna940 <[email protected]> Co-authored-by: Justin Yu <[email protected]>
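A hedged sketch of the proposed usage (the `optimizer_kwargs` argument is the one added by this change; the exact signature may differ by Ray version):

```python
import nevergrad as ng
from ray.tune.search.nevergrad import NevergradSearch

# optimizer_kwargs is forwarded to the Nevergrad optimizer class when
# NevergradSearch instantiates it, which lets required inputs like NgOpt's
# `budget` be supplied.
searcher = NevergradSearch(
    optimizer=ng.optimizers.NgOpt,
    optimizer_kwargs={"budget": 100},
    metric="loss",
    mode="min",
)
```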
…shed tasks (#31761) This PR handles two edge cases when marking tasks as failed: (1) when a job finishes, tasks that are still running should be marked as failed; (2) a task's finished or failed timestamp should not be overridden when an ancestor failed. For (1), it adds a handler function OnJobFinished as a job-finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished marks any non-terminated tasks as failed. For (2), it adds an ancestor_failed_ts field to track the ancestor failure time in the task tree. This extra bit of information is necessary because we should not override any already failed or finished child task's timestamps, but we still need to know whether a task subtree has already been traversed (with all non-terminated children marked as failed) without re-traversing it. When a new task event is added and the task fails or its ancestor failed, its failed_ts or ancestor_failed_ts is set and we traverse into the child task tree. During the traversal, if a task already has failed_ts or ancestor_failed_ts set, its children must have been traversed when that timestamp was set, so the traversal can stop there.
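An illustrative Python sketch of the traversal rule described above (the real logic lives in the GCS task manager in C++; field and function names here are simplified):

```python
def mark_subtree_failed(task, failed_ts):
    """Propagate a job/ancestor failure down a task tree without overriding
    timestamps of children that already finished or failed."""
    if task.failed_ts is not None or task.ancestor_failed_ts is not None:
        # A set timestamp means this subtree was already traversed when the
        # timestamp was recorded, so we can stop here.
        return
    if task.finished_ts is None:
        # Only non-terminated tasks get the ancestor failure timestamp.
        task.ancestor_failed_ts = failed_ts
    for child in task.children:
        mark_subtree_failed(child, failed_ts)
```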
Signed-off-by: amogkam [email protected] Closes #30333. Previously, we set a default NCCL interface whitelist in Ray Train to prioritize ethernet, to avoid this issue: anyscale/product#8310. However, this default whitelist is not fully exhaustive and prevents users from doing distributed GPU training over wireless: #30333. Instead, we change to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!). Signed-off-by: amogkam <[email protected]>
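For illustration, the difference between the two approaches in terms of NCCL's interface selection variable (the exact value Ray Train sets is not shown here and may differ; the whitelist value is a placeholder):

```python
import os

# Old approach (whitelist): only consider ethernet-style interfaces.
# os.environ["NCCL_SOCKET_IFNAME"] = "ens,eth,ib"

# New approach (blacklist): the leading "^" tells NCCL to use any interface
# NOT matching the list, so virtual veth interfaces are excluded while
# wireless interfaces remain usable.
os.environ["NCCL_SOCKET_IFNAME"] = "^veth"
```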
We test grpcio pre-releases to catch regressions early. We found that the latest pre-release is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing grpcio pre-releases until it's fixed.
- Ray on Spark creates the Spark job using stage-level scheduling, so the Ray cluster Spark job can use different task resource configs (spark.task.cpus / spark.task.resource.gpu.amount); otherwise it would have to use application-level Spark config, which is inconvenient on Databricks. Two new arguments are added: num_cpus_per_node and num_gpus_per_node.
- Improve the Ray worker memory allocation computation.
- Refactor the _init_ray_cluster interface to better fit instrumentation logging patching (make arguments keyword-only, adjust some arguments, and validate all argument values).
Signed-off-by: Weichen Xu <[email protected]>
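A hedged usage sketch of the new arguments (the public entry point and its other parameters may differ across Ray versions):

```python
from ray.util.spark import setup_ray_cluster

# num_cpus_per_node / num_gpus_per_node are the two new arguments; they let
# the Ray-on-Spark job request per-task resources via stage-level scheduling
# instead of relying on application-level spark.task.* settings.
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_per_node=4,
    num_gpus_per_node=1,
)
```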
Python adds the current working directory as the first entry to `sys.path`. In some Python versions, this can lead stdlib modules to import modules from the working directory instead of other stdlib modules (python/cpython#101210). In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI, we don't want to just rename it without proper deprecation. Instead, we fix `sys.path` in `setup-dev.py`, as this is the only location where the problem predictably comes up. Signed-off-by: Kai Fricke <[email protected]>
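A minimal sketch of the kind of sys.path fix applied in setup-dev.py (assumed, not the exact code from that script):

```python
import os
import sys

# Drop the current working directory from sys.path so that a local `types.py`
# cannot shadow the stdlib `types` module during imports.
cwd = os.getcwd()
sys.path = [p for p in sys.path if os.path.abspath(p) != cwd]
```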
Documents the new visualization for Serve deployment graphs using Gradio.
This PR adds an env flag to disable periodic autoscaler cluster status logging. Users may wish to disable these logs because the log entry is rather long and communicates cluster status rather than events in the cluster, and because it is multi-line, which does not interact well with some users' logging setups. The same info is available by running ray status. Signed-off-by: Dmitri Gekhtman <[email protected]>
#31763 removes Checkpoint.from_object_ref, but doesn't remove the now-unused Checkpoint._object_ref. This PR cleans up the dead code. Signed-off-by: amogkam <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: amogkam <[email protected]>
#31662) Signed-off-by: Jun Gong <[email protected]> Co-authored-by: Richard Liaw <[email protected]>
Those unit tests are not run in Python CI because some testing APIs are not available in Python < 3.8, so the Windows failure was an actual failure from the unit tests. However, it was just a testing issue, not a real code issue. This fixes the test failure. Signed-off-by: SangBin Cho <[email protected]>
In Python 3.11, coroutines are no longer allowed in `asyncio.wait`; a task must be passed instead.
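A minimal illustration of the required change:

```python
import asyncio


async def work():
    await asyncio.sleep(0.1)


async def main():
    # In Python 3.11 this raises a TypeError:
    #     done, pending = await asyncio.wait([work()])
    # Wrap the coroutine in a task instead:
    done, pending = await asyncio.wait([asyncio.create_task(work())])


asyncio.run(main())
```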
Log train_loop_config as a config in the wandb UI. This is done by surfacing Trainer._param_dict into Tuner.param_space. Currently this is only done for train_loop_config. Signed-off-by: xwjiang2010 <[email protected]>
#31891) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
This PR fixes experiment restoration from a different cloud URI so that future results and checkpoints are saved to the new URI instead of continuing to write to the old location. The workflow of starting a local experiment, uploading the experiment dir to cloud storage, and then restoring from that URI on a different cluster is now also possible. Signed-off-by: Justin Yu <[email protected]>
The regression was introduced by #30705. Also added some documentation to TorchTrainer so users know there is quite some magic happening :) Tested manually in a workspace. A follow-up PR will add stricter assertions to the test. Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: SangBin Cho <[email protected]> Add worker id & pg id to the task state; add pg id to the actor state; add start / end time to the worker state; add start / end time to the node state.
Initial implementation of ray-project/enhancements#18 Original prototype: https://github.com/ray-project/ray/pull/30222/files Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: jianoaix <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…g objects) AND erroneous log messages. (#31854)
This feature is a helper tool to clean up Redis storage. It's meant as a solution for cleaning up old data stored in Redis by Ray until the work on the GCS storage backend is done. It supports both Redis cluster and non-cluster modes and is for Redis cleanup only. The feature is built with Cython, so no external libraries are needed. Since _raylet.so depends on redis_client implicitly, there is a size change in the Ray package.
The test failed because the current Ray Redis client has a global variable that uses an io context; the io context is freed during destruction, so destructing the global variable causes issues. The current Redis client is awful and ugly. Since we'll move to redis-plus-plus and this client doesn't cause other issues, I'll just fix the test case.
This PR makes the test run with the new cloud to prevent regressions.
Serve uses `ray.get_runtime_context().job_id` and `ray.get_runtime_context().node_id`, which are deprecated. This raises long `RayDeprecationWarning` messages in the Serve CLI:

```console
$ serve run example:graph
2023-01-21 17:19:08,723 INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
...
2023-01-21 17:19:15,149 INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(ServeController pid=49525) INFO 2023-01-21 17:19:15,949 controller 49525 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' on node '79e5c05fd95270ecf67b0d43e9aef485109c6724cf07121b46e361c0' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=49530) INFO: Started server process [49530]
/Users/shrekris/Desktop/ray/python/ray/serve/_private/client.py:487: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
Use get_job_id() instead
"deployer_job_id": ray.get_runtime_context().job_id,
(ServeController pid=49525) INFO 2023-01-21 17:19:16,789 controller 49525 deployment_state.py:1311 - Adding 1 replica to deployment 'Pinger'.
(HTTPProxyActor pid=49530) /Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(HTTPProxyActor pid=49530) Use get_job_id() instead
(HTTPProxyActor pid=49530) "deployer_job_id": ray.get_runtime_context().job_id,
(ServeReplica:Pinger pid=49534) Changing target URL from "" to "localhost:8000"
(ServeReplica:Pinger pid=49534) /Users/shrekris/Desktop/ray/python/ray/serve/_private/replica.py:215: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(ServeReplica:Pinger pid=49534) Use get_node_id() instead
(ServeReplica:Pinger pid=49534) return ray.get_runtime_context().node_id
/Users/shrekris/Desktop/ray/python/ray/serve/_private/common.py:228: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
Use get_job_id() instead
"deployer_job_id": ray.get_runtime_context().job_id,
2023-01-21 17:19:17,769 SUCC <string>:93 -- Deployed Serve app successfully.
```

This change makes Serve use `get_job_id()` and `get_node_id()` as recommended. It also updates Serve internals to always treat `job_id` and `node_id` as strings. This removes the warnings:

```console
$ serve run example:graph
2023-01-21 17:35:04,901 INFO worker.py:1546 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-01-21 17:35:11,193 INFO <string>:62 -- Deploying from import path: example:graph.
2023-01-21 17:35:11,207 INFO worker.py:1244 -- Using address 127.0.0.1:63563 set in the environment variable RAY_ADDRESS
2023-01-21 17:35:11,208 INFO worker.py:1366 -- Connecting to existing Ray cluster at address: 127.0.0.1:63563...
2023-01-21 17:35:11,212 INFO worker.py:1546 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(ServeController pid=54348) INFO 2023-01-21 17:35:12,016 controller 54348 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' on node '4d2397f0efde53dda4d529889e28d7c94561a4f2331cc402007dda5f' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=54352) INFO: Started server process [54352]
(ServeController pid=54348) INFO 2023-01-21 17:35:12,853 controller 54348 deployment_state.py:1311 - Adding 1 replica to deployment 'Example'.
2023-01-21 17:35:13,834 SUCC <string>:93 -- Deployed Serve app successfully.
```
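The non-deprecated accessors Serve now uses, both of which return strings:

```python
import ray

ray.init()
ctx = ray.get_runtime_context()
job_id = ctx.get_job_id()    # replaces the deprecated ctx.job_id
node_id = ctx.get_node_id()  # replaces the deprecated ctx.node_id
```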
… execution (#31722) This implements resource limits for the new streaming executor backend. Resource limits are required to enable true streaming execution, i.e., otherwise it degrades to memory-inefficient bulk execution. Resource limits are implemented as follows: Each operator has methods to report a base, current, and incremental resource usage. The current and incremental resource usage can be dynamic depending on the state of the operator (e.g., actor pool state). The streaming executor queries the current resource usage and determines based on that which operators it is safe to dispatch new tasks for. By default, resource limits are autodetected based on the current cluster size (and updated as the cluster potentially autoscales). The edge cases here are around liveness and avoiding starvation. To ensure liveness, the streaming executor allows at least one task to run, regardless of the current resource usage. To avoid starvation, the streaming executor only allows tasks to require CPU or GPU, not both. It ignores the scale of resource requests, i.e., treating them as either 1 or 0. This ensures operators don't get starved due to the shape of their resource requests. Note that AllToAllOperators are currently out of scope. They return zero base/current/incremental resource usage, and hence are unmanaged.
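An illustrative sketch of the dispatch rule described above (not the actual executor code; the operator interface and names are simplified):

```python
def can_dispatch(op, current_usage, limits, num_active_tasks):
    """Decide whether the streaming executor may launch another task for op."""
    if num_active_tasks == 0:
        # Liveness: at least one task is always allowed to run.
        return True
    inc = op.incremental_resource_usage()
    # The scale of requests is ignored: each request counts as 0 or 1 per
    # resource kind, and a task may require CPU or GPU but not both, so
    # operators are not starved by the shape of their requests.
    cpu_req = 1 if inc.cpu > 0 else 0
    gpu_req = 1 if inc.gpu > 0 else 0
    return (current_usage.cpu + cpu_req <= limits.cpu
            and current_usage.gpu + gpu_req <= limits.gpu)
```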
…#31914) The autoscaling actor pool is not supported in the new execution backend yet: #31723. We temporarily set the actor pool size to 10 (the same as the number of workers) to unbreak the tests. Signed-off-by: jianoaix <[email protected]>
jcoffi pushed a commit that referenced this pull request on Mar 19, 2023.
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.