Make it possible to use Ray distributed debugger without setting RAY_…

…DEBUG (ray-project#48301)   ## Why are these changes needed? Currently in order to use the distributed debugger, the user has to set `RAY_DEBUG=1`. This has two disadvantages: 1. It is disruptive to the workflow and much more overhead than just adding the `breakpoint()` instruction and re-running the program (since the runtime environment has to be updated and the user needs to make sure that the driver uses the flag too e.g. by restarting the python kernel or in the worst case the container). 2. It is very easy to forget this step and then get the impression that the debugger is not working. There is no reason to require `RAY_DEBUG=1` to be set (the CLI debugger works without the flag too and in particular the flag has no impact on performance unless the debugger is actually entered). The reason this flag was originally introduced was as a feature flag to switch between the CLI debugger and the UI debugger. Now that the UI debugger is getting more mature, it is better to make it the default and let people who want to use the CLI debugger use a `RAY_DEBUG=legacy` flag. This PR also renames the `RAY_PDB` flag to `RAY_DEBUG_POST_MORTEM` and unifies the usage of the flag between the old and new debugger (in particular, with the new debugger, post mortem debugging is now off unless the user activates it). ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Philipp Moritz <[email protected]>
JP-sDEV · Nov 14, 2024 · 4ece73c · 4ece73c
1 parent 0ec1493
commit 4ece73c
Show file tree

Hide file tree

Showing 13 changed files with 113 additions and 137 deletions.
diff --git a/doc/source/ray-observability/doc_code/ray-distributed-debugger.py b/doc/source/ray-observability/doc_code/ray-distributed-debugger.py
@@ -1,10 +1,11 @@
 import ray
 import sys
 
-# Add RAY_DEBUG environment variable to enable Ray Debugger.
+# Add the RAY_DEBUG_POST_MORTEM=1 environment variable
+# if you want to activate post-mortem debugging
 ray.init(
     runtime_env={
-        "env_vars": {"RAY_DEBUG": "1"},
+        "env_vars": {"RAY_DEBUG_POST_MORTEM": "1"},
     }
 )
 

diff --git a/doc/source/ray-observability/ray-distributed-debugger.rst b/doc/source/ray-observability/ray-distributed-debugger.rst
@@ -1,3 +1,5 @@
+.. _ray-distributed-debugger:
+
 Ray Distributed Debugger
 ========================
 
@@ -48,7 +50,7 @@ Find and click the Ray extension in the VS Code left side nav. Add the Ray clust
 Create a Ray task
 ~~~~~~~~~~~~~~~~~
 
-Create a file `job.py` with the following snippet. Add the `RAY_DEBUG` environment variable to enable Ray Debugger and add `breakpoint()` in the Ray task.
+Create a file `job.py` with the following snippet. Add `breakpoint()` in the Ray task. If you want to use the post-mortem debugging below, also add the `RAY_DEBUG_POST_MORTEM=1` environment variable.
 
 .. literalinclude:: ./doc_code/ray-distributed-debugger.py
     :language: python

diff --git a/doc/source/ray-observability/user-guides/debug-apps/index.md b/doc/source/ray-observability/user-guides/debug-apps/index.md
@@ -10,6 +10,7 @@ debug-memory
 debug-hangs
 debug-failures
 optimize-performance
+../../ray-distributed-debugger
 ray-debugging
 ```
 
@@ -19,4 +20,5 @@ These guides help you perform common debugging or optimization tasks for your di
 * {ref}`observability-debug-hangs`
 * {ref}`observability-debug-failures`
 * {ref}`observability-optimize-performance`
-* {ref}`ray-debugger`
+* {ref}`ray-distributed-debugger`
+* {ref}`ray-debugger` (deprecated)
diff --git a/doc/source/ray-observability/user-guides/debug-apps/ray-debugging.rst b/doc/source/ray-observability/user-guides/debug-apps/ray-debugging.rst
@@ -14,22 +14,21 @@ drop into a PDB session that you can then use to:
 .. warning::
 
     The Ray Debugger is deprecated. Use the :doc:`Ray Distributed Debugger <../../ray-distributed-debugger>` instead.
+    Starting with Ray 2.39, the new debugger is the default and you need to set the environment variable `RAY_DEBUG=legacy` to
+    use the old debugger (e.g. by using a runtime environment).
 
 Getting Started
 ---------------
 
-.. note::
-
-    On Python 3.6, the ``breakpoint()`` function is not supported and you need to use
-    ``ray.util.pdb.set_trace()`` instead.
-
 Take the following example:
 
 .. testcode::
     :skipif: True
 
     import ray
 
+    ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}})
+
     @ray.remote
     def f(x):
         breakpoint()
@@ -118,6 +117,8 @@ following recursive function as an example:
 
     import ray
 
+    ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}})
+
     @ray.remote
     def fact(n):
         if n == 1:
@@ -222,104 +223,54 @@ Post Mortem Debugging
 Often we do not know in advance where an error happens, so we cannot set a breakpoint. In these cases,
 we can automatically drop into the debugger when an error occurs or an exception is thrown. This is called *post-mortem debugging*.
 
-We will show how this works using a Ray serve application. To get started, install the required dependencies:
-
-.. code-block:: bash
-
-    pip install "ray[serve]" scikit-learn
-
-Next, copy the following code into a file called ``serve_debugging.py``:
+Copy the following code into a file called ``post_mortem_debugging.py``. The flag ``RAY_DEBUG_POST_MORTEM=1`` will have the effect
+that if an exception happens, Ray will drop into the debugger instead of propagating it further.
 
 .. testcode::
     :skipif: True
 
-    import time
-
-    from sklearn.datasets import load_iris
-    from sklearn.ensemble import GradientBoostingClassifier
-
     import ray
-    from ray import serve
-
-    serve.start()
 
-    # Train model
-    iris_dataset = load_iris()
-    model = GradientBoostingClassifier()
-    model.fit(iris_dataset["data"], iris_dataset["target"])
+    ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy", "RAY_DEBUG_POST_MORTEM": "1"}})
 
-    # Define Ray Serve model,
-    @serve.deployment
-    class BoostingModel:
-        def __init__(self):
-            self.model = model
-            self.label_list = iris_dataset["target_names"].tolist()
-
-        async def __call__(self, starlette_request):
-            payload = (await starlette_request.json())["vector"]
-            print(f"Worker: received request with data: {payload}")
-
-            prediction = self.model.predict([payload])[0]
-            human_name = self.label_list[prediction]
-            return {"result": human_name}
-
-    # Deploy model
-    serve.run(BoostingModel.bind(), route_prefix="/iris")
-
-    time.sleep(3600.0)
-
-Let's start the program with the post-mortem debugging activated (``RAY_PDB=1``):
-
-.. code-block:: bash
+    @ray.remote
+    def post_mortem(x):
+        x += 1
+        raise Exception("An exception is raised.")
+        return x
 
-    RAY_PDB=1 python serve_debugging.py
+    ray.get(post_mortem.remote(10))
 
-The flag ``RAY_PDB=1`` will have the effect that if an exception happens, Ray will
-drop into the debugger instead of propagating it further. Let's see how this works!
-First query the model with an invalid request using
+Let's start the program:
 
 .. code-block:: bash
 
-    python -c 'import requests; response = requests.get("http://localhost:8000/iris", json={"vector": [1.2, 1.0, 1.1, "a"]})'
+    python post_mortem_debugging.py
 
-When the ``serve_debugging.py`` driver hits the breakpoint, it will tell you to run
-``ray debug``. After we do that, we see an output like the following:
+Now run ``ray debug``. After we do that, we see an output like the following:
 
 .. code-block:: text
 
     Active breakpoints:
-    index | timestamp           | Ray task                                     | filename:lineno
-    0     | 2021-07-13 23:49:14 | ray::RayServeWrappedReplica.handle_request() | /home/ubuntu/ray/python/ray/serve/backend_worker.py:249
+    index | timestamp           | Ray task                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | filename:lineno
+    0     | 2024-11-01 20:14:00 | /Users/pcmoritz/ray/python/ray/_private/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=49606 --object-store-name=/tmp/ray/session_2024-11-01_13-13-51_279910_8596/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-11-01_13-13-51_279910_8596/sockets/raylet --redis-address=None --metrics-agent-port=58655 --runtime-env-agent-port=56999 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --runtime-env-agent-port=56999 --gcs-address=127.0.0.1:6379 --session-name=session_2024-11-01_13-13-51_279910_8596 --temp-dir=/tmp/ray --webui=127.0.0.1:8265 --cluster-id=6d341469ae0f85b6c3819168dde27cceda12e95c8efdfc256e0fd8ce --startup-token=12 --worker-launch-time-ms=1730492039955 --node-id=0d43573a606286125da39767a52ce45ad101324c8af02cc25a9fbac7 --runtime-env-hash=-1746935720 | /Users/pcmoritz/ray/python/ray/_private/worker.py:920
     Traceback (most recent call last):
 
-      File "/home/ubuntu/ray/python/ray/serve/backend_worker.py", line 242, in invoke_single
-        result = await method_to_call(*args, **kwargs)
-
-      File "serve_debugging.py", line 24, in __call__
-        prediction = self.model.predict([payload])[0]
-
-      File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_gb.py", line 1188, in predict
-        raw_predictions = self.decision_function(X)
-
-      File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_gb.py", line 1143, in decision_function
-        X = check_array(X, dtype=DTYPE, order="C", accept_sparse='csr')
+    File "python/ray/_raylet.pyx", line 1856, in ray._raylet.execute_task
 
-      File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
-        return f(*args, **kwargs)
+    File "python/ray/_raylet.pyx", line 1957, in ray._raylet.execute_task
 
-      File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 673, in check_array
-        array = np.asarray(array, order=order, dtype=dtype)
+    File "python/ray/_raylet.pyx", line 1862, in ray._raylet.execute_task
 
-      File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 83, in asarray
-        return array(a, dtype, copy=False, order=order)
+    File "/Users/pcmoritz/ray-debugger-test/post_mortem_debugging.py", line 8, in post_mortem
+        raise Exception("An exception is raised.")
 
-    ValueError: could not convert string to float: 'a'
+    Exception: An exception is raised.
 
     Enter breakpoint index or press enter to refresh:
 
 We now press ``0`` and then Enter to enter the debugger. With ``ll`` we can see the context and with
-``print(a)`` we an print the array that causes the problem. As we see, it contains a string (``'a'``)
-instead of a number as the last element.
+``print(x)`` we an print the value of ``x``.
 
 In a similar manner as above, you can also debug Ray actors. Happy debugging!
 

diff --git a/python/ray/_raylet.pyx b/python/ray/_raylet.pyx
@@ -1026,7 +1026,7 @@ cdef store_task_errors(
         CoreWorker core_worker = worker.core_worker
 
     # If the debugger is enabled, drop into the remote pdb here.
-    if ray.util.pdb._is_ray_debugger_enabled():
+    if ray.util.pdb._is_ray_debugger_post_mortem_enabled():
         ray.util.pdb._post_mortem()
 
     backtrace = ray._private.utils.format_error_message(

diff --git a/python/ray/data/_internal/planner/plan_udf_map_op.py b/python/ray/data/_internal/planner/plan_udf_map_op.py
@@ -45,7 +45,7 @@
 )
 from ray.data.context import DataContext
 from ray.data.exceptions import UserCodeException
-from ray.util.rpdb import _is_ray_debugger_enabled
+from ray.util.rpdb import _is_ray_debugger_post_mortem_enabled
 
 
 class _MapActorContext:
@@ -205,7 +205,7 @@ def _handle_debugger_exception(e: Exception):
     so that the debugger can stop at the initial unhandled exception.
     Otherwise, clear the stack trace to omit noisy internal code path."""
     ctx = ray.data.DataContext.get_current()
-    if _is_ray_debugger_enabled() or ctx.raise_original_map_exception:
+    if _is_ray_debugger_post_mortem_enabled() or ctx.raise_original_map_exception:
         raise e
     else:
         raise UserCodeException() from e

diff --git a/python/ray/data/exceptions.py b/python/ray/data/exceptions.py
@@ -6,7 +6,7 @@
 from ray.exceptions import UserCodeException
 from ray.util import log_once
 from ray.util.annotations import DeveloperAPI
-from ray.util.rpdb import _is_ray_debugger_enabled
+from ray.util.rpdb import _is_ray_debugger_post_mortem_enabled
 
 logger = logging.getLogger(__name__)
 
@@ -52,7 +52,7 @@ def handle_trace(*args, **kwargs):
             # via DataContext, or when the Ray Debugger is enabled.
             # The full stack trace will always be emitted to the Ray Data log file.
             log_to_stdout = DataContext.get_current().log_internal_stack_trace_to_stdout
-            if _is_ray_debugger_enabled():
+            if _is_ray_debugger_post_mortem_enabled():
                 logger.exception("Full stack trace:")
                 raise e
 

diff --git a/python/ray/data/tests/test_exceptions.py b/python/ray/data/tests/test_exceptions.py
@@ -77,7 +77,7 @@ class FakeException(Exception):
 def test_full_traceback_logged_with_ray_debugger(
     caplog, propagate_logs, ray_start_regular_shared, monkeypatch
 ):
-    monkeypatch.setenv("RAY_PDB", 1)
+    monkeypatch.setenv("RAY_DEBUG_POST_MORTEM", 1)
 
     def f(row):
         1 / 0

diff --git a/python/ray/serve/_private/replica.py b/python/ray/serve/_private/replica.py
@@ -415,7 +415,7 @@ def _wrap_user_method_call(
         except Exception as e:
             user_exception = e
             logger.exception("Request failed.")
-            if ray.util.pdb._is_ray_debugger_enabled():
+            if ray.util.pdb._is_ray_debugger_post_mortem_enabled():
                 ray.util.pdb._post_mortem()
         finally:
             self._metrics_manager.dec_num_ongoing_requests()

diff --git a/python/ray/tests/test_output.py b/python/ray/tests/test_output.py
@@ -569,7 +569,8 @@ def test_disable_driver_logs_breakpoint():
 import sys
 import threading
 
-ray.init(num_cpus=2)
+os.environ["RAY_DEBUG"] = "legacy"
+ray.init(num_cpus=2, runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}})
 
 @ray.remote
 def f():
@@ -588,8 +589,7 @@ def kill():
 t.start()
 x = f.remote()
 time.sleep(2)  # Enough time to print one hello.
-ray.util.rpdb._driver_set_trace()  # This should disable worker logs.
-# breakpoint()  # Only works in Py3.7+
+breakpoint()  # This should disable worker logs.
     """
 
     proc = run_string_as_driver_nonblocking(script)