* Set up CI with Azure Pipelines. Specifically, we are setting up a Travis-like ADO pipeline following what is already present in the .travis.yml file in the root of the repo. * Separating Travis-like pipeline from main pipeline * Adding Jenkins jobs equivalent * Making some improvements * Adding validation of the upstream CI * Disabling Tune and large memory tests * Changing threshold for simple reservoir sampling test * Addressing comments * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with more travis updates * Updating CI with new cpp worker tests * Setting code owners * Fixing the version number generation * Making main pipeline also our release pipeline * Updating Azure Pipelines with travis updates * Fixing wheels test * Fixing codeowners * Updating Azure Pipelines with travis updates * Bumping up MACOSX_DEPLOYMENT_TARGET * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Disabling Serve tests * Making explicit which branches GitHubActions workflows should watch * Disabling Ray serve tests * Installing numpy explicitly * consolidating Ray test steps in one yml * Syncing with upstream master 2020-07-30 (#21) * [Core] Enhance common client connection (#9367) * enhance client connection * add write buffer async * read message * add test * Bazel move more shell to native rules (#9314) Co-authored-by: Mehrdad <[email protected]> * [tune] Fix github readme (#9365) Co-authored-by: Amog Kamsetty <[email protected]> * Combine different severities into the same log files (#9230) * Combine different severities into the same log files Co-authored-by: Mehrdad <[email protected]> * [core] Pass owner address from the workers to the raylet (#9299) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix
tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * fix tests * Fix tests * build * build * fix * Change assertion to warning to fix java * [Core] Add placement group scheduler and some api in resource scheduler (#9039) * Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). * change the bundle id and delete unit count in bundle change vector<bundle_spec> to vector<shared_ptr<bundle_spec>> Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). change the bundle id and delete unit count in bundle remove CheckIfSchedulable() add comments and fix the bug in resource * fix placement group schedule * add placement group scheduler and change some api in resource scheduler * fix by the comments * fix conflict * fix lint * fix lint * fix bug in merge * fix lint Co-authored-by: Lingxuan Zuo <[email protected]> * [Core] New scheduler fixes (#9186) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * Fixed scheduling tests * . * . 
* [Core] put small objects in memory store (#8972) * remove the put in memory store * put small objects directly in memory store * cast data type * fix another place that uses Put to spill to plasma store * fix multiple tests related to memory limits * partially fix test_metrics * remove not functioning codes * fix core_worker_test * refactor put to plasma codes * add a flag for the new feature * add flag to more places * do a warmup round for the plasma store * lint * lint again * fix warmup store * Update _raylet.pyx Co-authored-by: Eric Liang <[email protected]> * [autoscaler] Move command runners into separate file and clean up interface. (#9340) * cleanup * wip * fix imports * fix lint * [docs][rllib] Recommended workflow for training, saving, and testing (#9319) * [autoscaler] Allow users to disable the cluster config cache (#8117) * [autoscaler] Remove autoscaler config cache. * [autoscaler] Add flag allowing users to explicitly disable the config cache. * Update hiredis and remove Windows patches (#9289) Co-authored-by: Mehrdad <[email protected]> * Fix flaky test_dynres.py (#9310) * Fix gcs_table_storage testcase bug (#9393) Co-authored-by: 灵洵 <[email protected]> * [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403) * Change Python's `ObjectID` to `ObjectRef` (#9353) * [Java] Improve JNI performance when submitting and executing tasks (#9032) * Remove the RAY_CHECK in Worker::Port() (#9348) * [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386) * Fix macos compliation bug (#9391) * Fix. * [Core] Plasma RAII support (#9370) * [Serve] Merge router with HTTPProxy (#9225) * Pass run args to DockerCommandRunner (#9411) * Fix copy to workspace (#9400) * [RLlib] Tf2.x native. (#8752) * Update conda and ray wheel on GCP images (#9388) * [Core] Simplify Raylet Client (#9420) * Masking error. 
With t*valid_mask, we get the error np.inf*0 = np.inf (#9407) * [RLLib] WindowStat bug fix (#9213) * WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910. https://github.com/ray-project/ray/issues/7910 * [tune] handling nan values (#9381) * TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439) Co-authored-by: Mehrdad <[email protected]> * [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422) * [Tune] Trainable documentation fix (#9448) * Allow --lru-evict to be passed into `ray start` (#8959) * GCP authentication using oauth tokens (#9279) * Bazel selects compiler flags based on compiler (#9313) Co-authored-by: Mehrdad <[email protected]> * [Core] Build raylet client as an independent component (#9434) * [tune] sklearn comment out (#9454) * Add ability to specify SOCKS proxy for SSH connections (#8833) * [docs] Render ActorPool documentation, etc (#9433) * [tune] Put examples under proper version control (#9427) Co-authored-by: krfricke <[email protected]> * Fix test-multi-node (#9453) * Machine View Sorting / Grouping (#9214) * Convert NodeInfo.tsx to a functional component * Update NodeRowGroup to be a functional component * lint * Convert TotalRow to functional component. * lint * move node info over to using the sortable table head component. spacing is still a little wonky. * Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping * Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer * Add sort accessors for CPU * Add sort accessors for Disk * Add sort accessors for RAM * add a table sort util for function based accessors (rather than flat attribute-based accessor) * wip refactor node info features * wip * Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. 
Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic * wip * wip * wip * Finish adding sorting and grouping of machine view * lint * fix bug in filtration of logs and errors by worker from recent refactor. * Add export of Cluster Disk feature * fix some merge issues Co-authored-by: Max Fitton <[email protected]> * [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269) * [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429) * Fix gcs_pubsub_test bug(#9438) Co-authored-by: 灵洵 <[email protected]> * change error code name of boost timer (#9417) * [tune] PyTorch CIFAR10 example (#9338) Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: Kai Fricke <[email protected]> * Remove legacy C++ code (#9459) * Fix ObjectRef and ActorHandle serialization (#9462) * [Stats] metrics agent exporter (#9361) * [Core] Support GCS server port assignment. (#8962) * Add scripts symlink back (#9219) (#9475) (cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690) Co-authored-by: Simon Mo <[email protected]> * [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461) * [docker] Include base-deps image in rayproject Docker Hub (#9458) * [Core] remove create_and_seal and create_and_seal_batch (#9457) * Speedups for GitHub Actions (#9343) Co-authored-by: Mehrdad <[email protected]> * Fix flaky test_object_manager.py (#9472) * [Java] fix redis-server binary path (#9398) * [core] Handle out-of-order actor table notifications (#9449) * Drop stale actor table notifications * build * Add num_restarts to disconnect handler * Unit test and increment num_restarts on ALIVE, not RESTARTING * Wait for pid to exit * Fix name clash on Windows (#9412) Co-authored-by: Mehrdad <[email protected]> * Add job configs to gcs (#9374) * Make pip install verbose (#9496) Co-authored-by: Mehrdad <[email protected]> * Make more tests compatible with Windows (#9303) * [tune] extend PTL template (GPU, typing 
fixes, tensorboard) (#9451) Co-authored-by: Kai Fricke <[email protected]> * [core] Replace task resubmission in raylet with ownership protocol (#9394) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * Regression tests - shorten timeouts in reconstruction unit tests * Remove timeout for non-actor tasks * Modify tests using ray.internal.free * Clean up future resolution code * Raylet polls the owner * todo * comment * Update src/ray/core_worker/core_worker.cc Co-authored-by: Edward Oakes <[email protected]> * Drop stale actor table notifications * Fix bug where actor restart hangs * Revert buggy code for duplicate tasks * build * Fix errors for lru_evict and internal.free * Revert "Drop stale actor table notifications" This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91. * Revert "build" This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29. * Fix free test * Fixes for freed objects Co-authored-by: Edward Oakes <[email protected]> * release gil in global state accessor (#9357) * [Java] Named java actor (#9037) * Fix clang-cl build (#9494) Co-authored-by: Mehrdad <[email protected]> * [GCS Actor Management] Gcs actor management broken detached actor (#9473) * [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497) * Get rid of build shell scripts and move them to Python (#6082) * Fix broken test_raylet_info_endpoint (#9511) * Fix. 
(#9464) * [Autoscaler] Making bootstrap config part of the node provider interface (#9443) * supporting custom bootstrap config for external node providers * bootstrap config * renamed config to cluster_config * lint * remove 2 args from importer * complete move of bootstrap to node_provider * renamed provider_cls * move imports outside functions * lint * Update python/ray/autoscaler/node_provider.py Co-authored-by: Eric Liang <[email protected]> * final fixes * keeping lines to reduce diff * lint * lamba config * filling in -> adding for lint Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: Eric Liang <[email protected]> * Fix flaky test_actor_failures::test_actor_restart (#9509) * Fix flaky test * os exit * [rllib] MAML Transform (#9463) * MAML Transform * Moved Inner Adapt to Method in Execution Plan * Cleanup Plasma Store (hash utilities) (#9524) * [Serve] Improve buffering for simple cases (#9485) * [Serve] Use pickle instead of clouldpickle (#9479) * Fix pip and Bazel interaction messing up CI (#9506) Co-authored-by: Mehrdad <[email protected]> * [Core] Fix Java detached error (#9526) * fix java createActor NPE bug (#9532) * [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516) * [Stats] Fix metric exporter test (#9376) * Hotfix Lint for Serve (#9535) * Windows cleanup (#9508) * Remove unneeded code for Windows * Get rid of usleep() * Make platform_shims includes non-transitive Co-authored-by: Mehrdad <[email protected]> * [RLlib] Issue 8384: QMIX doesn't learn anything. 
(#9527) * Add placement group manager and some code in core_worker (#9120) Co-authored-by: Lingxuan Zuo <[email protected]> * [core] Add flag to enable object reconstruction during ray start (#9488) * Add flag * doc * Fix tests * Pipelining task submission to workers (#9363) * first step of pipelining * pipelining tests & default configs - added pipelining unit tests in direct_task_transport_test.cc - added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in fligh to each worker - consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_ * post-review revisions * linting, following naming/style convention * linting * [New scheduler] Queueing refactor (#9491) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * . * . * . * . * . * . * . * cleanup * address reviews * address reviews * more refactor * :) * travis pls * . * travis pls * . * [Serve] Add internal instruction for running benchmarks (#9531) * MADDPG learning confirmation test. 
(#9538) * Fix Bazel in Docker (#9530) Co-authored-by: Mehrdad <[email protected]> * Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539) Co-authored-by: 灵洵 <[email protected]> * [tune] Unflattened lookup for ProgressReporter (#9525) Co-authored-by: Kai Fricke <[email protected]> * Add plasma store benchmark for small objects (#9549) * [Tune] Copy default_columns in new ProgressReporter instances (#9537) * quickfix (#9552) * [tune] pin tune-sklearn (#9498) * [cli] ray memory: added redis_password (#9492) * [GCS]Fix lease worker leak bug when gcs server restarts (#9315) * add part code * fix compile bug * fix review comments * fix review comments * fix review comments * fix review comments * fix review comment * fix ut bug * fix lint error * fix review comment * fix review comments * add testcase * add testcase * fix bug * fix review comments * fix review comment * fix review comment * refine comments Co-authored-by: 灵洵 <[email protected]> Co-authored-by: Hao Chen <[email protected]> * [tune] fix pbt checkpoint_freq (#9517) * Only delete old checkpoint if it is not the same as the new one * Return early if old checkpoint value coincides with new checkpoint value Co-authored-by: Kai Fricke <[email protected]> * [Core] Remove socket pair exchange in Plasma Store (#9565) * try use boost::asio for notification processing * [Metric] new cython interface for python worker metric (#9469) * Bazel fixes (#9519) * GCS client add fetch operation before subscribe (#9564) * [RLlib] Fix combination of lockstep and multiple agnts controlled by the same policy. (#9521) * Change aggregation when lockstep is activated. Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy. fix ray-project/ray#9295 * Line too long. 
* [Core] Replace the Plasma eventloop with boost::asio (#9431) * Fix Java named actor bug (#9580) * Fix setup.py bug (#9581) Co-authored-by: Mehrdad <[email protected]> * [Serve] Serialize Query object directly (#9490) * Add dashboard dependencies to default ray installation (#9447) * Dashboard next-version API support in backend (#9345) * Fix log losses (#9559) * Close log on shutdown * Disable log buffering Co-authored-by: Mehrdad <[email protected]> * [docker] run Ubuntu 20.04 as base image (#9556) * Add PTL to README.rst (#9594) Co-authored-by: Richard Liaw <[email protected]> * Skip uneeded steps on CI (#9582) Co-authored-by: Mehrdad <[email protected]> * Fix Windows CI (#9588) Co-authored-by: Mehrdad <[email protected]> * [serve] Rename to `Controller` (#9566) * Handle warnings in core (#9575) * [New scheduler] Fix new scheduler bug (#9467) * fix new scheduler bug * add testcase for soft resource allocation * modify RemoveNode * Ensure unique log file names across same-node raylets. (#9561) * fix tag key typo (#9606) * Rename path variable due to zsh conflict (#9610) * [doc] [minor] Make API docs easier to find. (#9604) * Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572) * Use UTF-8 for encoding of python code for collision hashing (#9586) Co-authored-by: Arne Sachtler <[email protected]> Co-authored-by: simon-mo <[email protected]> * Add bazel to the PATH in setup.py (#9590) Co-authored-by: Mehrdad <[email protected]> * Fix Lint in setup.py (#9618) Co-authored-by: Mehrdad <[email protected]> * Shellcheck comments (#9595) * [Serve] Document Metric Infrastructure (#9389) * [CI] Do not run jenkins test on GHA (#9621) * Support ray task type checking (#9574) * [Metrics] Java metric API (#9377) * [GCS] fix the fault tolerance about gcs node manager (#9380) * Shellcheck quoting (#9596) * Fix SC2006: Use $(...) notation instead of legacy backticked `...`. 
* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that. * Fix SC2046: Quote this to prevent word splitting. * Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching. * Fix SC2068: Double quote array expansions to avoid re-splitting elements. * Fix SC2086: Double quote to prevent globbing and word splitting. * Fix SC2102: Ranges can only match single chars (mentioned due to duplicates). * Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"? * Fix SC2145: Argument mixes string and array. Use * or separate argument. * Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string). Co-authored-by: Mehrdad <[email protected]> * Fix bug in Bazel version check (#9626) Co-authored-by: Mehrdad <[email protected]> * [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033) * Revert "Dashboard next-version API support in backend (#9345)" (#9639) This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe. * [Autoscaler] Command Line Interface improvements (#9322) Co-authored-by: Richard Liaw <[email protected]> * [Core] GCS Actor management on by default. (#8845) * GCS Actor management on by default. * Fix travis config. * Change condition. * Remove unnecessary CI. * [Core] Fix concurrency issues in plasma store runner (#9642) * fix window jni unhappy compiler (#9635) * Fix TestObjectTableResubscribe testcase bug (#9650) * fix named actor single process mode bug (#9652) * [core] Fix Ray service startup when logging redirection is disabled. 
(#9547) * Fix TorchDeterministic (#9241) * [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661) Co-authored-by: Kai Fricke <[email protected]> * [rllib] Fix torch TD error, IMPALA LR updates (#9477) * update * add test * lint * fix super call * speed es test up * Auto-cancel build when a new commit is pushed (#8043) Co-authored-by: Mehrdad <[email protected]> * Fix lint in remote-watch.py (#9668) * [Core] Remove unnecessary windows syscall in plasma store (#9602) * Remove unused windows shims (#9583) * Temporarily disable remote watcher (#9669) * Drop support for Python 3.5. (#9622) * Drop support for Python 3.5. * Update setup.py * [Core] WorkerInterface refactor (#9655) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * . * . * . * Fixed tests * Fixed tests * . * [core] Enable object reconstruction for retryable actor tasks (#9557) * Test actor plasma reconstruction * Allow resubmission of actor tasks * doc * Test for actor constructor * Kill PID before removing node * Kill pid before node * fix java coreworker crash (#9674) * use help proto-init-macro for streaming config (#9272) * Update release information from 0.8.6. (#9124) * [BRING BACK TO MASTER] Update release information. * [MERGE TO MASTER] Add microbenchmark result. * Update asan tests to the doc. * Refinements to the Serve documentation (#9587) Co-authored-by: Dean Wampler <[email protected]> * [tune] survey (#9670) * Fix ERROR logging not being printed to standard error (#9633) Co-authored-by: Mehrdad <[email protected]> * [Tune Docs] Logging doc fix (#9691) * [rllib] Type annotations for model classes (#9646) * [Serve] Allow multiple HTTP servers. (#9523) * Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. 
(#9681) * [Serve] Fix Formatting, stale docs (#9617) * fixed simplex initialisation seeding bug (#9660) Co-authored-by: Petros Christodoulou <[email protected]> * Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697) Co-authored-by: Mehrdad <[email protected]> * Add Ray Serve to README.rst (#9688) * Shellcheck rewrites (#9597) * Fix SC2001: See if you can use ${variable//search/replace} instead. * Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames. * Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames. * Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true. * Fix SC2028: echo may not expand escape sequences. Use printf. * Fix SC2034: variable appears unused. Verify use (or export if used externally). * Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options. * Fix SC2071: > is for string comparisons. Use -gt instead. * Fix SC2154: variable is referenced but not assigned * Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. * Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op). * Fix SC2236: Use -n instead of ! -z. * Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr. * Fix SC2086: Double quote to prevent globbing and word splitting. Co-authored-by: Mehrdad <[email protected]> * [Autoscaler] CLI Logger docs (#9690) Co-authored-by: Richard Liaw <[email protected]> * Update rllib-algorithms.rst (#9640) * [tune] move jenkins tests to travis (#9609) Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: Kai Fricke <[email protected]> * [RLlib] Implement DQN PyTorch distributional head. 
(#9589) * Add placement group java api (#9611) * add part code * add part code * add part code * fix code style * fix review comment * fix review comment * add part code * add part code * add part code * add part code * fix review comment * fix review comment * fix code style * fix review comment * fix lint error * fix lint error Co-authored-by: 灵洵 <[email protected]> * [Stats] Improve Stats::Init & Add it to GCS server (#9563) * [Core] Try remove all windows compat shims (#9671) * try remove compat for arrow * remove unistd.h * remove socket compat * delete arrow windows patch * Fix a few flaky tests (#9709) Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency * [GCS]Open test_gcs_fault_tolerance testcase (#9677) * enable test_gcs_fault_tolerance * fix lint error Co-authored-by: 灵洵 <[email protected]> * [Tests]lock vector to avoid potential flaky test (#9656) * [tune] distributed torch wrapper (#9550) * changes * add-working * checkpoint * ccleanu * fix * ok * formatting * ok * tests * some-good-stuff * fix-torch * ddp-torch * torch-test * sessions * add-small-test * fix * remove * gpu-working * update-tests * ok * try-test * formgat * ok * ok * [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045) * Only update raylet map when autoscaler configured (#9435) * [Dashboard] New dashboard skeleton (#9099) * Fixing multiple building issues * Make wait_for_condition raise exception when timing out. (#9710) * [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718) * Package and upload ray cross-platform jar (#9540) * Revert "Package and upload ray cross-platform jar (#9540)" (#9730) This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e. 
* Only build docker wheels in LINUX_WHEELS env (#9729) * Keep build-autoscaler-images.sh alive in CI (#9720) * [core] Removes Error when Internal Config is not set (#9700) * [Cluster Launcher] Re Org the cluster launcher pages. (#9687) * [RLlib] Offline Type Annotations (#9676) * Offline Annotations * Modifications * Fixed circular dependencies * Linter fix * Python api of placement group (#9243) * Include open-ssh-client for transparency (#9693) * Fix remote-watch.py (#9625) Co-authored-by: Mehrdad <[email protected]> * [docker] Uses Latest Conda & Py 3.7 (#9732) * Fix broken actor failure tests. (#9737) * [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727) * Fix package and upload ray jar (#9742) * Introduce file_mounts_sync_continuously cluster option (#9544) * Separate out file_mounts contents hashing into its own separate hash Add an option to continuously sync file_mounts from head node to worker nodes: monitor.py will re-sync file mounts whenver contents change but will only run setup_commands if the config also changes * add test and default value for file_mounts_sync_continuously * format code * Update comments * Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick Fixed so setup commands run when ray up is run and file_mounts content changes * Refactor so that runtime_hash retains previous behavior runtime_hash is almost identical as before this PR. It is used to determine if setup_commands need to run file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur. 
Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization * fix issue with hashing a hash * fix bug where trying to set contents hash when it wasn't generated * Fix lint error Fix bug in command_runner where check_output was no longer returning the output of the command * clear out provider between tests to get rid of flakyness * reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call * [dist] swap mac/linux wheel build order (#9746) * [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684) * [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680) * [Metrics]Ray java worker metric registry (#9636) * ray worker metrics gauge init * ray java metric mapping * add jni source files for gauge and tagkey * mapping all metric classes to stats object * check non-null for tags and name * lint * add symbol for native metric JNI * extern c for symbol * add tests for all metrics * Update Metric.java use metricNativePointer instead. * unify metric native stuff to one class * fix jni file * add comments for metric transform function in jni utils * move metric function to native metric file * remove unused disconnect jni * Add a metric registry for java metircs * Restore install-bazel.sh * Add some comments for metric registry * Fix thread safe problem of metrics * Fix metric tests and remove sleep code from tests * Fix comments of metrics Co-authored-by: lingxuan.zlx <[email protected]> * fix windows compile bug (#9741) Co-authored-by: 灵洵 <[email protected]> * Run _with_interactive in Docker (#9747) * [New scheduler] First unit test for task manager (#9696) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * bad git >:-( * small clean up * CR * . * . * One more fixture * One more fixture * . * . * bazel-format * . 
* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607) * [Release] Fix release tests (#9733) * Register function race (#9346) * Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758) * Revert "[dist] swap mac/linux wheel build order (#9746)" This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85. * Revert "Fix package and upload ray jar (#9742)" This reverts commit c290c308fe1e496480db5c37489df619cff6168f. * Fix some Windows CI issues (#9708) Co-authored-by: Mehrdad <[email protected]> * Pin pytest version (#9767) * [Java] Use test groups to filter tests of different run modes (#9703) * [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770) * Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (#9719) * [Stats] enable core worker stats (#9355) * [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416) * use a sole thread to handle heartbeat * separate signal thread * use work to avoid exiting when task is underway * protect shared data structure to avoid deadlock * add comments * decrease io service num * minor changes * fix test * per stephanie's comments * use single io service instead of 1-size io service pool * typo * [GCS Actor Management] Fix flaky test_dead_actors. (#9715) * Fix. * Add logs. * Add an unit test. * [TUNE] Tune Docs re-organization (#9600) Co-authored-by: Richard Liaw <[email protected]> * [RLlib] Trajectory View API (preparatory cleanup and enhancements). 
(#9678) * [Core] Socket creation race condition bug fixes (#9764) * fix issues * hot fixes * test * test * Always info log * Fixed stderr logging (9765) * [Core] Custom socket name (#9766) * fix issues * hot fixes * test * test * socket name change only * Fix src/ray/core_worker/common.h deleted constructor (#9785) Co-authored-by: Mehrdad <[email protected]> * [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745) * More fixes * Applying latest changes in travis.yml * Fixing fixture data exclusions * Disable some java tests * Fix some CI errors * Update hash * Fixing more build issues * Fixing more build issues * Fix pipeline cache path * More fixes * Fix bazel test command * Fix bazel test * Fix general info steps * Custom env var for docker build * Trying a different way to install bazel * Bazel fix * Updating hash Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]> Co-authored-by: mehrdadn <[email protected]> Co-authored-by: Mehrdad <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]> Co-authored-by: Stephanie Wang <[email protected]> Co-authored-by: Alisa <[email protected]> Co-authored-by: Lingxuan Zuo <[email protected]> Co-authored-by: Alex Wu <[email protected]> Co-authored-by: Zhuohan Li <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Stefan Schneider <[email protected]> Co-authored-by: Patrick Ames <[email protected]> Co-authored-by: Hao Chen <[email protected]> Co-authored-by: fangfengbin <[email protected]> Co-authored-by: 灵洵 <[email protected]> Co-authored-by: Tao Wang <[email protected]> Co-authored-by: Kai Yang <[email protected]> Co-authored-by: Sven Mika <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Ian Rodney <[email protected]> Co-authored-by: Henk Tillman <[email protected]> Co-authored-by: Tanay Wakhare <[email protected]> Co-authored-by: 
Nicolaus93 <[email protected]> Co-authored-by: Vasily Litvinov <[email protected]> Co-authored-by: krfricke <[email protected]> Co-authored-by: Max Fitton <[email protected]> Co-authored-by: Max Fitton <[email protected]> Co-authored-by: kisuke95 <[email protected]> Co-authored-by: Kai Fricke <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Michael Mui <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Co-authored-by: chaokunyang <[email protected]> Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: Michael Luo <[email protected]> Co-authored-by: Gabriele Oliaro <[email protected]> Co-authored-by: Tom <[email protected]> Co-authored-by: jerrylee.io <[email protected]> Co-authored-by: Raphael Avalos <[email protected]> Co-authored-by: William Falcon <[email protected]> Co-authored-by: Clark Zinzow <[email protected]> Co-authored-by: Robert Nishihara <[email protected]> Co-authored-by: Arne Sachtler <[email protected]> Co-authored-by: Arne Sachtler <[email protected]> Co-authored-by: Philipp Moritz <[email protected]> Co-authored-by: ZhuSenlin <[email protected]> Co-authored-by: Max Fitton <[email protected]> Co-authored-by: Maksim Smolin <[email protected]> Co-authored-by: Dean Wampler <[email protected]> Co-authored-by: Dean Wampler <[email protected]> Co-authored-by: Bill Chambers <[email protected]> Co-authored-by: Petros Christodoulou <[email protected]> Co-authored-by: Petros Christodoulou <[email protected]> Co-authored-by: Justin Terry <[email protected]> Co-authored-by: Tao Wang <[email protected]> Co-authored-by: fyrestone <[email protected]> Co-authored-by: Alan Guo <[email protected]> Co-authored-by: bermaker <[email protected]> * Sync Upstream master (#50) * [core] Pull Manager exponential backoff (#13024) * [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793) * [release tests] test_many_tasks fix 
(#12984) * Add "beta" documentation for enabling object spilling manually (#13047) * [Serve] Handle Bug Fixes (#12971) * [Dashboard] Add GET /logical/actors API (#12913) * [GCS]Decouple gcs resource manager and gcs node manager (#13012) * [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031) * [GCS] Delete redis gcs client and redis_xxx_accessor (#12996) * [RLlib] Fix broken unity3d_env import in example server script. (#13040) * [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039) * [joblib] Fix flaky joblib test. (#13046) * [Tune]Add integer loguniform support (#12994) * Add integer quantization and loguniform support * Fix hyperopt qloguniform not being np.log'd first * Add tests, __init__ * Try to fix tests, better exceptions * Tweak docstrings * Type checks in SearchSpaceTest * Update docs * Lint, tests * Update doc/source/tune/api_docs/search_space.rst Co-authored-by: Kai Fricke <[email protected]> Co-authored-by: Kai Fricke <[email protected]> * [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048) * Add index for tasks to dispatch * Task dependency manager interface * Unsubscribe dependencies and tests * NodeManager * Revert "Add index for tasks to dispatch" This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea. 
* tmp * Move back to waiting if args not ready * update * Update to new form of brew cask install command * [Autoscaler] New output log format (#12772) * Fix typo RMSProp -> RMSprop (#13063) * [serve] Centralize HTTP-related logic in HTTPState (#13020) * Remove suppress output to see why wheel is not building * Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006) * New dependency manager * Switch raylet to new DependencyManager * PullManager accepts bundles * Cleanup, remove old task dependency manager * x * PullManager unit tests * lint * Unit tests * Rename * lint * test * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <[email protected]> * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <[email protected]> * x * lint Co-authored-by: SangBin Cho <[email protected]> * [docs] Fix args + kwargs instead of docstrings (#13068) * functools wraps * Fix typo (functoools -> functools) * Fix OS X Wheel Build - Update brew cask install (#13062) Co-authored-by: Richard Liaw <[email protected]> * speed up local mode object store get (#13052) Co-authored-by: senlin.zsl <[email protected]> * [RLlib] Execution Annotation (#13036) * [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943) * [C++ API] Added reference counting to ObjectRef (#13058) * Added reference counting to ObjectRef * Addressed the comments * [Core] Remove cuda support in plasma store (#13070) * remove cuda support in plasma store * [Core] Remote outdated external store (#13080) * remove outdated external store * [GCS] Move resource usage info to gcs resource manager (#13059) * [RLlib] JAXPolicy prep. PR #1. (#13077) * [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083) * [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. 
(#13064) * [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935) * other collectives all work * auto-linting * mannual linting #1 * mannual linting 2 * bugfix * add send/recv point-to-point calls * add some initial code for communicator caching * auto linting * optimize imports * minor fix * fix unpassed tests * support more dtypes * rerun some distributed tests for send/recv * linting * [Serve] [Doc] Front page update (#13032) * Deprecate experimental / dynamic resources (#13019) * [docs] fix wandb url (#13094) * [Serve] Implement Graceful Shutdown (#13028) * [Serve] Use ServeHandle in HTTP proxy (#12523) * [Java] Format ray java code (#13056) * [docker] Fix restart behavior with Docker (#12898) Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: ijrsvt <[email protected]> * Disable broken streaming tests (#13095) * [autoscaler] Make placement groups bypass max launch limit (#13089) * Serve metrics docs (#13096) * [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097) * [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035) * [Doc] Fix Sphinx.add_stylesheet deprecation (#13067) * Fix streaming ci failure (#12830) * [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118) * [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113) * [RLlib] Deflake test case: 2-step game MADDPG. (#13121) * [RLlib] Trajectory view API docs. (#12718) * Job module without submission (#13081) Co-authored-by: 刘宝 <[email protected]> * [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091) * [Java] Avoid failure of serializing a user-defined unserializable exception. 
(#13119) * [Tune] Update URL to fix 403 not found error in PBT tranformers test case (#13131) * [serve] Async controller (#13111) * [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948) * [Serve] Use a small object to track requests (#13125) * [docs][kubernetes][minor] Update K8s examples in doce (#13129) * [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698) * [docs] Documentation + example for the C++ language API (#13138) * [Java] Support `wasCurrentActorRestarted` in actor task. (#13120) * Remove check. * Add test * fix lint * lint * Fix spotless lint * Address comments. * Fix lint Co-authored-by: Qing Wang <[email protected]> * [docs] Minor change to formating C++ docs. (#13151) * Deprecate setResource java api (#13117) * [docs] Small fix in C++ documentation. (#13154) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. 
* reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: Alex Wu <[email protected]> Co-authored-by: Alex Wu <[email protected]> Co-authored-by: Eric Liang <[email protected]> Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: root <[email protected]> * [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127) * [kubernetes][docs][minor] Kubernetes version warning (#13161) * [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817) * Locality-aware leasing for owned refs (pinned locations). * LessorPicker --> LeasePolicy. * Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects. * Update comments. * Turn on locality-aware leasing feature flag by default. * Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy. 
* Add lease policy consulting assertions to the direct task submitter tests. * Add lease policy tests. * LocalityLeasePolicy --> LocalityAwareLeasePolicy. * Add missing const declarations. Co-authored-by: SangBin Cho <[email protected]> * Add RAY_CHECK for raylet address nullptr when creating lease client. * Make the fact that LocalLeasePolicy always returns the local node more explicit. * Flatten GetLocalityData conditionals to make it more readable. * Add ReferenceCounter::GetLocalityData() unit test. * Add data-intensive microbenchmarks for single-node perf testing. * Add data-intensive microbenchmarks for simulated cluster perf testing. * Remove redundant comment. * Remove data-intensive benchmarks. * Add locality-aware leasing Python test. * Formatting changes in ray_perf.py. Co-authored-by: SangBin Cho <[email protected]> * Enabling the cancellation of non-actor tasks in a worker's queue (#12117) * wrote code to enable cancellation of queued non-actor tasks * minor changes * bug fixes * added comments * rev1 * linting * making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error * bug fix * added two unit tests * linting * iterating through pending_normal_tasks starting from end * fixup! iterating through pending_normal_tasks starting from end * fixup! fixup! 
iterating through pending_normal_tasks starting from end * post merge fixes * added debugging instructions, pulled Accept() out of guarded loop * removed debugging instructions, linting * [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061) * [Release] Update Release Process Documentation (#13123) * [Core] Remove Arrow dependencies (#13157) * remove arrow ubsan * remove arrow build depend * remove arrow buffer * [XGboost] Update Documentation (#13017) Co-authored-by: Richard Liaw <[email protected]> * [SGD] Fix Docstring for `as_trainable` (#13173) * Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178) This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2. * Surface object store spilling statistics in `ray memory` (#13124) * [ray_client]: Move from experimental to util (#13176) Change-Id: I9f054881f0429092d265cd6944d89804cce9d946 * Remove unused file(object_manager_integration_test.cc) (#12989) * Notify listeners after registered node stored (#13069) * [build]Update description and add some keywords (#13163) * [Collective][PR 2/6] Driver program declarative interfaces (#12874) * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * add a Backend class to make Backend string more robust * add several useful APIs * add some tests * added allreduce test * fix typos * fix several bugs found via unittests * fix and update torch test * changed back actor * rearange a bit before 
importing distributed test * add distributed test * remove scratch code * auto-linting * linting 2 * linting 2 * linting 3 * linting 4 * linting 5 * linting 6 * 2.1 2.2 * fix small bugs * minor updates * linting again * auto linting * linting 2 * final linting * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <[email protected]> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <[email protected]> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <[email protected]> * added actor test * lint * remove local sh * address most of richard's comments * minor update * remove the actor.option() interface to avoid changes in ray core * minor updates Co-authored-by: YLJALDC <[email protected]> Co-authored-by: Richard Liaw <[email protected]> * [serve] Merge ActorReconciler and BackendState (#13139) * [tune] better signature check for `tune.sample_from` (#13171) * [tune] better signature check for `tune.sample_from` * Update python/ray/tune/sample.py Co-authored-by: Sumanth Ratna <[email protected]> Co-authored-by: Sumanth Ratna <[email protected]> * Disable atexit test on windows (#13207) * [serve] Move controller state into separate files (#13204) * Update multi_agent_independent_learning.py (#13196) pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead * [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162) * [Tune] Fix PBT Transformers Example (#13174) * [Serve] HTTPOptions for deployment modes (#13142) * [tests] Fix Autoscaler Test failure on Windows (#13211) * skip create_or_update tests * Update python/ray/tests/test_autoscaler.py Co-authored-by: Ameer Haj Ali <[email protected]> Co-authored-by: Ameer Haj Ali <[email protected]> * [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) * [GCS]Fix TestActorSubscribeAll bug (#13193) * [Metrics] Record per node and raylet cpu / mem usage 
(#12982) * Record per node and raylet cpu / mem usage * Add comments. * Addressed code review. * [Tune] Fix tune serve integration example (#13233) * [Redis] Note that each Redis Connect retry takes two minutes (#12183) * Slightly alter error message so it's the same in both cases. * Each retry takes about two minutes. * [Log] fix spdlog init race (#12973) * fix spdlog init race * use global logger * refine logger name and constructor * [Release] Add 1.1.0 release test logs (#13054) * Add microbenchmark to release logs * check in many_tasks stress test result * Add results of placement group stress test for 1.1.0 * Add result for test_dead_actors test and correct the name of test_many_tasks.txt * Add rllib regression test result * Add pytorch test results for rllib * remove extraneous log entries * [Core] Fix incorrect comment (#13228) * [Serialization] Fix cloudpickle (#13242) * [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195) * Start ray client server with 'ray start' (#13217) * [GCS]Add gcs actor schedule strategy (#13156) * Publish job/worker info with Hex format instead of Binary (#13235) * [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126) * [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247) Now that `HeadOnly` becomes the new default HTTP location, we can re-enable the long running tests to use local multi-clusters. (also fixed the controller's API to match up to date, we should have caught these, I will open issues for this.) * Update autoscaler-cluster yaml files for release tests (#13114) * [Release] Use ray-ml image for logn running test (#13267) * [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. 
(#13237) * [Tune] Improve error message for Session Detection (#13255) * Improve error message * log once * [Tune] Pin Tune Dependencies (#13027) Co-authored-by: Ian <[email protected]> * [Dependabot] Add Dependabot (#13278) Co-authored-by: Ian <[email protected]> * [docker] Pull if image is not present (#13136) * [GCS] Remove old lightweight resource usage report code path (#13192) * [Dashboard] Add GET /log_proxy API (#13165) * Fix a crash problem caused by GetActorHandle in ActorManager (#13164) * [ray_client] Add metadata to gRPC requests (#13167) * [RLlib] Preparatory PR for: Documentation on Model Building. (#13260) * [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286) * [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287) * Remove top-level ray.connect() and ray.disconnect() APIs (#13273) * [Pull manager] Only pull once per retry period (#13245) * . * docs * cleanup * . * . * . * . Co-authored-by: Alex <[email protected]> * [Cancellation] Make Test Cancel Easier to Debug (#13243) * first commit * lint-fix * [ray_client]: first draft of documentation (#13216) * Do not give an error if both `RAY_ADDRESS` and `address` is specified on initialization (#13305) * Finalize handling of RAY_ADDRESS * lint * [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215) * [RLlib] SlateQ Documentation (#13266) * [RLlib] Add more detailed Documentation on Model building API (#13261) * [tune] convert search spaces: parse spec before flattening (#12785) * Parse spec before flattening * flatten after parse * Test for ValueError if grid search is passed to search algorithms * remove empty extras streaming deps (#12933) * add the method annotation and a comment explaining what's happening (#13306) Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a * Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210) * [RLlib] Issue 13330: No TF installed causes 
crash in `ModelCatalog.get_action_shape()` (#13332) * [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298) * fix removal of task dependencies (#13333) Co-authored-by: senlin.zsl <[email protected]> * [Serve] Support Starlette streaming response (#13328) * [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339) * [client] Report number of currently active clients on connect (#13326) * wip * update * update * reset worker * fix conn * fix * disable pycodestyle * Implement internal kv in ray client (#13344) * kv internal * fix * [Tune] Rename MLFlow to MLflow (#13301) * Forgot overwrite parameter in Ray client internal kv * Fix typo in Tune Docs (Checkpointing) (#13348) See issue #13299 * [Kubernetes][Docs] GPU usage (#13325) * gpu-note * gpu-note * More info * lint? * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <[email protected]> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <[email protected]> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <[email protected]> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <[email protected]> * GKE->Kubernetes Co-authored-by: Richard Liaw <[email protected]> * Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361) This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419. 
* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359) * [tune] buffer trainable results (#13236) * Working prototype * Pass buffer length, fix tests * Don't buffer per default * Dispatch and process save in one go, added tests * Fix tests * Pass adaptive seconds to train_buffered, stop result processing after STOP decision * Fix tests, add release test * Update tests * Added detailed logs for slow operations * Update python/ray/tune/trial_runner.py Co-authored-by: Richard Liaw <[email protected]> * Apply suggestions from code review * Revert tests and go back to old tuning loop * nit Co-authored-by: Richard Liaw <[email protected]> * [Serve] Add dependency management support for driver not running in a conda env (#13269) * [RLlib] Add `__len__()` method to SampleBatch (#13371) * [Serve] Backend state unit tests (#13319) * trigger doc build for serve updates (#13373) * [Object Spilling] Long running object spilling test (#13331) * done. * formatting. * Remove unimplemented GetAll method in actor info accessor (#13362) * [Doc] Remove trailing whitespaces (#13390) * Enable Ray client server by default (#13350) * update * fix * fix test * update * [RLlib] Trajectory View API: Atari framestacking. 
(#13315) * [ray_client]: Wait for ready and retry on ray.connect() (#13376) * [ray_client]: wait until connection ready Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6 * lint Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0 * docs and retry minimum Change-Id: I43f5378322029267ddd69f518ce8206876e2129d * [Dashboard] Fix missing actor pid (#13229) * [ray_client]: Fix multiple attempts at checking connection (#13422) * Plumb retries update (#13411) * [Serve] [Doc] Improve batching doc (#13389) * [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514) * Fix Serve release test (#13385) * Add bazel logs upload to GHA (#13251) * [tune] Fix f-string in error message (#13423) * [serve] Pull out goal management logic into AsyncGoalManager class (#13341) * Make request_resources() use internal kv instead of redis pub sub (#13410) * Remove unused handler methods (#13394) * [Tune] Pin Transitive Dependencies (#13358) * Split out the part of get_node_ip_address for which the docstring is correct (#12796) * Fix raylet::MockWorker::GetProcess crashes (#13440) Co-authored-by: 刘宝 <[email protected]> * Revert "Enable Ray client server by default (#13350)" (#13429) This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d. * Fix linter error (#13451) * [GCS]Add gcs resource scheduler (#13072) * [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). 
(#13363) * [Core]Fix raylet scheduling bug (#13452) * [Core]Fix raylet scheduling bug * fix lint error * fix lint error Co-authored-by: 灵洵 <[email protected]> * [joblib] joblib strikes again but this time on windows (#13212) * [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424) * [kubernetes][minor] Operator garbage collection fix (#13392) * [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391) * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Job 38482.1 should now pass * Resolve merge conflict * [RLlib] Deflake 2x remote & local inference tests (external env). (#13459) * [docs] Add more guideline on using ray in slurm cluster (#12819) Co-authored-by: Sumanth Ratna <[email protected]> Co-authored-by: PENG Zhenghao <[email protected]> Co-authored-by: Richard Liaw <[email protected]> * [Dashboard] Fix GPU resource rendering issue (#13388) * [Release] Fix Serve release test (#13303) The Docker image we were using now uses `ray` users so we have to call sudo. 
* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460) * Fix getting runtime context dict in driver (#13417) * [xgb] re-enable xgboost_ray tests (#13416) * re-enable * fix * update xgb_ray version * [Serialization] New custom serialization API (#13291) * new serialization API with doc & test * add more notes * refine notes * doc * [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220) * Added owned object reference before Plasma put on Create() + Seal() path. * Consolidated location table and reference table in reference counter. * Restore type in definition. * Clean up owned reference on failed Seal(). * Added RemoveOwnedObject test for reference counter. * Guard against ref going out of scope before location RPCs. * Add 'owner must have ref in scope' precondition to documentation for object location methods. * Move to separate Create() + Seal() methods for existing objects. * Clearer distinction between Create() and Seal() methods. * Make it clear that references will normally be cleaned up by reference counting. * [ray_client]: Support runtime_context as metadata (#13428) * [GCS]Remove unused class variable (#13454) * [Object Spilling] Dedup restore objects (#13470) * done. * Addressed code review. 
* [CI] Enable Dashboard tests for master (#13425) * [docker/dashboard] Fix ray dashboard (#12899) * [CI] Fix Windows Bazel Upload (#13436) * Return version info from Ray client connect, to allow for discovering version mismatches * Update ID specification doc (#13356) * [ray_client]: fix wrong reference in server_pickler (#13474) Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf * Bump dev branch to 2.0 to avoid endless version bump toil (#13497) * wip * fix * fix * Remove an unnecessary file (#13499) * [Tests] Skip failing windows tests (#13495) * skip failing windows tests * skip more * remove * updates * [tune] fix small docs typo (#13355) Signed-off-by: Richard Liaw <[email protected]> * move message to debug (#13472) * Minimal version of piping autoscaler events to driver logs (#13434) * sync write internal config in gcs (#13197) * Refactor node manager to eliminate `new_scheduler_enabled_` (#12936) * [GCS]Only publish changed field when node dead (#13364) * Only update changed field when node dead * node_id missed * [CI] Buildkite PR Environment for Simple Tests (#13130) * [GCS] Remove task info publish as nowhere uses it (#13509) * Remove task info publish as nowhere uses it * simplify right publish channel * [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467) * [tune] placement group support (#13370) * [Serve] Allow ObjectRef for Composition (#12592) * Add Dashboard Python Test to Buildkite (#13530) * Add ability to not start Monitor when calling `ray start` (#13505) * [tune] support experiment checkpointing for grid search (#13357) * Fix typo (#13098) * Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544) * [RLlib] MARWIL loss function test case and cleanup. (#134…