Edpalenc/1.3.0 bonsai sync (#69)
* Set up CI with Azure Pipelines

Specifically, we are setting up a
Travis-like ADO pipeline that follows
what is already present in the .travis.yml
file in the root of the repo.

* Separating the Travis-like pipeline from the main pipeline

* Adding Jenkins jobs equivalent

* Making some improvements

* Adding validation of the upstream CI

* Disabling Tune and large memory tests

* Changing threshold for simple reservoir sampling test

* Addressing comments

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with more travis updates

* Updating CI with new cpp worker tests

* Setting code owners

* Fixing the version number generation

* Making main pipeline also our release pipeline

* Updating Azure Pipelines with travis updates

* Fixing wheels test

* Fixing codeowners

* Updating Azure Pipelines with travis updates

* Bumping up MACOSX_DEPLOYMENT_TARGET

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Disabling Serve tests

* Making explicit which branches GitHubActions workflows should watch

* Disabling Ray serve tests

* Installing numpy explicitly

* consolidating Ray test steps in one yml

* Syncing with upstream master 2020-07-30 (#21)

* [Core] Enhance common client connection (#9367)

* enhance client connection

* add write buffer async

* read message

* add test

* Bazel move more shell to native rules (#9314)

Co-authored-by: Mehrdad <[email protected]>

* [tune] Fix github readme (#9365)

Co-authored-by: Amog Kamsetty <[email protected]>

* Combine different severities into the same log files (#9230)

* Combine different severities into the same log files

Co-authored-by: Mehrdad <[email protected]>

* [core] Pass owner address from the workers to the raylet (#9299)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* fix tests

* Fix tests

* build

* build

* fix

* Change assertion to warning to fix java

* [Core] Add placement group scheduler and some api in resource scheduler (#9039)

* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

* change the bundle id and delete unit count in bundle

change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>

Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

change the bundle id and delete unit count in bundle

remove CheckIfSchedulable()

add comments and fix the bug in resource

* fix placement group schedule

* add placement group scheduler and change some api in resource scheduler

* fix by the comments

* fix conflict

* fix lint

* fix lint

* fix bug in merge

* fix lint

Co-authored-by: Lingxuan Zuo <[email protected]>

* [Core] New scheduler fixes (#9186)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* Fixed scheduling tests

* .

* .

* [Core] put small objects in memory store (#8972)

* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <[email protected]>

* [autoscaler] Move command runners into separate file and clean up interface. (#9340)

* cleanup

* wip

* fix imports

* fix lint

* [docs][rllib] Recommended workflow for training, saving, and testing (#9319)

* [autoscaler] Allow users to disable the cluster config cache (#8117)

* [autoscaler] Remove autoscaler config cache.

* [autoscaler] Add flag allowing users to explicitly disable the config cache.

* Update hiredis and remove Windows patches (#9289)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_dynres.py (#9310)

* Fix gcs_table_storage testcase bug (#9393)

Co-authored-by: 灵洵 <[email protected]>

* [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403)

* Change Python's `ObjectID` to `ObjectRef` (#9353)

* [Java] Improve JNI performance when submitting and executing tasks (#9032)

* Remove the RAY_CHECK in Worker::Port() (#9348)

* [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386)

* Fix macos compilation bug (#9391)

* Fix.

* [Core] Plasma RAII support (#9370)

* [Serve] Merge router with HTTPProxy (#9225)

* Pass run args to DockerCommandRunner (#9411)

* Fix copy to workspace (#9400)

* [RLlib] Tf2.x native. (#8752)

* Update conda and ray wheel on GCP images (#9388)

* [Core] Simplify Raylet Client (#9420)

* Masking error. With t*valid_mask, we get the error np.inf*0 = np.inf (#9407)

* [RLLib] WindowStat bug fix (#9213)

* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910.
https://github.com/ray-project/ray/issues/7910

* [tune] handling nan values (#9381)

* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439)

Co-authored-by: Mehrdad <[email protected]>

* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422)

* [Tune] Trainable documentation fix (#9448)

* Allow --lru-evict to be passed into `ray start` (#8959)

* GCP authentication using oauth tokens (#9279)

* Bazel selects compiler flags based on compiler (#9313)



Co-authored-by: Mehrdad <[email protected]>

* [Core] Build raylet client as an independent component (#9434)

* [tune] sklearn comment out (#9454)

* Add ability to specify SOCKS proxy for SSH connections (#8833)

* [docs] Render ActorPool documentation, etc (#9433)

* [tune] Put examples under proper version control (#9427)

Co-authored-by: krfricke <[email protected]>

* Fix test-multi-node (#9453)

* Machine View Sorting / Grouping (#9214)

* Convert NodeInfo.tsx to a functional component

* Update NodeRowGroup to be a functional component

* lint

* Convert TotalRow to functional component.

* lint

* move node info over to using the sortable table head component. spacing is still a little wonky.

* Factor a NodeWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping

* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer

* Add sort accessors for CPU

* Add sort accessors for Disk

* Add sort accessors for RAM

* add a table sort util for function based accessors (rather than flat attribute-based accessor)

* wip refactor node info features

* wip

* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic

* wip

* wip

* wip

* Finish adding sorting and grouping of machine view

* lint

* fix bug in filtration of logs and errors by worker from recent refactor.

* Add export of Cluster Disk feature

* fix some merge issues

Co-authored-by: Max Fitton <[email protected]>

* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269)

* [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429)

* Fix gcs_pubsub_test bug (#9438)

Co-authored-by: 灵洵 <[email protected]>

* change error code name of boost timer (#9417)

* [tune] PyTorch CIFAR10 example (#9338)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* Remove legacy C++ code (#9459)

* Fix ObjectRef and ActorHandle serialization (#9462)

* [Stats] metrics agent exporter (#9361)

* [Core] Support GCS server port assignment. (#8962)

* Add scripts symlink back (#9219) (#9475)

(cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690)

Co-authored-by: Simon Mo <[email protected]>

* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461)

* [docker] Include base-deps image in rayproject Docker Hub (#9458)

* [Core] remove create_and_seal and create_and_seal_batch (#9457)

* Speedups for GitHub Actions (#9343)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_object_manager.py (#9472)

* [Java] fix redis-server binary path (#9398)

* [core] Handle out-of-order actor table notifications (#9449)

* Drop stale actor table notifications

* build

* Add num_restarts to disconnect handler

* Unit test and increment num_restarts on ALIVE, not RESTARTING

* Wait for pid to exit

* Fix name clash on Windows (#9412)

Co-authored-by: Mehrdad <[email protected]>

* Add job configs to gcs (#9374)

* Make pip install verbose (#9496)

Co-authored-by: Mehrdad <[email protected]>

* Make more tests compatible with Windows (#9303)

* [tune] extend PTL template (GPU, typing fixes, tensorboard) (#9451)

Co-authored-by: Kai Fricke <[email protected]>

* [core] Replace task resubmission in raylet with ownership protocol (#9394)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* Regression tests - shorten timeouts in reconstruction unit tests

* Remove timeout for non-actor tasks

* Modify tests using ray.internal.free

* Clean up future resolution code

* Raylet polls the owner

* todo

* comment

* Update src/ray/core_worker/core_worker.cc

Co-authored-by: Edward Oakes <[email protected]>

* Drop stale actor table notifications

* Fix bug where actor restart hangs

* Revert buggy code for duplicate tasks

* build

* Fix errors for lru_evict and internal.free

* Revert "Drop stale actor table notifications"

This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91.

* Revert "build"

This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29.

* Fix free test

* Fixes for freed objects

Co-authored-by: Edward Oakes <[email protected]>

* release gil in global state accessor (#9357)

* [Java] Named java actor (#9037)

* Fix clang-cl build (#9494)

Co-authored-by: Mehrdad <[email protected]>

* [GCS Actor Management] Gcs actor management broken detached actor (#9473)

* [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497)

* Get rid of build shell scripts and move them to Python (#6082)

* Fix broken test_raylet_info_endpoint (#9511)

* Fix. (#9464)

* [Autoscaler] Making bootstrap config part of the node provider interface (#9443)

* supporting custom bootstrap config for external node providers

* bootstrap config

* renamed config to cluster_config

* lint

* remove 2 args from importer

* complete move of bootstrap to node_provider

* renamed provider_cls

* move imports outside functions

* lint

* Update python/ray/autoscaler/node_provider.py

Co-authored-by: Eric Liang <[email protected]>

* final fixes

* keeping lines to reduce diff

* lint

* lambda config

* filling in -> adding for lint

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Eric Liang <[email protected]>

* Fix flaky test_actor_failures::test_actor_restart (#9509)

* Fix flaky test

* os exit

* [rllib] MAML Transform (#9463)

* MAML Transform

* Moved Inner Adapt to Method in Execution Plan

* Cleanup Plasma Store (hash utilities) (#9524)

* [Serve] Improve buffering for simple cases (#9485)

* [Serve] Use pickle instead of cloudpickle (#9479)

* Fix pip and Bazel interaction messing up CI (#9506)

Co-authored-by: Mehrdad <[email protected]>

* [Core] Fix Java detached error (#9526)

* fix java createActor NPE bug (#9532)

* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516)

* [Stats] Fix metric exporter test (#9376)

* Hotfix Lint for Serve (#9535)

* Windows cleanup (#9508)

* Remove unneeded code for Windows

* Get rid of usleep()

* Make platform_shims includes non-transitive

Co-authored-by: Mehrdad <[email protected]>

* [RLlib] Issue 8384: QMIX doesn't learn anything. (#9527)

* Add placement group manager and some code in core_worker (#9120)

Co-authored-by: Lingxuan Zuo <[email protected]>

* [core] Add flag to enable object reconstruction during ray start (#9488)

* Add flag

* doc

* Fix tests

* Pipelining task submission to workers (#9363)

* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting

* [New scheduler] Queueing refactor (#9491)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* .

* .

* .

* .

* .

* .

* .

* cleanup

* address reviews

* address reviews

* more refactor

* :)

* travis pls

* .

* travis pls

* .

* [Serve] Add internal instruction for running benchmarks (#9531)

* MADDPG learning confirmation test. (#9538)

* Fix Bazel in Docker (#9530)

Co-authored-by: Mehrdad <[email protected]>

* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539)

Co-authored-by: 灵洵 <[email protected]>

* [tune] Unflattened lookup for ProgressReporter (#9525)

Co-authored-by: Kai Fricke <[email protected]>

* Add plasma store benchmark for small objects (#9549)

* [Tune] Copy default_columns in new ProgressReporter instances (#9537)

* quickfix (#9552)

* [tune] pin tune-sklearn (#9498)

* [cli] ray memory: added redis_password (#9492)

* [GCS]Fix lease worker leak bug when gcs server restarts (#9315)

* add part code

* fix compile bug

* fix review comments

* fix review comments

* fix review comments

* fix review comments

* fix review comment

* fix ut bug

* fix lint error

* fix review comment

* fix review comments

* add testcase

* add testcase

* fix bug

* fix review comments

* fix review comment

* fix review comment

* refine comments

Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Hao Chen <[email protected]>

* [tune] fix pbt checkpoint_freq (#9517)

* Only delete old checkpoint if it is not the same as the new one

* Return early if old checkpoint value coincides with new checkpoint value

Co-authored-by: Kai Fricke <[email protected]>
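
The two checkpoint fixes above amount to a guard against deleting a checkpoint that has just been registered again. What follows is a minimal sketch of that idea, not Tune's actual code: `CheckpointSlot`, `replace`, and `delete_fn` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Optional


class CheckpointSlot:
    """Holds the most recent checkpoint value (hypothetical helper)."""

    def __init__(self, delete_fn: Callable[[str], None]) -> None:
        self.value: Optional[str] = None
        self._delete_fn = delete_fn

    def replace(self, new_value: str) -> None:
        old_value = self.value
        # Return early if the old and new checkpoint values coincide,
        # so a checkpoint that is still in use is never deleted.
        if old_value == new_value:
            return
        self.value = new_value
        # Only delete the old checkpoint once a different one replaced it.
        if old_value is not None:
            self._delete_fn(old_value)


deleted: List[str] = []
slot = CheckpointSlot(deleted.append)
slot.replace("checkpoint_1")
slot.replace("checkpoint_1")  # same value: nothing is deleted
slot.replace("checkpoint_2")  # different value: checkpoint_1 is deleted
```

Without the early return, registering the same value twice would delete the checkpoint that the slot still points to.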

* [Core] Remove socket pair exchange in Plasma Store (#9565)

* try use boost::asio for notification processing

* [Metric] new cython interface for python worker metric (#9469)

* Bazel fixes (#9519)

* GCS client add fetch operation before subscribe (#9564)

* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (#9521)

* Change aggregation when lockstep is activated.

Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.

fix ray-project/ray#9295

* Line too long.

* [Core] Replace the Plasma eventloop with boost::asio (#9431)

* Fix Java named actor bug (#9580)

* Fix setup.py bug (#9581)

Co-authored-by: Mehrdad <[email protected]>

* [Serve] Serialize Query object directly (#9490)

* Add dashboard dependencies to default ray installation (#9447)

* Dashboard next-version API support in backend (#9345)

* Fix log losses (#9559)

* Close log on shutdown

* Disable log buffering

Co-authored-by: Mehrdad <[email protected]>

* [docker] run Ubuntu 20.04 as base image (#9556)

* Add PTL to README.rst (#9594)

Co-authored-by: Richard Liaw <[email protected]>

* Skip unneeded steps on CI (#9582)

Co-authored-by: Mehrdad <[email protected]>

* Fix Windows CI (#9588)

Co-authored-by: Mehrdad <[email protected]>

* [serve] Rename to `Controller` (#9566)

* Handle warnings in core (#9575)

* [New scheduler] Fix new scheduler bug (#9467)

* fix new scheduler bug

* add testcase for soft resource allocation

* modify RemoveNode

* Ensure unique log file names across same-node raylets. (#9561)

* fix tag key typo (#9606)

* Rename path variable due to zsh conflict (#9610)

* [doc] [minor] Make API docs easier to find. (#9604)

* Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572)

* Use UTF-8 for encoding of python code for collision hashing (#9586)

Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: simon-mo <[email protected]>

* Add bazel to the PATH in setup.py (#9590)

Co-authored-by: Mehrdad <[email protected]>

* Fix Lint in setup.py (#9618)

Co-authored-by: Mehrdad <[email protected]>

* Shellcheck comments (#9595)

* [Serve] Document Metric Infrastructure (#9389)

* [CI] Do not run jenkins test on GHA (#9621)

* Support ray task type checking (#9574)

* [Metrics] Java metric API (#9377)

* [GCS] fix the fault tolerance about gcs node manager (#9380)

* Shellcheck quoting (#9596)

* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.

* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.

* Fix SC2046: Quote this to prevent word splitting.

* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.

* Fix SC2068: Double quote array expansions to avoid re-splitting elements.

* Fix SC2086: Double quote to prevent globbing and word splitting.

* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).

* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?

* Fix SC2145: Argument mixes string and array. Use * or separate argument.

* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).

Co-authored-by: Mehrdad <[email protected]>

* Fix bug in Bazel version check (#9626)

Co-authored-by: Mehrdad <[email protected]>

* [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033)

* Revert "Dashboard next-version API support in backend (#9345)" (#9639)

This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe.

* [Autoscaler] Command Line Interface improvements (#9322)

Co-authored-by: Richard Liaw <[email protected]>

* [Core] GCS Actor management on by default. (#8845)

* GCS Actor management on by default.

* Fix travis config.

* Change condition.

* Remove unnecessary CI.

* [Core] Fix concurrency issues in plasma store runner (#9642)

* fix window jni unhappy compiler (#9635)

* Fix TestObjectTableResubscribe testcase bug (#9650)

* fix named actor single process mode bug (#9652)

* [core] Fix Ray service startup when logging redirection is disabled. (#9547)

* Fix TorchDeterministic (#9241)

* [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661)

Co-authored-by: Kai Fricke <[email protected]>

* [rllib] Fix torch TD error, IMPALA LR updates (#9477)

* update

* add test

* lint

* fix super call

* speed es test up

* Auto-cancel build when a new commit is pushed (#8043)

Co-authored-by: Mehrdad <[email protected]>

* Fix lint in remote-watch.py (#9668)

* [Core] Remove unnecessary windows syscall in plasma store (#9602)

* Remove unused windows shims (#9583)

* Temporarily disable remote watcher (#9669)

* Drop support for Python 3.5. (#9622)

* Drop support for Python 3.5.

* Update setup.py

* [Core] WorkerInterface refactor (#9655)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* .

* .

* .

* Fixed tests

* Fixed tests

* .

* [core] Enable object reconstruction for retryable actor tasks (#9557)

* Test actor plasma reconstruction

* Allow resubmission of actor tasks

* doc

* Test for actor constructor

* Kill PID before removing node

* Kill pid before node

* fix java coreworker crash (#9674)

* use help proto-init-macro for streaming config (#9272)

* Update release information from 0.8.6. (#9124)

* [BRING BACK TO MASTER] Update release information.

* [MERGE TO MASTER] Add microbenchmark result.

* Update asan tests to the doc.

* Refinements to the Serve documentation (#9587)

Co-authored-by: Dean Wampler <[email protected]>

* [tune] survey (#9670)

* Fix ERROR logging not being printed to standard error (#9633)

Co-authored-by: Mehrdad <[email protected]>

* [Tune Docs] Logging doc fix (#9691)

* [rllib] Type annotations for model classes (#9646)

* [Serve] Allow multiple HTTP servers. (#9523)

* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (#9681)

* [Serve] Fix Formatting, stale docs (#9617)

* fixed simplex initialisation seeding bug (#9660)

Co-authored-by: Petros Christodoulou <[email protected]>

* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697)

Co-authored-by: Mehrdad <[email protected]>

* Add Ray Serve to README.rst (#9688)

* Shellcheck rewrites (#9597)

* Fix SC2001: See if you can use ${variable//search/replace} instead.

* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.

* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

* Fix SC2028: echo may not expand escape sequences. Use printf.

* Fix SC2034: variable appears unused. Verify use (or export if used externally).

* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.

* Fix SC2071: > is for string comparisons. Use -gt instead.

* Fix SC2154: variable is referenced but not assigned

* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

* Fix SC2236: Use -n instead of ! -z.

* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.

* Fix SC2086: Double quote to prevent globbing and word splitting.

Co-authored-by: Mehrdad <[email protected]>

* [Autoscaler] CLI Logger docs (#9690)

Co-authored-by: Richard Liaw <[email protected]>

* Update rllib-algorithms.rst (#9640)

* [tune] move jenkins tests to travis (#9609)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* [RLlib] Implement DQN PyTorch distributional head. (#9589)

* Add placement group java api (#9611)

* add part code

* add part code

* add part code

* fix code style

* fix review comment

* fix review comment

* add part code

* add part code

* add part code

* add part code

* fix review comment

* fix review comment

* fix code style

* fix review comment

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Stats] Improve Stats::Init & Add it to GCS server (#9563)

* [Core] Try remove all windows compat shims (#9671)

* try remove compat for arrow

* remove unistd.h

* remove socket compat

* delete arrow windows patch

* Fix a few flaky tests (#9709)

Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency

* [GCS]Open test_gcs_fault_tolerance testcase (#9677)

* enable test_gcs_fault_tolerance

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Tests]lock vector to avoid potential flaky test (#9656)

* [tune] distributed torch wrapper (#9550)

* changes

* add-working

* checkpoint

* cleanup

* fix

* ok

* formatting

* ok

* tests

* some-good-stuff

* fix-torch

* ddp-torch

* torch-test

* sessions

* add-small-test

* fix

* remove

* gpu-working

* update-tests

* ok

* try-test

* format

* ok

* ok

* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045)

* Only update raylet map when autoscaler configured (#9435)

* [Dashboard] New dashboard skeleton (#9099)

* Fixing multiple building issues

* Make wait_for_condition raise exception when timing out. (#9710)
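
A polling helper that raises on timeout, rather than returning a value the caller can forget to check, can be sketched as follows. This is an illustration under assumptions, not the helper changed in #9710: the function name `wait_for_condition_sketch`, its parameters, and the choice of `RuntimeError` are all hypothetical.

```python
import time
from typing import Callable


def wait_for_condition_sketch(
    predicate: Callable[[], bool],
    timeout: float = 10.0,
    retry_interval: float = 0.1,
) -> None:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(retry_interval)
    # Raising makes a timed-out wait fail the test loudly instead of
    # letting it continue with an unmet precondition.
    raise RuntimeError("The condition was not met before the timeout expired.")
```

A test that calls the helper no longer needs to assert on a return value; an unmet condition surfaces as an exception at the point of the wait.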

* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718)

* Package and upload ray cross-platform jar (#9540)

* Revert "Package and upload ray cross-platform jar (#9540)" (#9730)

This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e.

* Only build docker wheels in LINUX_WHEELS env (#9729)

* Keep build-autoscaler-images.sh alive in CI (#9720)

* [core] Removes Error when Internal Config is not set (#9700)

* [Cluster Launcher] Re Org the cluster launcher pages. (#9687)

* [RLlib] Offline Type Annotations (#9676)

* Offline Annotations

* Modifications

* Fixed circular dependencies

* Linter fix

* Python api of placement group (#9243)

* Include open-ssh-client for transparency (#9693)

* Fix remote-watch.py (#9625)

Co-authored-by: Mehrdad <[email protected]>

* [docker] Uses Latest Conda & Py 3.7 (#9732)

* Fix broken actor failure tests. (#9737)

* [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727)

* Fix package and upload ray jar (#9742)

* Introduce file_mounts_sync_continuously cluster option (#9544)

* Separate out file_mounts contents hashing into its own separate hash

Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes

* add test and default value for file_mounts_sync_continuously

* format code

* Update comments

* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick

Fixed so setup commands run when ray up is run and file_mounts content changes

* Refactor so that runtime_hash retains previous behavior

runtime_hash is almost identical to what it was before this PR. It is used to determine whether setup_commands need to run
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.

Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization

* fix issue with hashing a hash

* fix bug where trying to set contents hash when it wasn't generated

* Fix lint error

Fix bug in command_runner where check_output was no longer returning the output of the command

* clear out provider between tests to get rid of flakiness

* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call
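
The split between a hash that triggers setup commands and a hash that triggers only file syncing, as described in the #9544 notes above, can be sketched roughly as follows. This is a simplified illustration, not the autoscaler's implementation: `config_hash`, `contents_hash`, `plan_update`, and the action names are hypothetical, and the two hashes are kept fully independent here for clarity.

```python
import hashlib
import json
from typing import Dict, List


def config_hash(config: Dict) -> str:
    """Hash the config whose changes require setup commands to re-run."""
    encoded = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def contents_hash(files: Dict[str, bytes]) -> str:
    """Hash only the contents of the mounted files."""
    hasher = hashlib.sha1()
    for path in sorted(files):
        hasher.update(path.encode("utf-8"))
        hasher.update(b"\0")  # separator so path and data cannot blur together
        hasher.update(files[path])
        hasher.update(b"\0")
    return hasher.hexdigest()


def plan_update(
    old_config: str, new_config: str, old_contents: str, new_contents: str
) -> List[str]:
    """Decide what one update tick should do (hypothetical action names)."""
    if old_config != new_config:
        return ["sync_file_mounts", "run_setup_commands"]
    if old_contents != new_contents:
        return ["sync_file_mounts"]
    return []


cfg = {"setup_commands": ["pip install -U ray"]}
files_v1 = {"/app/run.py": b"print(1)"}
files_v2 = {"/app/run.py": b"print(2)"}
only_files_changed = plan_update(
    config_hash(cfg), config_hash(cfg),
    contents_hash(files_v1), contents_hash(files_v2),
)
```

With an unchanged config and edited file contents, `only_files_changed` contains just the sync action, which is the case the continuous-sync option targets.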

* [dist] swap mac/linux wheel build order (#9746)

* [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684)

* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680)

* [Metrics]Ray java worker metric registry (#9636)

* ray worker metrics gauge init

* ray java metric mapping

* add jni source files for gauge and tagkey

* mapping all metric classes to stats object

* check non-null for tags and name

* lint

* add symbol for native metric JNI

* extern c for symbol

* add tests for all metrics

* Update Metric.java

use metricNativePointer instead.

* unify metric native stuff to one class

* fix jni file

* add comments for metric transform function in jni utils

* move metric function to native metric file

* remove unused disconnect jni

* Add a metric registry for java metrics

* Restore install-bazel.sh

* Add some comments for metric registry

* Fix thread safe problem of metrics

* Fix metric tests and remove sleep code from tests

* Fix comments of metrics

Co-authored-by: lingxuan.zlx <[email protected]>

* fix windows compile bug (#9741)

Co-authored-by: 灵洵 <[email protected]>

* Run _with_interactive in Docker (#9747)

* [New scheduler] First unit test for task manager (#9696)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* bad git >:-(

* small clean up

* CR

* .

* .

* One more fixture

* One more fixture

* .

* .

* bazel-format

* .

* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607)

* [Release] Fix release tests (#9733)

* Register function race (#9346)

* Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758)

* Revert "[dist] swap mac/linux wheel build order (#9746)"

This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85.

* Revert "Fix package and upload ray jar (#9742)"

This reverts commit c290c308fe1e496480db5c37489df619cff6168f.

* Fix some Windows CI issues (#9708)

Co-authored-by: Mehrdad <[email protected]>

* Pin pytest version (#9767)

* [Java] Use test groups to filter tests of different run modes (#9703)

* [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770)

* Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (#9719)

* [Stats] enable core worker stats (#9355)

* [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416)

* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo

* [GCS Actor Management] Fix flaky test_dead_actors. (#9715)

* Fix.

* Add logs.

* Add a unit test.

* [TUNE] Tune Docs re-organization (#9600)

Co-authored-by: Richard Liaw <[email protected]>

* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (#9678)

* [Core] Socket creation race condition bug fixes (#9764)

* fix issues

* hot fixes

* test

* test

* Always info log

* Fixed stderr logging (9765)

* [Core] Custom socket name (#9766)

* fix issues

* hot fixes

* test

* test

* socket name change only

* Fix src/ray/core_worker/common.h deleted constructor (#9785)

Co-authored-by: Mehrdad <[email protected]>

* [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745)

* More fixes

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Fix some CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix bazel test command

* Fix bazel test

* Fix general info steps

* Custom env var for docker build

* Trying a different way to install bazel

* Bazel fix

* Updating hash

Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Alisa <[email protected]>
Co-authored-by: Lingxuan Zuo <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Stefan Schneider <[email protected]>
Co-authored-by: Patrick Ames <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Co-authored-by: fangfengbin <[email protected]>
Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: Kai Yang <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: Henk Tillman <[email protected]>
Co-authored-by: Tanay Wakhare <[email protected]>
Co-authored-by: Nicolaus93 <[email protected]>
Co-authored-by: Vasily Litvinov <[email protected]>
Co-authored-by: krfricke <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: kisuke95 <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Michael Mui <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: chaokunyang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Michael Luo <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: Tom <[email protected]>
Co-authored-by: jerrylee.io <[email protected]>
Co-authored-by: Raphael Avalos <[email protected]>
Co-authored-by: William Falcon <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: ZhuSenlin <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Maksim Smolin <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Bill Chambers <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Justin Terry <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: fyrestone <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: bermaker <[email protected]>

* Sync Upstream master (#50)

* [core] Pull Manager exponential backoff (#13024)

* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)

* [release tests] test_many_tasks fix (#12984)

* Add "beta" documentation for enabling object spilling manually (#13047)

* [Serve] Handle Bug Fixes (#12971)

* [Dashboard] Add GET /logical/actors API (#12913)

* [GCS]Decouple gcs resource manager and gcs node manager (#13012)

* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)

* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)

* [RLlib] Fix broken unity3d_env import in example server script. (#13040)

* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)

* [joblib] Fix flaky joblib test. (#13046)

* [Tune]Add integer loguniform support (#12994)

* Add integer quantization and loguniform support

* Fix hyperopt qloguniform not being np.log'd first

* Add tests, __init__

* Try to fix tests, better exceptions

* Tweak docstrings

* Type checks in SearchSpaceTest

* Update docs

* Lint, tests

* Update doc/source/tune/api_docs/search_space.rst

Co-authored-by: Kai Fricke <[email protected]>

Co-authored-by: Kai Fricke <[email protected]>

* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)

* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update

* Update to new form of brew cask install command

* [Autoscaler] New output log format (#12772)

* Fix typo RMSProp -> RMSprop (#13063)

* [serve] Centralize HTTP-related logic in HTTPState (#13020)

* Remove suppress output to see why wheel is not building

* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)

* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* x

* lint

Co-authored-by: SangBin Cho <[email protected]>

* [docs] Fix args + kwargs instead of docstrings (#13068)

* functools wraps

* Fix typo (functoools -> functools)

* Fix OS X Wheel Build - Update brew cask install (#13062)

Co-authored-by: Richard Liaw <[email protected]>

* speed up local mode object store get (#13052)

Co-authored-by: senlin.zsl <[email protected]>

* [RLlib] Execution Annotation (#13036)

* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

* [C++ API] Added reference counting to ObjectRef (#13058)

* Added reference counting to ObjectRef

* Addressed the comments

* [Core] Remove cuda support in plasma store (#13070)

* remove cuda support in plasma store

* [Core] Remove outdated external store (#13080)

* remove outdated external store

* [GCS] Move resource usage info to gcs resource manager (#13059)

* [RLlib] JAXPolicy prep. PR #1. (#13077)

* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)

* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)

* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)

* other collectives all work

* auto-linting

* manual linting #1

* manual linting 2

* bugfix

* add send/recv point-to-point calls

* add some initial code for communicator caching

* auto linting

* optimize imports

* minor fix

* fix unpassed tests

* support more dtypes

* rerun some distributed tests for send/recv

* linting

* [Serve] [Doc] Front page update (#13032)

* Deprecate experimental / dynamic resources (#13019)

* [docs] fix wandb url (#13094)

* [Serve] Implement Graceful Shutdown (#13028)

* [Serve] Use ServeHandle in HTTP proxy (#12523)

* [Java] Format ray java code (#13056)

* [docker] Fix restart behavior with Docker (#12898)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: ijrsvt <[email protected]>

* Disable broken streaming tests (#13095)

* [autoscaler] Make placement groups bypass max launch limit (#13089)

* Serve metrics docs (#13096)

* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)

* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)

* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)

* Fix streaming ci failure (#12830)

* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)

* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)

* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)

* [RLlib] Trajectory view API docs. (#12718)

* Job module without submission (#13081)

Co-authored-by: 刘宝 <[email protected]>

* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)

* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)

* [Tune] Update URL to fix 403 not found error in PBT transformers test case (#13131)

* [serve] Async controller (#13111)

* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)

* [Serve] Use a small object to track requests (#13125)

* [docs][kubernetes][minor] Update K8s examples in docs (#13129)

* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)

* [docs] Documentation + example for the C++ language API (#13138)

* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)

* Remove check.

* Add test

* fix lint

* lint

* Fix spotless lint

* Address comments.

* Fix lint

Co-authored-by: Qing Wang <[email protected]>

* [docs] Minor change to formatting C++ docs. (#13151)

* Deprecate setResource java api (#13117)

* [docs] Small fix in C++ documentation. (#13154)

* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: root <[email protected]>

* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)

* [kubernetes][docs][minor] Kubernetes version warning (#13161)

* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)

* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <[email protected]>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <[email protected]>

* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)

* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)

* [Release] Update Release Process Documentation (#13123)

* [Core] Remove Arrow dependencies (#13157)

* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer

* [XGboost] Update Documentation (#13017)

Co-authored-by: Richard Liaw <[email protected]>

* [SGD] Fix Docstring for `as_trainable` (#13173)

* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)

This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.

* Surface object store spilling statistics in `ray memory` (#13124)

* [ray_client]: Move from experimental to util (#13176)

Change-Id: I9f054881f0429092d265cd6944d89804cce9d946

* Remove unused file(object_manager_integration_test.cc) (#12989)

* Notify listeners after registered node stored (#13069)

* [build]Update description and add some keywords (#13163)

* [Collective][PR 2/6] Driver program declarative interfaces (#12874)

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearrange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [serve] Merge ActorReconciler and BackendState (#13139)

* [tune] better signature check for `tune.sample_from` (#13171)

* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <[email protected]>

Co-authored-by: Sumanth Ratna <[email protected]>

* Disable atexit test on windows (#13207)

* [serve] Move controller state into separate files (#13204)

* Update multi_agent_independent_learning.py (#13196)

pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead

* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)

* [Tune] Fix PBT Transformers Example (#13174)

* [Serve] HTTPOptions for deployment modes (#13142)

* [tests] Fix Autoscaler Test failure on Windows (#13211)

* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <[email protected]>

Co-authored-by: Ameer Haj Ali <[email protected]>

* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)

* [GCS]Fix TestActorSubscribeAll bug (#13193)

* [Metrics] Record per node and raylet cpu / mem usage (#12982)

* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.

* [Tune] Fix tune serve integration example (#13233)

* [Redis] Note that each Redis Connect retry takes two minutes (#12183)

* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.

* [Log] fix spdlog init race (#12973)

* fix spdlog init race

* use global logger

* refine logger name and constructor

* [Release] Add 1.1.0 release test logs (#13054)

* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries

* [Core] Fix incorrect comment (#13228)

* [Serialization] Fix cloudpickle (#13242)

* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)

* Start ray client server with 'ray start' (#13217)

* [GCS]Add gcs actor schedule strategy (#13156)

* Publish job/worker info with Hex format instead of Binary (#13235)

* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)

* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)

Now that `HeadOnly` becomes the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(also fixed the controller's API to match up to date, we should
have caught these, I will open issues for this.)

* Update autoscaler-cluster yaml files for release tests (#13114)

* [Release] Use ray-ml image for long running test (#13267)

* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)

* [Tune] Improve error message for Session Detection (#13255)

* Improve error message

* log once

* [Tune] Pin Tune Dependencies (#13027)

Co-authored-by: Ian <[email protected]>

* [Dependabot] Add Dependabot (#13278)

Co-authored-by: Ian <[email protected]>

* [docker] Pull if image is not present (#13136)

* [GCS] Remove old lightweight resource usage report code path (#13192)

* [Dashboard] Add GET /log_proxy API (#13165)

* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)

* [ray_client] Add metadata to gRPC requests (#13167)

* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)

* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)

* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)

* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)

* [Pull manager] Only pull once per retry period (#13245)

* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <[email protected]>

* [Cancellation] Make Test Cancel Easier to Debug (#13243)

* first commit

* lint-fix

* [ray_client]: first draft of documentation (#13216)

* Do not give an error if both `RAY_ADDRESS` and `address` is specified on initialization (#13305)

* Finalize handling of RAY_ADDRESS

* lint

* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)

* [RLlib] SlateQ Documentation (#13266)

* [RLlib] Add more detailed Documentation on Model building API (#13261)

* [tune] convert search spaces: parse spec before flattening (#12785)

* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms

* remove empty extras streaming deps (#12933)

* add the method annotation and a comment explaining what's happening (#13306)

Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a

* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)

* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)

* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)

* fix removal of task dependencies (#13333)

Co-authored-by: senlin.zsl <[email protected]>

* [Serve] Support Starlette streaming response (#13328)

* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)

* [client] Report number of currently active clients on connect (#13326)

* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle

* Implement internal kv in ray client (#13344)

* kv internal

* fix

* [Tune] Rename MLFlow to MLflow (#13301)

* Forgot overwrite parameter in Ray client internal kv

* Fix typo in Tune Docs (Checkpointing) (#13348)

See issue #13299

* [Kubernetes][Docs] GPU usage (#13325)

* gpu-note

* gpu-note

* More info

* lint?

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* GKE->Kubernetes

Co-authored-by: Richard Liaw <[email protected]>

* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)

This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.

* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)

* [tune] buffer trainable results (#13236)

* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <[email protected]>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <[email protected]>

* [Serve] Add dependency management support for driver not running in a conda env (#13269)

* [RLlib] Add `__len__()` method to SampleBatch (#13371)

* [Serve] Backend state unit tests (#13319)

* trigger doc build for serve updates (#13373)

* [Object Spilling] Long running object spilling test (#13331)

* done.

* formatting.

* Remove unimplemented GetAll method in actor info accessor (#13362)

* [Doc] Remove trailing whitespaces (#13390)

* Enable Ray client server by default (#13350)

* update

* fix

* fix test

* update

* [RLlib] Trajectory View API: Atari framestacking. (#13315)

* [ray_client]: Wait for ready and retry on ray.connect() (#13376)

* [ray_client]: wait until connection ready

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

* lint

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

* docs and retry minimum

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [Dashboard] Fix missing actor pid (#13229)

* [ray_client]: Fix multiple attempts at checking connection (#13422)

* Plumb retries update (#13411)

* [Serve] [Doc] Improve batching doc (#13389)

* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)

* Fix Serve release test (#13385)

* Add bazel logs upload to GHA (#13251)

* [tune] Fix f-string in error message (#13423)

* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)

* Make request_resources() use internal kv instead of redis pub sub (#13410)

* Remove unused handler methods (#13394)

* [Tune] Pin Transitive Dependencies (#13358)

* Split out the part of get_node_ip_address for which the docstring is correct (#12796)

* Fix raylet::MockWorker::GetProcess crashes (#13440)

Co-authored-by: 刘宝 <[email protected]>

* Revert "Enable Ray client server by default (#13350)" (#13429)

This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.

* Fix linter error (#13451)

* [GCS]Add gcs resource scheduler (#13072)

* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)

* [Core]Fix raylet scheduling bug (#13452)

* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [joblib] joblib strikes again but this time on windows (#13212)

* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)

* [kubernetes][minor] Operator garbage collection fix (#13392)

* [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391)

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict

* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)

* [docs] Add more guideline on using ray in slurm cluster (#12819)

Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: PENG Zhenghao <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [Dashboard] Fix GPU resource rendering issue (#13388)

* [Release] Fix Serve release test (#13303)

The Docker image we were using now uses `ray` users so we have to call
sudo.

* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)

* Fix getting runtime context dict in driver (#13417)

* [xgb] re-enable xgboost_ray tests (#13416)

* re-enable

* fix

* update xgb_ray version

* [Serialization] New custom serialization API (#13291)

* new serialization API with doc & test

* add more notes

* refine notes

* doc

* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)

* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.

* [ray_client]: Support runtime_context as metadata (#13428)

* [GCS]Remove unused class variable (#13454)

* [Object Spilling] Dedup restore objects (#13470)

* done.

* Addressed code review.

* [CI] Enable Dashboard tests for master (#13425)

* [docker/dashboard] Fix ray dashboard (#12899)

* [CI] Fix Windows Bazel Upload (#13436)

* Return version info from Ray client connect, to allow for discovering version mismatches

* Update ID specification doc (#13356)

* [ray_client]: fix wrong reference in server_pickler (#13474)

Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf

* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)

* wip

* fix

* fix

* Remove an unnecessary file (#13499)

* [Tests] Skip failing windows tests (#13495)

* skip failing windows tests

* skip more

* remove

* updates

* [tune] fix small docs typo (#13355)

Signed-off-by: Richard Liaw <[email protected]>

* move message to debug (#13472)

* Minimal version of piping autoscaler events to driver logs (#13434)

* sync write internal config in gcs (#13197)

* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)

* [GCS]Only publish changed field when node dead (#13364)

* Only update changed field when node dead

* node_id missed

* [CI] Buildkite PR Environment for Simple Tests (#13130)

* [GCS] Remove task info publish as nowhere uses it (#13509)

* Remove task info publish as nowhere uses it

* simplify right publish channel

* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)

* [tune] placement group support (#13370)

* [Serve] Allow ObjectRef for Composition (#12592)

* Add Dashboard Python Test to Buildkite (#13530)

* Add ability to not start Monitor when calling `ray start` (#13505)

* [tune] support experiment checkpointing for grid search (#13357)

* Fix typo (#13098)

* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)

* [RLlib] MARWIL loss function test case and cleanup. (#134…
Showing 186 changed files with 10,360 additions and 1,502 deletions.
6 changes: 3 additions & 3 deletions .bazelrc
@@ -28,9 +28,9 @@ build:windows --enable_runfiles
# for compiling assembly files is fixed on Windows:
# https://github.com/bazelbuild/bazel/issues/8924
# Warnings should be errors
build:linux --per_file_copt="-\\.(asm|S)$@-Werror"
build:macos --per_file_copt="-\\.(asm|S)$@-Werror"
build:clang-cl --per_file_copt="-\\.(asm|S)$@-Werror"
#build:linux --per_file_copt="-\\.(asm|S)$@-Werror"
#build:macos --per_file_copt="-\\.(asm|S)$@-Werror"
#build:clang-cl --per_file_copt="-\\.(asm|S)$@-Werror"
build:msvc --per_file_copt="-\\.(asm|S)$@-WX"
# Ignore warnings for protobuf generated files and external projects.
build --per_file_copt="\\.pb\\.cc$@-w"
1 change: 1 addition & 0 deletions .bazelversion
@@ -0,0 +1 @@
3.4.1
37 changes: 25 additions & 12 deletions .github/CODEOWNERS
@@ -1,38 +1,51 @@
# Each line is a file pattern followed by one or more owners.
# See https://help.github.com/articles/about-codeowners/
# for more info about CODEOWNERS file

# It uses the same pattern rule for gitignore file,
# see https://git-scm.com/docs/gitignore#_pattern_format.

# ==== Ray default ====
# These owners will be the default owners for everything in
# the repo. Unless a later match takes precedence,
# @BonsaiAI/ray-code-owners will be requested for
# review when someone opens a pull request.
* @BonsaiAI/ray-code-owners


# ==== Ray core ====

# All C++ code.
/src/ray @ray-project/ray-core-cpp
/src/ray @BonsaiAI/ray-maintainers

# Python worker.
/python/ray/ @ray-project/ray-core-python
!/python/ray/tune/ @ray-project/ray-core-python
!/python/ray/rllib/ @ray-project/ray-core-python
/python/ray/ @BonsaiAI/ray-maintainers
!/python/ray/tune/ @BonsaiAI/ray-maintainers
!/python/ray/rllib/ @BonsaiAI/ray-maintainers

# Java worker.
/java/ @ray-project/ray-core-java
/java/ @BonsaiAI/ray-maintainers

# Kube Operator.
/deploy/ @BonsaiAI/ray-maintainers

# ==== Libraries and frameworks ====

# Ray tune.
/python/ray/tune/ @ray-project/ray-tune
/python/ray/tune/ @BonsaiAI/ray-code-owners

# RLlib.
/python/ray/rllib/ @ray-project/rllib
/python/ray/rllib/ @BonsaiAI/ray-code-owners
/rllib/ @BonsaiAI/ray-code-owners

# ==== Build and CI ====

# Bazel.
/BUILD.bazel @ray-project/ray-core
/WORKSPACE @ray-project/ray-core
/bazel/ @ray-project/ray-core
/BUILD.bazel @BonsaiAI/ray-code-owners
/WORKSPACE @BonsaiAI/ray-code-owners
/bazel/ @BonsaiAI/ray-code-owners

# CI scripts.
/.travis.yml @ray-project/ray-core
/ci/travis/ @ray-project/ray-core
/.travis.yml @BonsaiAI/ray-maintainers
/ci/ @BonsaiAI/ray-maintainers

6 changes: 6 additions & 0 deletions .github/workflows/main.yml
@@ -5,7 +5,13 @@ on:
branches-ignore:
# Don't run CI for Dependabot branch pushes.
- "dependabot/**"
- '**'
# branches:
# - master
# - releases/*
pull_request:
branches-ignore:
- '**'

env:
# Git GITHUB_... variables are useful for translating Travis environment variables
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# The build output should clearly not be checked in
*test-output.xml
/bazel-*
/bazel-ray/
/python/ray/core
/python/ray/pickle5_files/
/python/ray/thirdparty_files/
@@ -186,3 +187,7 @@ tools/prometheus*
# ray project files
project-id
.mypy_cache/

# PyCharm
.ijwb/
.run/
4 changes: 2 additions & 2 deletions bazel/ray_deps_setup.bzl
@@ -120,8 +120,8 @@ def ray_deps_setup():

auto_http_archive(
name = "bazel_common",
url = "https://github.com/google/bazel-common/archive/084aadd3b854cad5d5e754a7e7d958ac531e6801.tar.gz",
sha256 = "a6e372118bc961b182a3a86344c0385b6b509882929c6b12dc03bb5084c775d5",
url = "https://github.com/google/bazel-common/archive/bf87eb1a4ddbfc95e215b0897f3edc89b2254a1a.tar.gz",
sha256 = "dab4cbd634aae4bc9b116f4de5737e4d3c0754c3a1d712ad4a9b75140d278614",
)

auto_http_archive(
225 changes: 225 additions & 0 deletions ci/azure_pipelines/README.md
@@ -0,0 +1,225 @@
# Azure Pipelines

This folder contains the code required to create the Azure Pipelines for the CI/CD of the Ray project.
Keep in mind that these instructions could be outdated.
Check the following links if you need to update the procedure.
- [Azure virtual machine scale set agents](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/scale-set-agents?view=azure-devops)
- [Repo for the Azure Pipelines images](https://github.com/actions/virtual-environments)

## Self-hosted Linux Agents

### Create VM Image

The following instructions build the VM image of a self-hosted Linux agent using a Virtual Hard Drive (VHD).
The image is the same one used by the Microsoft-hosted Linux agents. This approach
simplifies maintenance and keeps the pipeline code compatible with both
types of agents.

Requirements:
- Install packer: https://www.packer.io/downloads.html
- Install azure-cli: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest

Steps for Mac and Ubuntu:
- Clone the GitHub Actions virtual environments repo: `git clone https://github.com/actions/virtual-environments.git`
- Move into the folder of the repo cloned above: `pushd virtual-environments/images/linux`
- Log in to your Azure account: `az login`
- Set your Azure subscription id and tenant id:
- Check your subscriptions: `az account list --output table`
- Set your default (replace your Subscription id in the command): `az account set -s {Subscription Id}`
- Get the subscription id: `SUBSCRIPTION_ID=$(az account show --query 'id' --output tsv)`
- Get the tenant id: `TENANT_ID=$(az account show --query 'tenantId' --output tsv)`
- Select the azure location: `AZURE_LOCATION="eastus"`
- Create and select the name of the resource group where the Azure resources will be created:
- Set the group: `RESOURCE_GROUP_NAME="RayADOAgents"`
- Try to create the group. If the resource group exists, the details for it will be returned: `az group create -n $RESOURCE_GROUP_NAME -l $AZURE_LOCATION`
- Create a Storage Account:
- Set Storage Account name: `STORAGE_ACCOUNT_NAME="rayadoagentsimage"`
- Create the Storage Account: `az storage account create -n $STORAGE_ACCOUNT_NAME -g $RESOURCE_GROUP_NAME -l $AZURE_LOCATION --sku "Standard_LRS"`
- Create a Service Principal. If you have an existing Service Principal, it can also be used instead of creating a new one:
- Set the object id: `OBJECT_ID="http://rayadoagents"`
- Create client and get secret: `CLIENT_SECRET=$(az ad sp create-for-rbac -n $OBJECT_ID --scopes="/subscriptions/${SUBSCRIPTION_ID}" --query 'password' -o tsv)`. If the Service Principal already exists, this command returns the id of the role assignment. In that case, use your old password, or delete the existing Service Principal with `az ad sp delete --id $OBJECT_ID`.
- Get client id: `CLIENT_ID=$(az ad sp show --id $OBJECT_ID --query 'appId' -o tsv)`
- Set Install password: `INSTALL_PASSWORD="$CLIENT_SECRET"`
- Create a Key Vault. If you have an existing Key Vault, it can also be used instead of creating a new one:
- Set Key Vault name: `KEY_VAULT_NAME="ray-agent-secrets"`
- Create the Key Vault: `az keyvault create --name $KEY_VAULT_NAME --resource-group $RESOURCE_GROUP_NAME --location $AZURE_LOCATION`. If the Key Vault exists, this command returns its details.
- Set a GitHub Personal Access Token with rights to download:
- Set the secret name: `GITHUB_FEED_TOKEN_NAME="raygithubfeedtoken"`
- Upload your PAT to the vault (replace your token in the command): `az keyvault secret set --name $GITHUB_FEED_TOKEN_NAME --vault-name $KEY_VAULT_NAME --value "{GitHub Token}"`
- Get PAT from the Vault: `GITHUB_FEED_TOKEN=$(az keyvault secret show --name $GITHUB_FEED_TOKEN_NAME --vault-name $KEY_VAULT_NAME --query 'value' --output tsv)`
- Create the Managed Disk image:
- Create a packer variables file:
```
cat << EOF > azure-variables.json
{
"client_id": "${CLIENT_ID}",
"client_secret": "${CLIENT_SECRET}",
"subscription_id": "${SUBSCRIPTION_ID}",
"tenant_id": "${TENANT_ID}",
"object_id": "${OBJECT_ID}",
"location": "${AZURE_LOCATION}",
"resource_group": "${RESOURCE_GROUP_NAME}",
"storage_account": "${STORAGE_ACCOUNT_NAME}",
"install_password": "${INSTALL_PASSWORD}",
"github_feed_token": "${GITHUB_FEED_TOKEN}"
}
EOF
```
- Execute packer build: `packer build -var-file=azure-variables.json ubuntu1604.json`

For more details, [check the following doc in the virtual environments repo](https://github.com/actions/virtual-environments/blob/master/help/CreateImageAndAzureResources.md).
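
Any variable left empty in the steps above (for example `CLIENT_SECRET` when the Service Principal already existed) is written into `azure-variables.json` as an empty string. The following is a minimal sketch of a pre-flight check, assuming a POSIX shell; `require_vars` is a hypothetical helper and is not part of this repo or of packer.

```shell
# Hypothetical helper (an assumption, not part of the repo): return non-zero
# and list the names of any variables that are unset or empty.
require_vars() {
  _rv_missing=0
  for _rv_name in "$@"; do
    # Look up the value of the variable whose name is stored in _rv_name.
    eval "_rv_value=\${${_rv_name}:-}"
    if [ -z "$_rv_value" ]; then
      echo "missing: ${_rv_name}" >&2
      _rv_missing=1
    fi
  done
  return "$_rv_missing"
}

# Usage, just before the `cat << EOF > azure-variables.json` step:
#   require_vars CLIENT_ID CLIENT_SECRET SUBSCRIPTION_ID TENANT_ID OBJECT_ID \
#     AZURE_LOCATION RESOURCE_GROUP_NAME STORAGE_ACCOUNT_NAME \
#     INSTALL_PASSWORD GITHUB_FEED_TOKEN || echo "set the variables listed above first"
```

The helper only checks that the values are non-empty; it does not verify that the credentials are valid.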


### Create Agent Pool

#### 1. Create the Virtual Machine Scale Set (VMSS)

Creation of the VMSS is done using the Azure Resource Manager (ARM) template, `image/agentpool.json`. The following are important fixed parameters that could be changed:

| Parameter | Description |
| ------------- | ------------- |
| vmssName | name of the VMSS to be created |
| instanceCount | number of VMs to create in the initial deployment (can be changed later) |

Steps for Mac and Ubuntu:
- Log in to your Azure account: `az login`
- Set your Azure subscription id and tenant id:
- Check your subscriptions: `az account list --output table`
- Set your default: `az account set -s {Subscription Id}`
- Get the subscription id: `SUBSCRIPTION_ID=$(az account show --query 'id' --output tsv)`
- Get the tenant id: `TENANT_ID=$(az account show --query 'tenantId' --output tsv)`
- Set Storage Account name (same as above): `STORAGE_ACCOUNT_NAME="rayadoagentsimage"`
- Select the azure location: `AZURE_LOCATION="eastus"`
- Create and select the name of the resource group where the Azure resources will be created:
- Set the group: `RESOURCE_GROUP_NAME="RayADOAgents"`
- Try to create the group. If the resource group exists, the details for it will be returned: `az group create -n $RESOURCE_GROUP_NAME -l $AZURE_LOCATION`
- Create a Key Vault. If you have an existing Key Vault, it can also be used instead of creating a new one:
- Set Key Vault name: `KEY_VAULT_NAME="ray-agent-secrets"`
- Create the Key Vault: `az keyvault create --name $KEY_VAULT_NAME --resource-group $RESOURCE_GROUP_NAME --location $AZURE_LOCATION`. If the Key Vault exists, this command returns its details.
- Create a Key Pair in the Vault:
- Set Key Pair name: `SSH_KEY_PAIR_NAME="rayagentadminrsa"`
- Set the name for the public key: `SSH_KEY_PAIR_NAME_PUB="${SSH_KEY_PAIR_NAME}pub"`
- Set SSH key pair file path: `SSH_KEY_PAIR_PATH="$HOME/.ssh/$SSH_KEY_PAIR_NAME"`
- Create the SSH key pair: `ssh-keygen -m PEM -t rsa -b 4096 -f $SSH_KEY_PAIR_PATH`
- Upload your key pair to the vault:
- Public part to be used by the VMs: `az keyvault secret set --name $SSH_KEY_PAIR_NAME_PUB --vault-name $KEY_VAULT_NAME --file ${SSH_KEY_PAIR_PATH}.pub`
- (Optional) Private part, so it can be downloaded later for SSH access: `az keyvault secret set --name $SSH_KEY_PAIR_NAME --vault-name $KEY_VAULT_NAME --file $SSH_KEY_PAIR_PATH`
- Get public part from the Vault: `SSH_KEY_PUB=$(az keyvault secret show --name $SSH_KEY_PAIR_NAME_PUB --vault-name $KEY_VAULT_NAME --query 'value' --output tsv)`
- Create the VMSS:
- Set the Subnet Id of the subnet where the VMs must be: `SUBNET_ID="{Subnet Id}"`
- Set the VMSS name: `VMSS_NAME="RayPipelineAgentPoolStandardF16sv2"`
- Set the instance count: `INSTANCE_COUNT="2"`
- Get Reader role definition: `ROLE_DEFINITION_ID=$(az role definition list --subscription $SUBSCRIPTION_ID --query "([?roleName=='Reader'].id)[0]" --output tsv)`
- Set the source image VHD NAME (assuming the latest): `SOURCE_IMAGE_VHD_NAME="$(az storage blob list --subscription $SUBSCRIPTION_ID --account-name $STORAGE_ACCOUNT_NAME -c images --prefix pkr --query 'sort_by([], &properties.creationTime)[-1].name' --output tsv)"`
- Set the source image VHD URI: `SOURCE_IMAGE_VHD_URI="https://${STORAGE_ACCOUNT_NAME}.blob.core.windows.net/images/${SOURCE_IMAGE_VHD_NAME}"`
- Create the VM scale set: `az group deployment create --resource-group $RESOURCE_GROUP_NAME --template-file image/agentpool.json --parameters "vmssName=$VMSS_NAME" --parameters "instanceCount=$INSTANCE_COUNT" --parameters "sourceImageVhdUri=$SOURCE_IMAGE_VHD_URI" --parameters "sshPublicKey=$SSH_KEY_PUB" --parameters "location=$AZURE_LOCATION" --parameters "subnetId=$SUBNET_ID" --parameters "keyVaultName=$KEY_VAULT_NAME" --parameters "tenantId=$TENANT_ID" --parameters "roleDefinitionId=$ROLE_DEFINITION_ID" --name $VMSS_NAME`
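
If the blob query in the steps above finds no VHD, `SOURCE_IMAGE_VHD_NAME` is empty and the resulting URI points at the container rather than at an image. The following is a minimal sketch that builds the URI in the same format and refuses empty inputs; `build_vhd_uri` is a hypothetical helper, and the blob name in the usage comment is a placeholder, not a real image name.

```shell
# Hypothetical helper (an assumption, not part of the repo): compose the
# source image URI in the same format as the step above.
build_vhd_uri() {
  _bv_account="${1:-}"
  _bv_blob="${2:-}"
  if [ -z "$_bv_account" ] || [ -z "$_bv_blob" ]; then
    echo "storage account name and VHD blob name must both be non-empty" >&2
    return 1
  fi
  printf 'https://%s.blob.core.windows.net/images/%s\n' "$_bv_account" "$_bv_blob"
}

# Usage, replacing the plain assignment of SOURCE_IMAGE_VHD_URI:
#   SOURCE_IMAGE_VHD_URI="$(build_vhd_uri "$STORAGE_ACCOUNT_NAME" "$SOURCE_IMAGE_VHD_NAME")" \
#     || echo "no VHD found; build the image first"
```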

#### 2. Create the Agent Pool in Azure DevOps

Open Azure DevOps > "Project Settings" (bottom right) > "Agent Pools" > "New Agent Pool" > "Add pool" to create a new agent pool. Enter the agent pool's name, which must match the value you provided for `VMSS_NAME` (see steps above).

Make sure your admin is added as the administrator in ADO in 2 places:
- Azure DevOps > "Project Settings" (bottom right) > "Agent Pools" > [newly created agent pool] > "Security Tab" and
- Azure DevOps > bizair > Organization Settings > Agent Pools > Security

#### 3. Connect VMs to pool

Steps for Mac and Ubuntu:
- Copy some files to fix some errors in the generation of the agent image:
- The error is due to an issue with the packer script: it does not download a PostgreSQL installation script.
To check whether the image was fully built, connect to the VM using ssh (see steps below) and run: `INSTALLER_SCRIPT_FOLDER="/imagegeneration/installers" source /imagegeneration/installers/test-toolcache.sh`.
If you don't get any error message, skip the following 3 steps.
- Tar the image folder: `tar -zcvf image.tar.gz image`
- Set Key Pair name: `export SSH_KEY_PAIR_NAME="rayagentadminrsa"`
- Set SSH key pair file path: `export SSH_KEY_PAIR_PATH="$HOME/.ssh/$SSH_KEY_PAIR_NAME"`
- Set the IP of your VM: `export IP={my.ip}`
- Copy to each of your machines in the Scale set: `scp -o "IdentitiesOnly=yes" -i $SSH_KEY_PAIR_PATH ./image.tar.gz agentadmin@"${IP}":/home/agentadmin`
- Delete the tar: `rm image.tar.gz`
- Connect using ssh:
- Open an SSH session: `ssh -o "IdentitiesOnly=yes" -i $SSH_KEY_PAIR_PATH agentadmin@"${IP}"`
- Fix the image:
- Untar the image file: `tar zxvf ./image.tar.gz`
- Switch to root: `sudo -s`
- On your local machine, get the PAT from the Vault:
- Set the secret name: `export GITHUB_FEED_TOKEN_NAME="raygithubfeedtoken"`
- Set Key Vault name: `export KEY_VAULT_NAME="ray-agent-secrets"`
- Get the token: `az keyvault secret show --name $GITHUB_FEED_TOKEN_NAME --vault-name $KEY_VAULT_NAME --query 'value' --output tsv`
- Set the PAT in your ssh session: `export GITHUB_FEED_TOKEN={ GitHub Token }`
- Add agentadmin to the root group: `sudo gpasswd -a agentadmin root`
- Install missing part: `source ./image/fix-image.sh`
- Set the system up:
```
export GITHUB_FEED_TOKEN={ GitHub Token }
export DEBIAN_FRONTEND=noninteractive
export METADATA_FILE="/imagegeneration/metadatafile"
export HELPER_SCRIPTS="/imagegeneration/helpers"
export INSTALLER_SCRIPT_FOLDER="/imagegeneration/installers"
export BOOST_VERSIONS="1.69.0"
export BOOST_DEFAULT="1.69.0"
export AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache
mkdir -p $INSTALLER_SCRIPT_FOLDER/node_modules
sudo chmod --recursive a+rwx $INSTALLER_SCRIPT_FOLDER/node_modules
sudo chown -R agentadmin:root $INSTALLER_SCRIPT_FOLDER/node_modules
source $INSTALLER_SCRIPT_FOLDER/hosted-tool-cache.sh
source $INSTALLER_SCRIPT_FOLDER/test-toolcache.sh
chown -R agentadmin:root $AGENT_TOOLSDIRECTORY
echo 'export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
AGENT_TOOLSDIRECTORY="/opt/hostedtoolcache/"' >> ~/.bashrc
```
- Go to the [New Agent] option in the pool and follow the instructions for linux agents:
- Download the agent: `wget https://vstsagentpackage.azureedge.net/agent/2.170.1/vsts-agent-linux-x64-2.170.1.tar.gz`
- Create and move to a directory for the agent: `mkdir myagent && cd myagent`
- Untar the agent: `tar zxvf ../vsts-agent-linux-x64-2.170.1.tar.gz`
- Configure the agent: `./config.sh`
- Accept the license.
- Enter your organization URL.
- Enter your ADO PAT.
- Set a Personal Access Token:
- Set the secret name: `ADO_TOKEN_NAME="rayagentadotoken"`
- Upload your PAT to the vault (replace your token in the command): `az keyvault secret set --name $ADO_TOKEN_NAME --vault-name $KEY_VAULT_NAME --value "{ADO Token}"`
- Enter the agent pool's name, which must match the value you provided for `VMSS_NAME` (see steps above)
- Enter or accept agent name.
- Install the ADO Agent as a service and start it:
- `sudo ./svc.sh install`
- `sudo ./svc.sh start`
- `sudo ./svc.sh status`
- Allow agent user to access Docker:
- `export VM_ADMIN_USER="agentadmin"`
- `sudo gpasswd -a "${VM_ADMIN_USER}" docker`
- `sudo chmod ga+rw /var/run/docker.sock`
- Update group permissions so docker is available without logging out and back in: `newgrp - docker`
- Test docker: `docker run hello-world`
- `export VM_ADMIN_USER="agentadmin"`
- If `/home/"$VM_ADMIN_USER"/.docker` exists:
- `sudo chown "$VM_ADMIN_USER":docker /home/"$VM_ADMIN_USER"/.docker -R`
- `sudo chmod ga+rwx "$HOME/.docker" -R`
- Create a symlink:
- `mkdir -p /home/agentadmin/myagent/_work`
- `ln -s /opt/hostedtoolcache /home/agentadmin/myagent/_work/_tool`
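
Running the plain `ln -s` from the symlink step a second time does not replace the link: because `_tool` already resolves to a directory, `ln` creates a nested link inside `/opt/hostedtoolcache` instead. The following is a minimal sketch of a version that is safe to rerun, assuming an `ln` that supports `-n` (GNU coreutils and macOS both do); `link_tool_cache` is a hypothetical helper, not part of this repo.

```shell
# Hypothetical helper (an assumption, not part of the repo): create or
# refresh the _tool symlink inside the agent work directory.
link_tool_cache() {
  _lt_cache_dir="$1"
  _lt_work_dir="$2"
  mkdir -p "$_lt_work_dir" || return 1
  # -f replaces an existing link; -n treats an existing link to a
  # directory as a file instead of descending into it.
  ln -sfn "$_lt_cache_dir" "$_lt_work_dir/_tool"
}

# Usage with the paths from the steps above:
#   link_tool_cache /opt/hostedtoolcache /home/agentadmin/myagent/_work
```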

### Deleting an Agent Pool

1. Open Azure DevOps > Settings > Agent Pools > find pool to be removed and click "..." > Delete
2. Open Azure Portal > Key Vaults > ray-agent-secrets > Access Policies > delete the access policy assigned to the VMSS to be deleted
3. Open Azure Portal > All Resources > type the VMSS name into the search bar > select and delete the following resources tied to that VMSS:
- public IP address
- load balancer
- the VMSS itself

### Useful Commands

```
# Get connection info for all VMSS instances
az vmss list-instance-connection-info -g $RESOURCE_GROUP_NAME --name $VMSS_NAME
# SSH to a VMSS instance
ssh -o "IdentitiesOnly=yes" -i $SSH_KEY_PAIR_PATH agentadmin@{ PUBLIC IP}
# Download agentadmin private SSH key (formatting is lost if key is pulled from the UI)
az keyvault secret download --file $SSH_KEY_PAIR_PATH --vault-name $KEY_VAULT_NAME --name $SSH_KEY_PAIR_NAME
az keyvault secret download --file ~/downloads/PAT --vault-name $KEY_VAULT_NAME --name $ADO_TOKEN_NAME
```

0 comments on commit 412fd55