Sync debug branch with master #15983

Merged
merged 42 commits into from
Dec 9, 2022
Changes from 34 commits
2debd1c
Simplify enabling CPU offload in FSDP (#15832)
awaelchli Dec 7, 2022
d2a8fbf
[App] Enable running with spawn context (#15923)
tchaton Dec 7, 2022
6f54a82
Fix compiler support test (#15927)
lantiga Dec 7, 2022
6aaac8b
Enable back inference mode support with hpu & update links (#15918)
jerome-habana Dec 7, 2022
64b19fb
[App] Introduce auto scaler (#15769)
akihironitta Dec 7, 2022
2041908
ENG-627: Docs for CloudCompute Mount Argument (#15182)
rlizzo Dec 7, 2022
de93167
Fix LRScheduler import for PyTorch 2.0 (#15940)
lantiga Dec 7, 2022
06163e6
CI: fix pypi flow (#15944)
Borda Dec 7, 2022
e250dfe
[App] Remove `SingleProcessRuntime` (#15933)
ethanwharris Dec 7, 2022
1283226
[App] Fix bug when using structures with works (#15911)
ethanwharris Dec 8, 2022
b8c7018
[App] Wait for full file to be transferred in Path / Payload (#15934)
ethanwharris Dec 8, 2022
e6f4c84
[docs] Include all components in the API reference (#15805)
akihironitta Dec 8, 2022
73a6dbe
Bump playwright from 1.27.1 to 1.28.0 in /requirements (#15903)
dependabot[bot] Dec 8, 2022
d5b9c67
[App] Add `configure_layout` method for works (#15926)
ethanwharris Dec 8, 2022
0d822e4
Make gradients available for all_gather on TPU (#15003)
stekiri Dec 8, 2022
ca5ca0e
Don't try to aggregate `requirements/__pycache__/base.txt` in setupto…
akihironitta Dec 8, 2022
df67833
[App] Multiprocessing-safe work pickling (#15836)
Dec 8, 2022
8475f85
Upgrade to HPU release 1.7.1 (#15956)
jerome-habana Dec 8, 2022
36aecde
Multinode on MPS (#15748)
justusschock Dec 8, 2022
904323b
[App] Resolve PythonServer on M1 (#15949)
tchaton Dec 8, 2022
3004f13
Lite: Fix DataLoader shuffling when using DistributedSampler (#15931)
awaelchli Dec 8, 2022
d0b101c
[App] Temporarily disable ready (#15958)
ethanwharris Dec 8, 2022
15184c6
Fix restarting attribute for lr finder (#15620)
justusschock Dec 8, 2022
482b279
[App] Improve pdb for multiprocessing (#15950)
tchaton Dec 8, 2022
772d121
[App] Improve debug triggering (#15951)
tchaton Dec 8, 2022
67a47d4
[App] Add automatic conversion to structures (#15961)
tchaton Dec 8, 2022
b5fa896
Make LightningModule torch.jit.script-able again (#15947)
awaelchli Dec 8, 2022
23b12ee
refactor: simplify Tensor import (#15959)
Borda Dec 8, 2022
cbd4dd6
Fix ImportErrors on Multinode if package not present (#15963)
justusschock Dec 8, 2022
7a1e0e8
Fix typo in definition of world size in docs (#15954)
awaelchli Dec 8, 2022
4983083
[App] Enable running an app from the Gallery (#15941)
tchaton Dec 8, 2022
edc9986
Apply dynamo to training_step, validation_step, test_step, predict_st…
lantiga Dec 8, 2022
d4fe8fb
Merge branch 'master' into lite/debug-sync-master
awaelchli Dec 9, 2022
a468f9d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 9, 2022
58927e1
fix merge conflict
awaelchli Dec 9, 2022
650db79
rename tpu workflow
awaelchli Dec 9, 2022
de8a07e
triggers
awaelchli Dec 9, 2022
7ecade8
Merge branch 'lite/debug' into lite/debug-sync-master
awaelchli Dec 9, 2022
c3fe3b6
update
awaelchli Dec 9, 2022
519cccb
update
awaelchli Dec 9, 2022
8f61e36
more fixes
awaelchli Dec 9, 2022
866f824
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 9, 2022
2 changes: 1 addition & 1 deletion .actions/setup_tools.py
@@ -191,7 +191,7 @@ def _load_aggregate_requirements(req_dir: str = "requirements", freeze_requireme
load_requirements(d, file_name="base.txt", unfreeze=not freeze_requirements)
for d in glob.glob(os.path.join(req_dir, "*"))
# skip empty folder as git artefacts, and resolving Will's special issue
if os.path.isdir(d) and len(glob.glob(os.path.join(d, "*"))) > 0
if os.path.isdir(d) and len(glob.glob(os.path.join(d, "*"))) > 0 and "__pycache__" not in d
]
if not requires:
return None
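The added `"__pycache__" not in d` condition can be illustrated in isolation. The sketch below is a hedged, self-contained approximation: `find_requirement_dirs` is a hypothetical helper name (not the function in `.actions/setup_tools.py`), and the temporary directory layout is made up for the demonstration.

```python
import glob
import os
import tempfile


def find_requirement_dirs(req_dir: str) -> list:
    """Return sub-directories of `req_dir` that are non-empty and not a `__pycache__` folder."""
    return sorted(
        d
        for d in glob.glob(os.path.join(req_dir, "*"))
        # same three-part condition as the changed line in the hunk above
        if os.path.isdir(d) and len(glob.glob(os.path.join(d, "*"))) > 0 and "__pycache__" not in d
    )


# Throwaway layout: one real package folder, one cache folder, one empty folder.
with tempfile.TemporaryDirectory() as root:
    for name in ("app", "__pycache__", "empty"):
        os.makedirs(os.path.join(root, name))
    open(os.path.join(root, "app", "base.txt"), "w").close()
    open(os.path.join(root, "__pycache__", "base.txt"), "w").close()
    found = [os.path.basename(d) for d in find_requirement_dirs(root)]

print(found)
```

Note that the substring check applies to the whole path, so a `req_dir` that itself lives under a `__pycache__` directory would exclude everything.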
2 changes: 1 addition & 1 deletion .azure/app-cloud-e2e.yml
@@ -52,7 +52,7 @@ jobs:
- job: App_cloud_e2e_testing
pool: azure-cpus
container:
image: mcr.microsoft.com/playwright/python:v1.27.1-focal
image: mcr.microsoft.com/playwright/python:v1.28.0-focal
options: "--shm-size=4gb"
strategy:
matrix:
2 changes: 1 addition & 1 deletion .azure/hpu-tests.yml
@@ -40,7 +40,7 @@ jobs:
cancelTimeoutInMinutes: "2"
pool: intel-hpus
container:
image: "vault.habana.ai/gaudi-docker/1.7.0/ubuntu20.04/habanalabs/pytorch-installer-1.12.0:latest"
image: "vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/pytorch-installer-1.13.0:latest"
options: "--runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --shm-size=4g -v /usr/bin/docker:/tmp/docker:ro"
workspace:
clean: all
2 changes: 1 addition & 1 deletion .github/actions/pkg-publish/action.yml
@@ -30,7 +30,7 @@ runs:
if: inputs.pypi-test-token != ''
with:
user: __token__
password: ${{ secrets.test_pypi_token_lai }}
password: ${{ inputs.pypi-test-token }}
repository_url: https://test.pypi.org/legacy/
packages_dir: pypi/
verbose: true
2 changes: 1 addition & 1 deletion .github/workflows/ci-app-tests.yml
@@ -94,7 +94,7 @@ jobs:

- name: Adjust tests
if: ${{ matrix.pkg-name == 'lightning' }}
run: python .actions/assistant.py copy_replace_imports --source_dir="./tests" --source_import="lightning_app" --target_import="lightning.app"
run: python .actions/assistant.py copy_replace_imports --source_dir="./tests" --source_import="lightning_app,lightning_lite,pytorch_lightning" --target_import="lightning.app,lightning.lite,lightning.pytorch"
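The updated command passes comma-separated lists, which suggests that each source import is paired with the target at the same position. As a rough illustration only, the following sketch shows one way such a pairing could be applied to a piece of source text. `replace_imports` is a hypothetical function, not the implementation in `.actions/assistant.py`, whose behavior may differ.

```python
import re


def replace_imports(code: str, source_imports: str, target_imports: str) -> str:
    """Rewrite package names in `code`, pairing comma-separated sources with targets by position."""
    sources = source_imports.split(",")
    targets = target_imports.split(",")
    if len(sources) != len(targets):
        raise ValueError("source and target lists must have the same length")
    for src, dst in zip(sources, targets):
        # \b keeps e.g. `my_lightning_app` untouched, because `_` is a word character
        code = re.sub(rf"\b{re.escape(src)}\b", dst, code)
    return code


original = (
    "from lightning_app.storage import Mount\n"
    "import pytorch_lightning as pl\n"
    "from lightning_lite import LightningLite\n"
)
converted = replace_imports(
    original,
    source_imports="lightning_app,lightning_lite,pytorch_lightning",
    target_imports="lightning.app,lightning.lite,lightning.pytorch",
)
print(converted)
```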

- name: Adjust examples
if: ${{ matrix.pkg-name != 'lightning' }}
1 change: 1 addition & 0 deletions .github/workflows/legacy-checkpoints.yml
@@ -72,6 +72,7 @@ jobs:
working-directory: ./
env:
PACKAGE_NAME: pytorch
FREEZE_REQUIREMENTS: 1
run: |
pip install . -f https://download.pytorch.org/whl/cpu/torch_stable.html
pip list
9 changes: 3 additions & 6 deletions .github/workflows/release-pypi.yml
@@ -11,9 +11,6 @@ defaults:
run:
shell: bash

env:
PUBLISH: ${{ startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release' }}

jobs:
init:
runs-on: ubuntu-20.04
@@ -184,7 +181,7 @@ jobs:

publish-packages:
runs-on: ubuntu-20.04
needs: waiting
needs: [build-packages, waiting]
if: startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release'
steps:
- uses: actions/checkout@v3
@@ -215,8 +212,8 @@ jobs:
needs: [build-packages]
uses: ./.github/workflows/legacy-checkpoints.yml
with:
push_to_s3: ${{ env.PUBLISH }}
create_pr: ${{ env.PUBLISH }}
push_to_s3: ${{ startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release' }}
create_pr: ${{ startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release' }}
secrets:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_KEY_ID: ${{ secrets.AWS_SECRET_KEY_ID }}
4 changes: 2 additions & 2 deletions dockers/ci-runner-hpu/Dockerfile
@@ -16,8 +16,8 @@
# gaudi-docker-agent:latest

ARG DIST="latest"
ARG GAUDI_VERSION="1.7.0"
ARG PYTORCH_INSTALLER_VERSION="1.12.0"
ARG GAUDI_VERSION="1.7.1"
ARG PYTORCH_INSTALLER_VERSION="1.13.0"
FROM vault.habana.ai/gaudi-docker/${GAUDI_VERSION}/ubuntu20.04/habanalabs/pytorch-installer-${PYTORCH_INSTALLER_VERSION}:${DIST}

LABEL maintainer="https://vault.habana.ai/"
1 change: 0 additions & 1 deletion docs/source-app/api_reference/runners.rst
@@ -18,5 +18,4 @@ ______________
:template: classtemplate.rst

~cloud.CloudRuntime
~singleprocess.SingleProcessRuntime
~multiprocess.MultiProcessRuntime
9 changes: 9 additions & 0 deletions docs/source-app/api_reference/storage.rst
@@ -20,6 +20,7 @@ ______________
~path.Path
~drive.Drive
~payload.Payload
~mount.Mount

----

@@ -56,6 +57,14 @@ Learn more about Storage
:height: 180
:tag: Intermediate

.. displayitem::
:header: The Mount Object.
:description: Mount an AWS S3 Bucket When Running on the Cloud.
:col_css: col-md-4
:button_link: ../workflows/mount_aws_s3_bucket.html
:height: 180
:tag: Intermediate

.. raw:: html

</div>
26 changes: 25 additions & 1 deletion docs/source-app/api_references.rst
@@ -32,11 +32,20 @@ ___________________
:nosignatures:
:template: classtemplate_no_index.rst

~database.client.DatabaseClient
~database.server.Database
~python.popen.PopenPythonScript
~python.tracer.TracerPythonScript
~training.LightningTrainerScript
~serve.gradio.ServeGradio
~serve.serve.ModelInferenceAPI
~serve.python_server.PythonServer
~serve.streamlit.ServeStreamlit
~multi_node.base.MultiNode
~multi_node.lite.LiteMultiNode
~multi_node.pytorch_spawn.PyTorchSpawnMultiNode
~multi_node.trainer.LightningTrainerMultiNode
~auto_scaler.AutoScaler

----

@@ -71,6 +80,7 @@ _______
~path.Path
~drive.Drive
~payload.Payload
~mount.Mount

Learn more about :ref:`Storage <storage>`.

@@ -87,5 +97,19 @@ _______
:template: classtemplate_no_index.rst

~cloud.CloudRuntime
~singleprocess.SingleProcessRuntime
~multiprocess.MultiProcessRuntime

----

lightning_app.utilities.packaging
_________________________________

.. currentmodule:: lightning_app.utilities.packaging

.. autosummary::
:toctree: generated/
:nosignatures:
:template: classtemplate_no_index.rst

~cloud_compute.CloudCompute
~build_config.BuildConfig
7 changes: 7 additions & 0 deletions docs/source-app/glossary/index.rst
@@ -112,6 +112,13 @@ Glossary
:button_link: ../core_api/lightning_app/index.html
:height: 100

.. displayitem::
:header: Mounts
:description: Mount Cloud Data
:col_css: col-md-6
:button_link: mount.html
:height: 180

.. displayitem::
:header: Sharing Components
:description: Let's create an ecosystem altogether
1 change: 1 addition & 0 deletions docs/source-app/glossary/mount.rst
@@ -0,0 +1 @@
.. include:: ../workflows/mount_cloud_object_store.rst
3 changes: 0 additions & 3 deletions docs/source-app/testing.rst
@@ -120,7 +120,6 @@ We provide ``application_testing`` as a helper function to get your application u
os.path.join(_PROJECT_ROOT, "examples/app_v0/app.py"),
"--blocking",
"False",
"--multiprocess",
"--open-ui",
"False",
]
@@ -129,9 +128,7 @@ First in the list for ``command_line`` is the location of your script. It is an

Next there are a couple of options you can leverage:


* ``blocking`` - Blocking is an app status that says "Do not run until I click run in the UI". For our integration test, since we are not using the UI, we are setting this to "False".
* ``multiprocess/singleprocess`` - This is the runtime your app is expected to run under.
* ``open-ui`` - We set this to false since this is the routine that opens a browser for your local execution.

Once you have your commandline ready, you will then be able to kick off the test and gather results:
8 changes: 8 additions & 0 deletions docs/source-app/workflows/index.rst
@@ -187,6 +187,14 @@ How to:
:button_link: ssh/index.html
:height: 180

.. displayitem::
:header: Mount Cloud Data
:description: Learn how Lightning Mounts are used to make the contents of a cloud object store bucket available on disk when running in the cloud.
:col_css: col-md-4
:button_link: mount_cloud_object_store.html
:height: 180



.. raw:: html

141 changes: 141 additions & 0 deletions docs/source-app/workflows/mount_cloud_object_store.rst
@@ -0,0 +1,141 @@
:orphan:

##############
Add Cloud Data
##############

**Audience:** Users who want to read files stored in a Cloud Object Bucket in an app.

******************************
Mounting Public AWS S3 Buckets
******************************

===================
Add Mount to a Work
===================

To mount data from a cloud bucket to your app compute, initialize a :class:`~lightning_app.storage.mount.Mount`
object with the source path of the s3 bucket and the absolute directory path where it should be mounted and
pass the :class:`~lightning_app.storage.mount.Mount` to the :class:`~lightning_app.utilities.packaging.cloud_compute.CloudCompute`
of the :class:`~lightning_app.core.work.LightningWork` it should be mounted on.

In this example, we will mount an S3 bucket: ``s3://ryft-public-sample-data/esRedditJson/`` to ``/content/esRedditJson/``.

.. code-block:: python

    from lightning_app import CloudCompute
    from lightning_app.storage import Mount

    self.my_work = MyWorkClass(
        cloud_compute=CloudCompute(
            mounts=Mount(
                source="s3://ryft-public-sample-data/esRedditJson/",
                mount_path="/content/esRedditJson/",
            ),
        )
    )

You can also pass multiple mounts to a single work by passing a ``List[Mount(...), ...]`` to the
``CloudCompute(mounts=...)`` argument.

.. note::

    * Mounts support up to 1 million files and 5 GB per file. Need larger mounts? Contact [email protected]
    * When adding multiple mounts, each one should have a unique ``mount_path``.
    * A maximum of 10 :class:`~lightning_app.storage.mount.Mount`\s can be added to a :class:`~lightning_app.core.work.LightningWork`.
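The constraints in the note above (at most 10 mounts per work, unique and absolute ``mount_path`` values) can be checked before building a ``CloudCompute``. The following is only a sketch under stated assumptions: ``MountSpec`` is a hypothetical stand-in dataclass rather than the real ``Mount`` class, ``check_mounts`` is not part of ``lightning_app``, treating paths that differ only by a trailing slash as duplicates is an assumption, and the second bucket name is invented.

```python
from dataclasses import dataclass
from typing import List

MAX_MOUNTS_PER_WORK = 10  # limit quoted in the note above


@dataclass(frozen=True)
class MountSpec:
    """Hypothetical stand-in for a mount: a source URI and an absolute target directory."""

    source: str
    mount_path: str


def check_mounts(mounts: List[MountSpec]) -> None:
    """Raise ValueError if the list breaks one of the documented constraints."""
    if len(mounts) > MAX_MOUNTS_PER_WORK:
        raise ValueError(f"at most {MAX_MOUNTS_PER_WORK} mounts are allowed, got {len(mounts)}")
    seen = set()
    for mount in mounts:
        if not mount.mount_path.startswith("/"):
            raise ValueError(f"mount_path must be absolute: {mount.mount_path!r}")
        # assumption: "/data" and "/data/" count as the same mount point
        key = mount.mount_path.rstrip("/")
        if key in seen:
            raise ValueError(f"duplicate mount_path: {mount.mount_path!r}")
        seen.add(key)


mounts = [
    MountSpec(source="s3://ryft-public-sample-data/esRedditJson/", mount_path="/content/esRedditJson/"),
    MountSpec(source="s3://example-bucket/other-data/", mount_path="/content/other-data/"),  # invented bucket
]
check_mounts(mounts)  # passes silently for a valid list
```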

=======================
Read Files From a Mount
=======================

Once a :class:`~lightning_app.storage.mount.Mount` object is passed to :class:`~lightning_app.utilities.packaging.cloud_compute.CloudCompute`,
you can access, list, or read any file from the mount under the specified ``mount_path``, just like you would if it
was on your local machine.

Assuming your ``mount_path`` is ``"/content/esRedditJson/"`` you can do the following:

----------
Read Files
----------

.. code-block:: python

    with open("/content/esRedditJson/esRedditJson1", "r") as f:
        some_data = f.read()

    # do something with "some_data"...

----------
List Files
----------

.. code-block:: python

    import os

    files = os.listdir("/content/esRedditJson/")

--------------------
See the Full Example
--------------------

.. code-block:: python
    :emphasize-lines: 10,15

    import os

    import lightning as L
    from lightning_app import CloudCompute
    from lightning_app.storage import Mount

    class ReadMount(L.LightningWork):
        def run(self):
            # Print a list of files stored in the mounted S3 Bucket.
            files = os.listdir("/content/esRedditJson/")
            for file in files:
                print(file)

            # Read the contents of a particular file in the bucket "esRedditJson1"
            with open("/content/esRedditJson/esRedditJson1", "r") as f:
                some_data = f.read()
                # do something with "some_data"...

    class Flow(L.LightningFlow):
        def __init__(self):
            super().__init__()
            self.my_work = ReadMount(
                cloud_compute=CloudCompute(
                    mounts=Mount(
                        source="s3://ryft-public-sample-data/esRedditJson/",
                        mount_path="/content/esRedditJson/",
                    ),
                )
            )

        def run(self):
            self.my_work.run()

.. note::

    When running a Lightning App on your local machine, any :class:`~lightning_app.utilities.packaging.cloud_compute.CloudCompute`
    configuration (including a :class:`~lightning_app.storage.mount.Mount`) is ignored at runtime. If you need access to
    these files on your local disk, you should download a copy of them to your machine.

.. note::

    Mounted files from an S3 bucket are ``read-only``. Any modifications, additions, or deletions
    to files in the mounted directory will not be reflected in the cloud object store.

----

**********************************************
Mounting Private AWS S3 Buckets - Coming Soon!
**********************************************

We'll let you know when this feature is ready!

----

************************************************
Mounting Google Cloud GCS Buckets - Coming Soon!
************************************************

We'll let you know when this feature is ready!
1 change: 0 additions & 1 deletion docs/source-pytorch/accelerators/hpu_basic.rst
@@ -113,4 +113,3 @@ Known limitations
-----------------

* `Habana dataloader <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#habana-data-loader>`__ is not supported.
* :func:`torch.inference_mode` is not supported
2 changes: 1 addition & 1 deletion docs/source-pytorch/accelerators/hpu_intermediate.rst
@@ -96,4 +96,4 @@ The below snippet shows how DeviceStatsMonitor can be enabled.
device_stats = DeviceStatsMonitor()
trainer = Trainer(accelerator="hpu", callbacks=[device_stats])

For more details, please refer to `Memory Stats APIs <https://docs.habana.ai/en/v1.5.0/PyTorch/PyTorch_User_Guide/Python_Packages.html#memory-stats-apis>`__.
For more details, please refer to `Memory Stats APIs <https://docs.habana.ai/en/latest/PyTorch/PyTorch_User_Guide/Python_Packages.html#memory-stats-apis>`__.
3 changes: 1 addition & 2 deletions docs/source-pytorch/advanced/model_parallel.rst
@@ -424,10 +424,9 @@ You can customize the strategy configuration by adjusting the arguments of :clas

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy
from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload


native_fsdp = DDPFullyShardedNativeStrategy(cpu_offload=CPUOffload(offload_params=True))
native_fsdp = DDPFullyShardedNativeStrategy(cpu_offload=True)
trainer = Trainer(strategy=native_fsdp, accelerator="gpu", devices=4)


2 changes: 1 addition & 1 deletion docs/source-pytorch/clouds/cluster_intermediate_1.rst
@@ -24,7 +24,7 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa

- *MASTER_PORT* - required; has to be a free port on machine with NODE_RANK 0
- *MASTER_ADDR* - required (except for NODE_RANK 0); address of NODE_RANK 0 node
- *WORLD_SIZE* - required; how many nodes are in the cluster
- *WORLD_SIZE* - required; the total number of GPUs/processes that you will use
- *NODE_RANK* - required; id of the node in the cluster
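To make the revised *WORLD_SIZE* definition concrete, here is a minimal sketch that derives the four variables for one node. It assumes every node runs the same number of processes; `node_env` is a hypothetical helper, and the address and port values are placeholders rather than anything prescribed by the docs.

```python
def world_size(num_nodes: int, processes_per_node: int) -> int:
    """Total number of processes across the whole cluster, not the number of nodes."""
    return num_nodes * processes_per_node


def node_env(node_rank: int, num_nodes: int, processes_per_node: int, master_addr: str, master_port: int) -> dict:
    """Build the environment variables listed above for a single node."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank must be in [0, {num_nodes}), got {node_rank}")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size(num_nodes, processes_per_node)),
        "NODE_RANK": str(node_rank),
    }


# Two nodes with four GPUs each: WORLD_SIZE is 8, whereas counting nodes would give 2.
env = node_env(node_rank=1, num_nodes=2, processes_per_node=4, master_addr="10.0.0.1", master_port=29500)
print(env["WORLD_SIZE"])
```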

.. _training_script_setup:
2 changes: 1 addition & 1 deletion examples/app_dag/requirements.txt
@@ -1,2 +1,2 @@
sklearn
scikit-learn
pandas