Preliminary support for on-prem transfers with HDFS and POSIX path support #735

Merged · 35 commits · Mar 13, 2023

Commits
42aa7a9  File system interface (ShishirPatil, Nov 16, 2022)
3b1e8b2  Moving away from mnt (ShishirPatil, Nov 22, 2022)
a5ee402  Iterator return object of type file (ShishirPatil, Nov 22, 2022)
a103950  Merge branch 'dev/gateway' into 'dev/shishir/on-prem' (ShishirPatil, Nov 22, 2022)
4b13c6d  [onprem] HDFS Interface (#684) (HaileyJang, Dec 7, 2022)
e9d333c  Revert "[onprem] HDFS Interface" (#716) (ShishirPatil, Dec 7, 2022)
ee4fa78  [onprem] HDFS Interface (#719) (HaileyJang, Dec 30, 2022)
a1aa397  Fix test_hdfs workflow to terminate the cluster whenever (HaileyJang, Dec 30, 2022)
295b37a  Upstream POSIX file system interface (ShishirPatil, Jan 2, 2023)
353964e  [On-prem] Fix Test_HDFS (#729) (HaileyJang, Jan 2, 2023)
8357a77  Catch multipart for hdfs (ShishirPatil, Jan 2, 2023)
6035bd6  Fix: test_hdfs bugs (ShishirPatil, Jan 6, 2023)
dedb577  Use gateways (ShishirPatil, Jan 6, 2023)
2bbd728  Fix: File system interface (ShishirPatil, Jan 6, 2023)
b16267e  port from aws to gcp (ShishirPatil, Jan 12, 2023)
75aae41  List object fixed (HaileyJang, Jan 17, 2023)
987b5b3  Fix: test hdfs for aws, posix file system (ShishirPatil, Jan 21, 2023)
131c5eb  Fix Merge to main merge conflict (HaileyJang, Jan 27, 2023)
f5a3241  Adding local, nfs, hdfs portion to CLI (HaileyJang, Jan 27, 2023)
c21bd05  Fix dependency issues (HaileyJang, Jan 27, 2023)
9174d82  Minor mods: Resolving comments for PR (ShishirPatil, Jan 30, 2023)
c99ca36  Merge branch 'main' into dev/shishir/on-prem (ShishirPatil, Jan 30, 2023)
5601cf3  Fix upto gateway creation (HaileyJang, Feb 1, 2023)
0497e5d  Update skyplane/obj_store/object_store_interface.py (HaileyJang, Feb 1, 2023)
f9042e3  Merge branch 'main' into dev/shishir/on-prem (ShishirPatil, Feb 14, 2023)
a00829c  fix: lint (ShishirPatil, Feb 14, 2023)
b8d766c  End-to-End Integration for HDFS (#758) (HaileyJang, Feb 24, 2023)
b1c0bfe  Fix legacy codes from previous runs (HaileyJang, Feb 28, 2023)
a66073c  Fix all the comments (HaileyJang, Mar 9, 2023)
9b9c445  Fix black pytype issue (HaileyJang, Mar 10, 2023)
e05c323  Update Pyproject for dataproc (HaileyJang, Mar 10, 2023)
21e834d  Add readme (HaileyJang, Mar 13, 2023)
4cadac8  Delete hostname (HaileyJang, Mar 13, 2023)
b039b35  Fix merge conflict with main (HaileyJang, Mar 13, 2023)
f3052d0  GCP test should pass now: (HaileyJang, Mar 13, 2023)
5 changes: 5 additions & 0 deletions .github/workflows/pytest.yml
@@ -87,6 +87,8 @@ jobs:
         run: |
           poetry install -E gateway -E solver -E aws -E azure -E gcp
           poetry run pip install -r requirements-dev.txt
+          poetry run sudo apt install default-jdk
+          poetry run wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz -P /tmp && tar -xzf /tmp/hadoop-3.3.0.tar.gz -C /tmp && sudo mv /tmp/hadoop-3.3.0 /usr/local/hadoop && rm /tmp/hadoop-3.3.0.tar.gz
         if: steps.cache.outputs.cache-hit != 'true'
       - name: Run cloud tests
         env:
@@ -148,10 +150,13 @@ jobs:
         run: |
           poetry config virtualenvs.in-project false
           poetry config virtualenvs.path ~/.virtualenvs
+
       - name: Install Dependencies
         run: |
           poetry install -E gateway -E solver -E aws -E azure -E gcp
           poetry run pip install -r requirements-dev.txt
+          poetry run sudo apt install default-jdk
+          poetry run wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz -P /tmp && tar -xzf /tmp/hadoop-3.3.0.tar.gz -C /tmp && sudo mv /tmp/hadoop-3.3.0 /usr/local/hadoop && rm /tmp/hadoop-3.3.0.tar.gz
         if: steps.cache.outputs.cache-hit != 'true'
       - id: 'auth'
         uses: 'google-github-actions/auth@v0'
22 changes: 20 additions & 2 deletions Dockerfile
@@ -2,11 +2,25 @@
 FROM python:3.11-slim
 
 # install apt packages
-RUN --mount=type=cache,target=/var/cache/apt apt update \
-    && apt-get install --no-install-recommends -y curl ca-certificates stunnel4 gcc libc-dev \
+RUN --mount=type=cache,target=/var/cache/apt apt-get update \
+    && apt-get install --no-install-recommends -y curl ca-certificates stunnel4 gcc libc-dev wget \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
+#install HDFS Onprem Packages
+RUN apt-get update && \
+    apt-get install -y openjdk-11-jdk && \
+    apt-get clean
+
+ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
+
+RUN wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz -P /tmp \
+    && tar -xzf /tmp/hadoop-3.3.0.tar.gz -C /tmp \
+    && mv /tmp/hadoop-3.3.0 /usr/local/hadoop \
+    && rm /tmp/hadoop-3.3.0.tar.gz
+
+ENV HADOOP_HOME /usr/local/hadoop
+
 # configure stunnel
 RUN mkdir -p /etc/stunnel \
     && openssl genrsa -out key.pem 2048 \
@@ -31,6 +45,10 @@ RUN (echo 'net.ipv4.ip_local_port_range = 12000 65535' >> /etc/sysctl.conf) \
 
 # install gateway
 COPY scripts/requirements-gateway.txt /tmp/requirements-gateway.txt
+
+#Onprem: Install Hostname Resolution for HDFS
+COPY scripts/on_prem/hostname /tmp/hostname
+
 RUN --mount=type=cache,target=/root/.cache/pip pip3 install --no-cache-dir -r /tmp/requirements-gateway.txt && rm -r /tmp/requirements-gateway.txt
 
 WORKDIR /pkg
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -24,6 +24,8 @@ Contents
    quickstart
    configure
    architecture
+   performance_stats_collection
+   on-prem_setup
    faq
 
 .. toctree::
30 changes: 30 additions & 0 deletions docs/installation.rst
@@ -57,3 +57,33 @@
 
     ---> Setup cloud provider connectors:
     $ skyplane init
+
+
+Transferring Data
+-------------------
+
+We're ready to use Skyplane! Let's use `skyplane cp` to copy files from AWS to GCP:
+
+.. code-block:: bash
+
+    ---> 🎸 Ready to rock and roll! Copy some files:
+    $ skyplane cp -r s3://... gs://...
+
+To transfer only new objects, you can instead use `skyplane sync`:
+
+.. code-block:: bash
+
+    ---> Copy only diff
+    $ skyplane sync s3://... gs://...
+
+To transfer from a local disk or an HDFS cluster, you can use `skyplane cp` as well:
+
+(Note: on-prem transfers require additional setup. Please see the `On-Prem` section for details.)
+
+.. code-block:: bash
+
+    ---> Copy from local disk
+    $ skyplane cp -r /path/to/local/file gs://...
+
+    ---> Copy from HDFS
+    $ skyplane cp -r hdfs://... gs://...
41 changes: 41 additions & 0 deletions docs/on-prem_setup.md
@@ -0,0 +1,41 @@
# On-Prem Transfers

Currently, Skyplane supports on-prem transfers from local disk, NFS, and HDFS to cloud storage.

## HDFS Setup

Skyplane uses PyArrow and libhdfs for its HDFS connections.

**Transfers from HDFS require Hadoop and Java to be installed beforehand.**
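
On a Debian-based machine, the installation can mirror what this PR's CI workflow does (a sketch based on the workflow above; adjust the Hadoop version and paths to your environment):

```bash
# Mirrors the CI workflow in this PR: install a JDK, then unpack Hadoop 3.3.0
sudo apt install default-jdk
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz -P /tmp
tar -xzf /tmp/hadoop-3.3.0.tar.gz -C /tmp
sudo mv /tmp/hadoop-3.3.0 /usr/local/hadoop
rm /tmp/hadoop-3.3.0.tar.gz
```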

* Please refer to [Pyarrow HDFS documentation](https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs) for necessary environment variable setup.
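
As a hedged example, a typical environment setup looks like the following. The `JAVA_HOME` and `HADOOP_HOME` paths mirror this PR's Dockerfile, and `namenode.example.com`/port `8020` in the sanity check are placeholder values for your cluster:

```bash
# Paths mirror this PR's Dockerfile; adjust to your installation.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# libhdfs.so ships under $HADOOP_HOME/lib/native
export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
# PyArrow needs the Hadoop CLASSPATH at runtime
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)

# Quick sanity check that PyArrow can reach the cluster
# ("namenode.example.com" and 8020 are placeholders):
python -c 'from pyarrow import fs; print(fs.HadoopFileSystem("namenode.example.com", 8020))'
```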

Note that the cluster needs to communicate with the Skyplane gateways. Update the clusters' inbound firewall rules to allow traffic from Skyplane.
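
For example, on GCP an inbound rule along these lines would admit gateway traffic (a hypothetical sketch: the rule name and source range are placeholders, and tcp:8020/tcp:9866 are the stock HDFS namenode and datanode ports, so substitute your cluster's actual ones):

```bash
# Hypothetical rule; replace the source range with your Skyplane
# gateway VMs' IPs and the ports with your cluster's actual ports.
gcloud compute firewall-rules create allow-skyplane-gateways \
    --direction=INGRESS --action=ALLOW \
    --rules=tcp:8020,tcp:9866 \
    --source-ranges=203.0.113.0/24
```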

### Resolving HDFS Datanodes

A file called `hostname` is located in the `./skyplane/scripts/on_prem` folder. Skyplane uses it to resolve each datanode's hostname (or internal IP) to an externally reachable address.

* For each datanode, copy its external IP and its hostname (or internal IP) into the file, one datanode per line.

* After all the required entries are added, each line of the `hostname` file should look like this:
```text
<External IP> <Datanodes' Hostname or Internal IP>
```
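
For instance, a cluster with two datanodes might yield entries like these (both addresses are hypothetical):

```text
34.72.10.5 datanode-1.cluster.internal
34.72.11.8 datanode-2.cluster.internal
```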



### Testing the Transfer

Now you can test the transfer by running `skyplane cp` from a local disk or an HDFS cluster to any cloud storage.


```bash
---> Copy from local disk
$ skyplane cp -r /path/to/local/file gs://...

---> Copy from HDFS
$ skyplane cp -r hdfs://... gs://...
```