Need update spyt and spyt on yt cluster when use custom docker image #39

Open
ykuc opened this issue Oct 8, 2024 · 1 comment
ykuc commented Oct 8, 2024

Why do we need to update spyt on the yt cluster if we use a custom image?

If the spyt version is not present on the yt cluster, I get an error saying that the spyt version is not in the cluster.

Docker image

# Dockerfile
FROM mirror.gcr.io/ubuntu:focal

USER root

RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa

RUN apt-get update && DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y \
  containerd \
  curl \
  less \
  gdb \
  lsof \
  strace \
  telnet \
  tini \
  zstd \
  unzip \
  dnsutils \
  iputils-ping \
  lsb-release \
  openjdk-11-jdk \
  libidn11-dev \
  python3.12 \
  python3-pip \
  python3.12-dev \
  python3.12-distutils

RUN ln -s /usr/lib/jvm/java-11-openjdk-amd64 /opt/jdk11

RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1

COPY ./requirements.txt /requirements.txt

# Ensure pip is installed correctly
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
    && python3.12 get-pip.py \
    && python3.12 -m pip install --upgrade pip setuptools wheel \
    && rm get-pip.py


RUN python3.12 -m pip install -r requirements.txt

# requirements.txt
ytsaurus-client==0.13.18
ytsaurus-spyt==2.3.0
pyspark==3.3.4

Create a cluster with a docker image

spark-launch-yt \
 --spark-cluster-version 2.3.0 \
 --params '{operation_spec={tasks={history={docker_image="MY_DOCKER_IMAGE"};master={docker_image="MY_DOCKER_IMAGE"};workers={docker_image="MY_DOCKER_IMAGE"}}}}'
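
The `--params` value repeats the same `docker_image` setting for each of the three tasks. A minimal Python sketch that generates that string, assuming the task names shown in the command above; the helper name `build_docker_image_params` is hypothetical and not part of SPYT:

```python
def build_docker_image_params(image, tasks=("history", "master", "workers")):
    """Build the text passed to spark-launch-yt via --params.

    Hypothetical helper: it only reproduces the string layout used in the
    command above and does not validate the image name.
    """
    # One `task={docker_image="..."}` entry per task, separated by ';'.
    per_task = ";".join(
        task + '={docker_image="' + image + '"}' for task in tasks
    )
    return "{operation_spec={tasks={" + per_task + "}}}"


if __name__ == "__main__":
    print(build_docker_image_params("MY_DOCKER_IMAGE"))
```

The output can be pasted into the `--params` argument in single quotes, as in the command above.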

alextokarew (Collaborator) commented

The main reason is that, for now, we have a job setup procedure that prepares the environment inside a job container (spyt-package/src/main/bash/setup-spyt-env.sh). This script takes the spark and spyt archives from cypress and extracts them inside the container during job startup. We are considering refactoring it to decrease the startup time for vanilla operations.
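
Because the setup script reads the archives from cypress, a launch can fail when the requested version has not been uploaded there. A hedged sketch of a pre-flight check follows. The cypress path `//home/spark/spyt/releases/<version>` is an assumption that is not stated in this thread, and `spyt_release_exists` is a hypothetical helper. It accepts any client object with an `exists(path)` method, such as `yt.wrapper.YtClient` from `ytsaurus-client`.

```python
# ASSUMPTION: this default root is a guess at where SPYT release archives
# are stored in cypress; adjust it to match your cluster.
DEFAULT_RELEASES_ROOT = "//home/spark/spyt/releases"


def spyt_release_path(version, root=DEFAULT_RELEASES_ROOT):
    """Return the cypress path assumed to hold the given SPYT release."""
    return root + "/" + version


def spyt_release_exists(client, version, root=DEFAULT_RELEASES_ROOT):
    """Return True if the assumed release path for `version` exists.

    `client` is any object exposing exists(path) -> bool.
    """
    return bool(client.exists(spyt_release_path(version, root)))
```

With `ytsaurus-client` installed and a reachable cluster, calling `spyt_release_exists(yt.wrapper.YtClient(proxy="..."), "2.3.0")` would run the check against real cypress; the proxy value is a placeholder.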

You can easily install a required version of SPYT on the cluster using either our k8s operator (https://github.com/ytsaurus/ytsaurus-k8s-operator/blob/c439d7c703365a3d87827cc0bd3f0ac368eaa05d/config/samples/0.9.1/cluster_v1_demo.yaml#L113) or a spyt release docker image:

docker run --rm --network=host -e YT_PROXY="yt proxy address" -e YT_USER="yt user" -e YT_TOKEN="yt token" -e EXTRA_SPARK_VERSIONS="3.2.2 3.3.4" ghcr.io/ytsaurus/spyt:2.3.0
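
For scripted installs, the same command can be assembled from variables. A minimal Python sketch, assuming only the environment variables and image name shown in the command above; `build_spyt_install_command` is a hypothetical helper, and the proxy, user, and token values are placeholders:

```python
import shlex


def build_spyt_install_command(proxy, user, token, spyt_version="2.3.0",
                               extra_spark_versions=("3.2.2", "3.3.4")):
    """Return the docker run argument list for the SPYT release image."""
    return [
        "docker", "run", "--rm", "--network=host",
        "-e", "YT_PROXY=" + proxy,
        "-e", "YT_USER=" + user,
        "-e", "YT_TOKEN=" + token,
        # Space-separated list, as in the original command.
        "-e", "EXTRA_SPARK_VERSIONS=" + " ".join(extra_spark_versions),
        "ghcr.io/ytsaurus/spyt:" + spyt_version,
    ]


if __name__ == "__main__":
    cmd = build_spyt_install_command("my-proxy.example:80", "my_user", "my_token")
    # Printed for inspection only (note that this shows the token);
    # pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the space-separated `EXTRA_SPARK_VERSIONS` value.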

The source of the spyt docker image is at tools/release/spyt_image/Dockerfile. The image contains a spyt distribution archive, which is uploaded to cypress. The spark distribution is downloaded from the official Spark site.

The release distributions are rather stable, so you don't have to include them in your custom docker images.
