diff --git a/rsts/community/troubleshoot.rst b/rsts/community/troubleshoot.rst
index af39507e32..9821d41017 100644
--- a/rsts/community/troubleshoot.rst
+++ b/rsts/community/troubleshoot.rst
@@ -1,126 +1,135 @@
 .. _troubleshoot:
 
+=====================
 Troubleshooting Guide
----------------------
+=====================
 
 .. tags:: Troubleshoot, Basic
 
-.. admonition:: Why did we craft this guide?
+The content in this section will help Flyte users isolate the most probable causes for some of the common issues that could arise while getting started with the project.
 
-   To help streamline your onboarding experience as much as possible, and sort out common issues.
+Before getting started, collect the following information from the underlying infrastructure:
 
-Here are a couple of techniques we believe could help get you up and running in no time!
+- Capture the ``Status`` column from the output of:
 
-Troubles With ``flytectl sandbox start``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. prompt:: bash $
 
-- The process hangs at ``Waiting for Flyte to become ready...`` for a while; OR
-- It ends with a message ``Timed out while waiting for the datacatalog rollout to be created``.
+   $ kubectl describe pod -n
 
-How Do I Debug?
-"""""""""""""""
+Where will typically correspond to the node execution string that you can find in the UI.
 
-- Sandbox is a Docker container that runs Kubernetes and Flyte in it. So you can simply ``exec`` into it;
+- Pay close attention to the `Events` section in the output.
+- Also, collect the logs from the Pod:
 
 .. prompt:: bash $
 
-   docker ps
+   $ kubectl logs pods -n
 
-.. code-block::
+Where will typically correspond to the Flyte -, e.g. flytesnacks-development.
-   CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS         PORTS                                                                                                           NAMES
-   d3ab7e4cb17c   cr.flyte.org/flyteorg/flyte-sandbox:dind   "tini flyte-entrypoi…"   7 minutes ago   Up 7 minutes   127.0.0.1:30081-30082->30081-30082/tcp, 127.0.0.1:30084->30084/tcp, 2375-2376/tcp, 127.0.0.1:30086->30086/tcp   flyte-sandbox
+Depending on the contents of the logs or the `Events`, you can try different things:
 
-.. prompt:: bash $
+Debugging common execution errors
+----------------------------------
 
-   docker exec -it bash
+``message: '0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.'``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-and run: ::
+This issue is more common on macOS devices. Make sure that your Docker daemon has been allocated a minimum of 4 CPU cores and 3 GB of RAM.
 
-   kubectl get pods -n flyte
+``terminated with exit code (137). Reason [OOMKilled]``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-You can check on the pending pods and perform a detailed check as to why a pod is failing by running: ::
+- For single-binary environments deployed with the Helm chart, make sure you are using `the most recent charts `_
 
-   kubectl describe po -n flyte
+- For EKS deployments, you can adjust resource limits and requests in the `inline `_ section of the ``eks-production.yaml`` file. Example:
 
-- Also, you can use this command to simply export this variable to use local kubectl::
+.. code-block:: yaml
 
-   export KUBECONFIG=$HOME/.flyte/k3s/k3s.yaml
+   inline:
+     task_resources:
+       defaults:
+         cpu: 100m
+         memory: 100Mi
+         storage: 100Mi
+       limits:
+         memory: 1Gi
 
-- If you would like to reclaim disk space, run: ::
+- Also, the default container resource limits can be overridden from the task itself:
 
-   docker system prune [OPTIONS]
+.. code-block:: python
 
-- Increase mem/CPU available for Docker.
+   from flytekit import Resources, task
+   @task(limits=Resources(mem="256Mi"))
+   def your_task(...
 
+``Error: ImagePullBackOff``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Troubles With ``flyte sandbox`` Log Viewing
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- If your environment requires a network proxy, use the ``--env`` option when starting the sandbox and pass the proxy configuration:
 
-- When testing locally using the ``flyte sandbox`` command, one way to view the logs is using the ``Kubernetes Logs (User)`` option on the FlyteConsole.
-- This takes you to the Kubernetes dashboard which requires a login.
+.. prompt:: bash $
 
-::
+   $ flytectl demo start --env HTTP_PROXY=
 
-   kind: Deployment
-   apiVersion: apps/v1
-   metadata:
-     name: kubernetes-dashboard
-     namespace: kubernetes-dashboard
-   spec:
-     template:
-       spec:
-         containers:
-           - name: kubernetes-dashboard
-             args:
-               - --namespace=kubernetes-dashboard
-               - --enable-insecure-login
-               - --enable-skip-login
-               - --disable-settings-authorizer
+- If you're building a custom Docker image, make sure to use a tag other than ``latest``. Otherwise, Kubernetes defaults the image pull policy to ``Always`` instead of ``IfNotPresent``, forcing an image pull with every Pod deployment.
 
-.. note::
+Issues running workloads
+-------------------------
 
-   There is a ``skip`` button that takes you straight to the logs without logging in.
+``OPENSSL_internal:WRONG_VERSION_NUMBER``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Troubles With Flytectl Commands Within Proxy Settings
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- For ``flyte-binary``: make sure that the endpoint name you have set in your ``config.yaml`` file is included in the DNS names of the installed SSL certificate (be it self-signed or issued by a Certificate Authority).
+- For ``sandbox``: make sure the ``FLYTECTL_CONFIG`` environment variable has the correct value by running:
 
-- Flytectl uses gRPC APIs of FlyteAdmin to administer Flyte resources and in the case of proxy settings, it uses an additional ``CONNECT`` handshake at the gRPC layer to perform the same. Additional info is available on this `gRPC proxy documentation `__ page.
+.. prompt:: bash $
 
-- In the Windows environment, it has been noticed that the ``NO_PROXY`` variable doesn't work to bypass the proxy settings. This `GRPC issue `__ provides additional details, though it doesn't seem to have been tested on Windows yet. To bypass this issue, unset both ``HTTP_PROXY`` and ``HTTPS_PROXY`` variables.
+   $ export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml
 
-Troubles With Flytectl Commands With Cloudflare DNS
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``ModuleNotFoundError``
+^^^^^^^^^^^^^^^^^^^^^^^
 
-- Flytectl produces permission errors with Cloudflare DNS endpoints
-- Cloudflare instance proxies by default the requests and filters out gRPC.
-- **To fix this**:
-   - Enable gRPC in the network tab; or
-   - Turn off the proxy.
+- If you're using a custom container image built with Docker, make sure your ``Dockerfile`` is located at the same level as the ``flyte`` directory and that there is an empty ``__init__.py`` file in your project's folder:
 
-Troubles With Flytectl Commands With Auth Enabled
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. prompt:: bash $
+
+   myflyteapp
+   ├── Dockerfile
+   ├── docker_build_and_tag.sh
+   ├── flyte
+   │   ├── __init__.py
+   │   └── workflows
+   │       ├── __init__.py
+   │       └── example.py
+   └── requirements.txt
 
-- Flytectl commands use OpenID connect if auth is enabled in the Flyte environment
-- It opens an ``HTTP`` server port on localhost:53593. It has a callback endpoint for the OpenID connect server to call into for the response.
-   - If the callback server call fails, please check if flytectl failed to run the server.
-   - Verify that you have an entry for localhost in your ``/etc/hosts`` file.
-   - It could also mean that the callback took longer than the default 15 secs, and the flytectl wait deadline expired.
+``An error occurred (AccessDenied) when calling the PutObject operation`` in an EKS deployment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Troubles With Inconsistent Names for Pods and Downstream Resources
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- Make sure that the Kubernetes service account Flyte is using has the annotation that refers to the IAM role it is connected to:
 
-- Don't rely on the name of a Flyte node to always match the name of its corresponding Kubernetes pod or downstream resource
-- Flyte uses the format ``executionid-node-id-attempt`` from the node to assign a name to a Kubernetes pod or downstream resource.
-- But if this is an invalid name for a Kubernetes pod, Flyte assigns a valid name of random characters instead.
+.. prompt:: bash $
 
-Troubles with handling large responses in ``FlyteRemote.sync``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+   $ kubectl describe sa -n
 
-- ``Received message larger than max (xxx vs. 4194304)`` usually crops up when the message size is too large.
-- To fix this, edit the ``flyte-admin-base-config`` config map using the command ``kubectl edit cm flyte-admin-base-config -n flyte`` to increase the ``maxMessageSizeBytes`` value.
+Example output:
+
+.. prompt:: bash $
+
+   Name:
+   Namespace:           flyte
+   Labels:              app.kubernetes.io/managed-by=eksctl
+   Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam:::role/flyte-system-role
+   Image pull secrets:
+   Mountable secrets:
+   Tokens:
+   Events:
+
+- Otherwise, obtain your IAM role's ARN and manually annotate the service account:
+
+.. prompt:: bash $
 
+   $ kubectl annotate serviceaccount -n eks.amazonaws.com/role-arn=arn:aws:iam::xxxx:role/
 
-I Still Need Help!
-^^^^^^^^^^^^^^^^^^
 
-Our `Slack `__ community is always available and ready to help!
+- Refer to these community-maintained `guides `_ for further information about Flyte deployment on EKS.
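
As a rough illustration of the service account check described in this change, the Python sketch below looks for the ``eks.amazonaws.com/role-arn`` annotation in the JSON form of a service account, such as ``kubectl get sa`` prints with ``-o json``. The helper name ``role_arn_from_service_account``, the sample account ID, and the sample service account name are hypothetical and are not part of Flyte or of this patch.

```python
import json
from typing import Optional

# Annotation key taken from the example output in the troubleshooting text.
ROLE_ARN_ANNOTATION = "eks.amazonaws.com/role-arn"


def role_arn_from_service_account(sa_json: str) -> Optional[str]:
    """Return the IAM role ARN annotated on a service account, or None.

    Hypothetical helper: ``sa_json`` is assumed to be the JSON document
    describing one Kubernetes ServiceAccount object.
    """
    service_account = json.loads(sa_json)
    # "annotations" may be absent or null when nothing is annotated.
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    return annotations.get(ROLE_ARN_ANNOTATION)


# Hypothetical sample document; the account ID and names are placeholders.
sample = json.dumps(
    {
        "kind": "ServiceAccount",
        "metadata": {
            "name": "default",
            "namespace": "flyte",
            "annotations": {
                ROLE_ARN_ANNOTATION: "arn:aws:iam::123456789012:role/flyte-system-role"
            },
        },
    }
)

arn = role_arn_from_service_account(sample)
if arn is None:
    print("No IAM role annotation found; annotate the service account.")
else:
    print(f"Service account is linked to {arn}")
```

If the helper returns ``None``, the service account is missing the annotation, and the ``kubectl annotate`` step from the patch would be the next thing to try.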