Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Status: CUDA driver version is insufficient for CUDA runtime version #1234

Open
dking21st opened this issue Dec 22, 2023 · 0 comments
Open

Comments

@dking21st
Copy link

dking21st commented Dec 22, 2023

❓ Questions & Help

Using merlin tensorflow container to build a docker image but it shows an error:

2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return _run_code(code, main_globals, None,
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     exec(code, run_globals)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py", line 15, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     import merlin.models.tf as mm
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py", line 108, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.models.retrieval import (
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py", line 22, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 33, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     class ItemRetrievalTask(MultiClassClassificationTask):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 70, in ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [INFO]: sparse_operation_kit is imported
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Initialize finished, communication tool: horovod
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 491, in default_metrics
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 362, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 234, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py", line 144, in _wrap_function
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     init_method(instance, *args, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 613, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 430, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     self.total = self.add_weight("total", initializer="zeros")
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 366, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return super().add_weight(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 712, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     variable = self._add_variable_with_custom_getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py", line 489, in _add_variable_with_custom_getter
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     new_variable = getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py", line 134, in make_variable
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf1.Variable(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     raise e.with_traceback(filtered_tb) from None
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py", line 171, in __call__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf.zeros(shape, dtype)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Details

My dockerfile:

FROM --platform=linux/amd64 nvcr.io/nvidia/merlin/merlin-tensorflow:23.06 as prod

WORKDIR /ads_content

COPY ./data-airflow .
COPY ./ads/images/requirements.txt .

WORKDIR /root

RUN pip install tf2onnx==1.15.1 
RUN pip install -r /ads_content/requirements.txt
RUN pip install requests "urllib3<2"

WORKDIR /ads_content

ENTRYPOINT ["python3"]

I'm trying to deploy merlin TF model training & AWS S3 uploading job using Airflow KubernetePodOperator and Docker Image. As I'm new to docker and airflow, I'm having a good amount of trouble.
I think I kept things pretty simple with my docker file - what am I doing wrong?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant