I built a Docker image on top of the Merlin TensorFlow container, but running it fails with this error:
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return _run_code(code, main_globals, None,
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - exec(code, run_globals)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py", line 15, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - import merlin.models.tf as mm
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py", line 108, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - from merlin.models.tf.models.retrieval import (
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py", line 22, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 33, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - class ItemRetrievalTask(MultiClassClassificationTask):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 70, in ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [INFO]: sparse_operation_kit is imported
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Initialize finished, communication tool: horovod
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 491, in default_metrics
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 362, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 234, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py", line 144, in _wrap_function
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - init_method(instance, *args, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 613, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - super().__init__(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 430, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - self.total = self.add_weight("total", initializer="zeros")
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 366, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return super().add_weight(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 712, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - variable = self._add_variable_with_custom_getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py", line 489, in _add_variable_with_custom_getter
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - new_variable = getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py", line 134, in make_variable
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return tf1.Variable(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - raise e.with_traceback(filtered_tb) from None
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - File "/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py", line 171, in __call__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - return tf.zeros(shape, dtype)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
My Dockerfile:
FROM --platform=linux/amd64 nvcr.io/nvidia/merlin/merlin-tensorflow:23.06 as prod
WORKDIR /ads_content
COPY ./data-airflow .
COPY ./ads/images/requirements.txt .
WORKDIR /root
RUN pip install tf2onnx==1.15.1
RUN pip install -r /ads_content/requirements.txt
RUN pip install requests "urllib3<2"
WORKDIR /ads_content
ENTRYPOINT ["python3"]
I'm trying to deploy a Merlin TF model training and AWS S3 upload job using the Airflow KubernetesPodOperator and a Docker image. As I'm new to Docker and Airflow, I'm having a good amount of trouble.
I think I kept my Dockerfile pretty simple. What am I doing wrong?
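The traceback shows the failure happens while `merlin.models.tf` is being imported, when a Keras metric variable is created and TensorFlow calls into CUDA. One possible reading is that the pod cannot see a usable NVIDIA driver, which would point at the pod or node setup rather than the Dockerfile. Below is a minimal diagnostic sketch that could be run as the pod's script before the training job. It assumes `nvidia-smi` is exposed inside the container when GPU access is configured, which is typically but not necessarily the case. The function name `gpu_driver_report` is hypothetical.

```python
import shutil
import subprocess


def gpu_driver_report():
    """Report whether an NVIDIA driver appears to be visible in this container."""
    # Assumption: nvidia-smi is normally made available by the host's NVIDIA
    # container runtime. If it is absent, the pod may have no GPU access.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return {"driver_visible": False, "detail": "nvidia-smi not found on PATH"}
    try:
        result = subprocess.run(
            [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (subprocess.SubprocessError, OSError) as exc:
        # Covers a non-zero exit status, a timeout, or a failure to launch.
        return {"driver_visible": False, "detail": f"nvidia-smi failed: {exc}"}
    listing = result.stdout.strip()
    return {"driver_visible": bool(listing), "detail": listing or "no GPUs listed"}


if __name__ == "__main__":
    print(gpu_driver_report())
```

If this reports no visible driver, the things to check would be whether the pod requests a GPU and lands on a GPU node. If a driver version is listed, comparing it with the CUDA version the 23.06 image expects would be the next step.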