
[BUG] Error in AzureML Spark nightly test #1898

Closed
miguelgfierro opened this issue Mar 2, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@miguelgfierro
Collaborator

Description

After we upgraded the CPU VM to a higher tier (see #1897), we got an error in the Spark nightly test:

E           papermill.exceptions.PapermillExecutionError: 
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [16]":
E           ---------------------------------------------------------------------------
E           Py4JJavaError                             Traceback (most recent call last)
E           File /azureml-envs/azureml_18102efe6b97bde44cf802374de73396/lib/python3.9/site-packages/IPython/core/magics/execution.py:1318, in ExecutionMagics.time(self, line, cell, local_ns)
E              1317 try:
E           -> 1318     exec(code, glob, local_ns)
E              1319     out=None
E           
E           File <timed exec>:45
E           
E           Cell In[14], line 2, in <lambda>(test, predictions)
E                 1 rating_evaluator = {
E           ----> 2     "als": lambda test, predictions: rating_metrics_pyspark(test, predictions),
E                 3     "svd": lambda test, predictions: rating_metrics_python(test, predictions),
E                 4     "fastai": lambda test, predictions: rating_metrics_python(test, predictions)
E                 5 }
E                 8 ranking_evaluator = {
E                 9     "als": lambda test, predictions, k: ranking_metrics_pyspark(test, predictions, k),
E                10     "sar": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E              (...)
E                16     "lightgcn": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                17 }
E           
E           File /mnt/azureml/cr/j/6abc2cf9fbcc4ce985da77dc3549f875/exe/wd/examples/06_benchmarks/benchmark_utils.py:372, in rating_metrics_pyspark(test, predictions)
E               371 def rating_metrics_pyspark(test, predictions):
E           --> 372     rating_eval = SparkRatingEvaluation(test, predictions, **COL_DICT)
E               373     return {
E               374         "RMSE": rating_eval.rmse(),
E               375         "MAE": rating_eval.mae(),
E               376         "R2": rating_eval.exp_var(),
E               377         "Explained Variance": rating_eval.rsquared(),
E               378     }
E           
E           File /mnt/azureml/cr/j/6abc2cf9fbcc4ce985da77dc3549f875/exe/wd/recommenders/evaluation/spark_evaluation.py:82, in SparkRatingEvaluation.__init__(self, rating_true, rating_pred, col_user, col_item, col_rating, col_prediction)
E                81     raise ValueError("Empty input dataframe")
E           ---> 82 if rating_pred.count() == 0:
E                83     raise ValueError("Empty input dataframe")
E           
E           File /azureml-envs/azureml_18102efe6b97bde44cf802374de73396/lib/python3.9/site-packages/pyspark/sql/dataframe.py:804, in DataFrame.count(self)
E               795 """Returns the number of rows in this :class:`DataFrame`.
E               796 
E               797 .. versionadded:: 1.3.0
E              (...)
E               802 2
E               803 """
E           --> 804 return int(self._jdf.count())
E           
E           File /azureml-envs/azureml_18102efe6b97bde44cf802374de73396/lib/python3.9/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
E              1320 answer = self.gateway_client.send_command(command)
E           -> 1321 return_value = get_return_value(
E              1322     answer, self.gateway_client, self.target_id, self.name)
E              1324 for temp_arg in temp_args:
E           
E           File /azureml-envs/azureml_18102efe6b97bde44cf802374de73396/lib/python3.9/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
E               189 try:
E           --> 190     return f(*a, **kw)
E               191 except Py4JJavaError as e:
E           
E           File /azureml-envs/azureml_18102efe6b97bde44cf802374de73396/lib/python3.9/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
E               325 if answer[1] == REFERENCE_TYPE:
E           --> 326     raise Py4JJavaError(
E               327         "An error occurred while calling {0}{1}{2}.\n".
E               328         format(target_id, ".", name), value)
E               329 else:
E           
E           <class 'str'>: (<class 'ConnectionRefusedError'>, ConnectionRefusedError(111, 'Connection refused'))
E           
E           During handling of the above exception, another exception occurred:
E           
E           ConnectionRefusedError                    Traceback (most recent call last)




	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/26/temp_shuffle_09251939-1973-40ac-bfbe-093c81f8b7b1
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/0e/temp_shuffle_caf293c6-d182-442b-a78d-3eeb783e2014
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/24/temp_shuffle_4f758882-c6c9-408b-a314-77aa39e26848
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/1a/temp_shuffle_971782ea-211f-4985-9432-a575c6ec8a2a
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/26/temp_shuffle_f56e696e-de2e-41c4-b77d-81cbf0439801
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/0c/temp_shuffle_3b78c690-f247-49e7-afe0-85777a2de104
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/28/temp_shuffle_d671fbef-ab20-4908-ae22-2e866f7ad08e
23/03/01 21:24:42 WARN DiskBlockObjectWriter: Error deleting /tmp/blockmgr-5e0c5609-34d6-4017-a8a5-b45bccf081d7/09/temp_shuffle_cffadd45-508c-4fd9-8d6b-bc6c6f227f03

See full stack: https://github.com/microsoft/recommenders/actions/runs/4307778700/jobs/7513250712#step:17:32694

In which platform does it happen?

AzureML VM spark

How do we replicate the issue?

Rerun https://github.com/microsoft/recommenders/actions/runs/4307778700/jobs/7513250712#step:17:32694

Expected behavior (i.e. solution)

Tests pass (green).

Other Comments

@miguelgfierro miguelgfierro added the bug Something isn't working label Mar 2, 2023
@miguelgfierro
Collaborator Author

In this run, Python 3.8 works but Python 3.9 fails: https://github.com/microsoft/recommenders/actions/runs/4317256439/jobs/7534098288

@miguelgfierro
Collaborator Author

miguelgfierro commented Mar 9, 2023

Trying locally:

(reco_pyspark) miguel@miguel:~/MS/recommenders$ pytest tests/integration/examples/test_notebooks_pyspark.py::test_benchmark_movielens_pyspark
================================================= test session starts ==================================================
platform linux -- Python 3.8.15, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/miguel/MS/recommenders, configfile: pyproject.toml
plugins: hypothesis-6.58.1, anyio-3.6.2
collected 1 item

tests/integration/examples/test_notebooks_pyspark.py



============================================================================== FAILURES ===============================================================================
________________________________________________ test_benchmark_movielens_pyspark[size0-algos0-expected_values_ndcg0] _________________________________________________
notebooks = {'als_deep_dive': '/home/miguel/MS/recommenders/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'als_p...ne_deep_dive.ipynb', 'benchmark_movielens': '/home/miguel/MS/recommenders/examples/06_benchmarks/movielens.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3', size = ['100k'], algos = ['als'], expected_values_ndcg = [0.035812]

    @pytest.mark.spark
    @pytest.mark.notebooks
    @pytest.mark.integration
    @pytest.mark.parametrize(
        "size, algos, expected_values_ndcg",
        [
            (
                ["100k"],
                ["als"],
                [0.035812]
            ),
        ],
    )
    def test_benchmark_movielens_pyspark(notebooks, output_notebook, kernel_name, size, algos, expected_values_ndcg):
        notebook_path = notebooks["benchmark_movielens"]
>       pm.execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(data_sizes=size, algorithms=algos),
        )

tests/integration/examples/test_notebooks_pyspark.py:82:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../anaconda/envs/reco_pyspark/lib/python3.8/site-packages/papermill/execute.py:128: in execute_notebook
    raise_for_execution_errors(nb, output_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

nb = {'cells': [{'id': 'ef3efa0e', 'cell_type': 'markdown', 'source': '<span style="color:red; font-family:Helvetica Neue, ...nd_time': '2023-03-09T14:27:03.893475', 'duration': 93.238769, 'exception': None}}, 'nbformat': 4, 'nbformat_minor': 5}
output_path = 'output.ipynb'

    def raise_for_execution_errors(nb, output_path):
        """Assigned parameters into the appropriate place in the input notebook

        Parameters
        ----------
        nb : NotebookNode
           Executable notebook object
        output_path : str
           Path to write executed notebook
        """
        error = None
        for index, cell in enumerate(nb.cells):
            if cell.get("outputs") is None:
                continue

            for output in cell.outputs:
                if output.output_type == "error":
                    if output.ename == "SystemExit" and (output.evalue == "" or output.evalue == "0"):
                        continue
                    error = PapermillExecutionError(
                        cell_index=index,
                        exec_count=cell.execution_count,
                        source=cell.source,
                        ename=output.ename,
                        evalue=output.evalue,
                        traceback=output.traceback,
                    )
                    break

        if error:
            # Write notebook back out with the Error Message at the top of the Notebook, and a link to
            # the relevant cell (by adding a note just before the failure with an HTML anchor)
            error_msg = ERROR_MESSAGE_TEMPLATE % str(error.exec_count)
            error_msg_cell = nbformat.v4.new_markdown_cell(error_msg)
            error_msg_cell.metadata['tags'] = [ERROR_MARKER_TAG]
            error_anchor_cell = nbformat.v4.new_markdown_cell(ERROR_ANCHOR_MSG)
            error_anchor_cell.metadata['tags'] = [ERROR_MARKER_TAG]

            # put the anchor before the cell with the error, before all the indices change due to the
            # heading-prepending
            nb.cells.insert(error.cell_index, error_anchor_cell)
            nb.cells.insert(0, error_msg_cell)

            write_ipynb(nb, output_path)
>           raise error
E           papermill.exceptions.PapermillExecutionError:
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [16]":
E           ---------------------------------------------------------------------------
E           Py4JJavaError                             Traceback (most recent call last)
E           File <timed exec>:56
E
E           Cell In[14], line 9, in <lambda>(test, predictions, k)
E                 1 rating_evaluator = {
E                 2     "als": lambda test, predictions: rating_metrics_pyspark(test, predictions),
E                 3     "svd": lambda test, predictions: rating_metrics_python(test, predictions),
E                 4     "fastai": lambda test, predictions: rating_metrics_python(test, predictions)
E                 5 }
E                 8 ranking_evaluator = {
E           ----> 9     "als": lambda test, predictions, k: ranking_metrics_pyspark(test, predictions, k),
E                10     "sar": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                11     "svd": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                12     "fastai": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                13     "ncf": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                14     "bpr": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                15     "bivae": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                16     "lightgcn": lambda test, predictions, k: ranking_metrics_python(test, predictions, k),
E                17 }
E
E           File ~/MS/recommenders/examples/06_benchmarks/benchmark_utils.py:382, in ranking_metrics_pyspark(test, predictions, k)
E               381 def ranking_metrics_pyspark(test, predictions, k=DEFAULT_K):
E           --> 382     rank_eval = SparkRankingEvaluation(
E               383         test, predictions, k=k, relevancy_method="top_k", **COL_DICT
E               384     )
E               385     return {
E               386         "MAP": rank_eval.map_at_k(),
E               387         "nDCG@k": rank_eval.ndcg_at_k(),
E               388         "Precision@k": rank_eval.precision_at_k(),
E               389         "Recall@k": rank_eval.recall_at_k(),
E               390     }
E
E           File ~/MS/recommenders/recommenders/evaluation/spark_evaluation.py:284, in SparkRankingEvaluation.__init__(self, rating_true, rating_pred, k, relevancy_method, col_user, col_item, col_rating, col_prediction, threshold)
E               260     raise ValueError(
E               261         "relevancy_method should be one of {}".format(
E               262             list(relevant_func.keys())
E               263         )
E               264     )
E               266 self.rating_pred = (
E               267     relevant_func[relevancy_method](
E               268         dataframe=self.rating_pred,
E              (...)
E               281     )
E               282 )
E           --> 284 self._metrics = self._calculate_metrics()
E
E           File ~/MS/recommenders/recommenders/evaluation/spark_evaluation.py:300, in SparkRankingEvaluation._calculate_metrics(self)
E               290 self._items_for_user_true = (
E               291     self.rating_true.groupBy(self.col_user)
E               292     .agg(expr("collect_list(" + self.col_item + ") as ground_truth"))
E               293     .select(self.col_user, "ground_truth")
E               294 )
E               296 self._items_for_user_all = self._items_for_user_pred.join(
E               297     self._items_for_user_true, on=self.col_user
E               298 ).drop(self.col_user)
E           --> 300 return RankingMetrics(self._items_for_user_all.rdd)
E
E           File ~/anaconda/envs/reco_pyspark/lib/python3.8/site-packages/pyspark/sql/dataframe.py:175, in DataFrame.rdd(self)
E               173 """Returns the content as an :class:`pyspark.RDD` of :class:`Row`."""
E               174 if self._lazy_rdd is None:
E           --> 175     jrdd = self._jdf.javaToPython()
E               176     self._lazy_rdd = RDD(
E               177         jrdd, self.sparkSession._sc, BatchedSerializer(CPickleSerializer())
E               178     )
E               179 return self._lazy_rdd
E
E           File ~/anaconda/envs/reco_pyspark/lib/python3.8/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
E              1315 command = proto.CALL_COMMAND_NAME +\
E              1316     self.command_header +\
E              1317     args_command +\
E              1318     proto.END_COMMAND_PART
E              1320 answer = self.gateway_client.send_command(command)
E           -> 1321 return_value = get_return_value(
E              1322     answer, self.gateway_client, self.target_id, self.name)
E              1324 for temp_arg in temp_args:
E              1325     temp_arg._detach()
E
E           File ~/anaconda/envs/reco_pyspark/lib/python3.8/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
E               188 def deco(*a: Any, **kw: Any) -> Any:
E               189     try:
E           --> 190         return f(*a, **kw)
E               191     except Py4JJavaError as e:
E               192         converted = convert_exception(e.java_exception)
E
E           File ~/anaconda/envs/reco_pyspark/lib/python3.8/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
E               324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
E               325 if answer[1] == REFERENCE_TYPE:
E           --> 326     raise Py4JJavaError(
E               327         "An error occurred while calling {0}{1}{2}.\n".
E               328         format(target_id, ".", name), value)
E               329 else:
E               330     raise Py4JError(
E               331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
E               332         format(target_id, ".", name, value))
E
E           Py4JJavaError: An error occurred while calling o352.javaToPython.
E           : java.lang.StackOverflowError
E               at java.base/java.lang.Exception.<init>(Exception.java:102)
E               at java.base/java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:89)
E               at java.base/java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:73)
E               at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
E               at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E               at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E               at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1145)
E               at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1497)
E               at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
E               at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
E               at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
E               at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
E               at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)

Tried locally, and it worked once I set the PYSPARK variables:

    os.environ["PYSPARK_PYTHON"]="/home/miguel/anaconda/envs/reco_pyspark/bin/python"
    os.environ["PYSPARK_DRIVER_PYTHON"]="/home/miguel/anaconda/envs/reco_pyspark/bin/python"
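A portable variant of the same idea (a sketch, not the exact notebook code): point both variables at the interpreter running the current process via `sys.executable`, so the Spark workers and driver use the active conda environment regardless of where it is installed.

```python
import os
import sys

# Sketch: use the currently running interpreter instead of a hardcoded path.
# These must be set before the SparkSession / SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```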

@miguelgfierro
Collaborator Author

miguelgfierro commented Mar 9, 2023

Tried - DIDN'T WORK

env:
        PYSPARK_PYTHON: python3
        PYSPARK_DRIVER_PYTHON: python3

Tried: DIDN'T WORK

env:
        PYSPARK_PYTHON: "python3"
        PYSPARK_DRIVER_PYTHON: "python3"

It worked on Python 3.8, but not on 3.9: https://github.com/microsoft/recommenders/actions/runs/4376825271/jobs/7659485596

Tried - DIDN'T WORK

env:
        PYSPARK_PYTHON: "python"
        PYSPARK_DRIVER_PYTHON: "python"

Same error: https://github.com/microsoft/recommenders/actions/runs/4377236099/jobs/7660442105

Trying to find the current Python path so it can be hardcoded. It is /opt/hostedtoolcache/Python/3.8.16/x64/bin/python, which seems to be equal to $Python_ROOT_DIR/bin/python.
Tried -> it didn't work: https://github.com/microsoft/recommenders/actions/runs/4386748345/jobs/7681200423

    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: |
          export PYSPARK_PYTHON=`which python`
          export PYSPARK_DRIVER_PYTHON=`which python`
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}} \
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }} \
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}} \
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}} \
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}} \
          --disable-warnings
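A side note on this approach (an assumption, not verified in the thread): the `export` lines run on the GitHub-hosted runner's shell, while the tests execute later on the AzureML compute target, which does not inherit the runner's environment. Locally the same exports do behave as expected:

```shell
# Exports affect only this shell and its child processes; a job submitted to a
# remote AzureML cluster by the submission script does not inherit them.
# (Falls back to python3 in case plain `python` is not on PATH.)
export PYSPARK_PYTHON="$(command -v python || command -v python3)"
export PYSPARK_DRIVER_PYTHON="$PYSPARK_PYTHON"
echo "$PYSPARK_PYTHON"
```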

Tried - DIDN'T WORK

    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: >-
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}}
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }}
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}}
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}}
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}}
          --disable-warnings
      env:
        PYSPARK_PYTHON: "$Python_ROOT_DIR/bin/python"
        PYSPARK_DRIVER_PYTHON: "$Python_ROOT_DIR/bin/python"

error: https://github.com/microsoft/recommenders/actions/runs/4386748345/jobs/7681200423
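One possible reason this form fails (an assumption, not confirmed here): GitHub Actions does not shell-expand `$VAR` references inside `env:` values, so the job would see the literal string `$Python_ROOT_DIR/bin/python` rather than a resolved path. A minimal local illustration of the same behavior:

```shell
# Simulating the workflow's env: block: single quotes keep the value literal,
# just as an Actions env: value is passed through without shell expansion.
export PYSPARK_PYTHON='$Python_ROOT_DIR/bin/python'
echo "$PYSPARK_PYTHON"   # prints the unexpanded string
```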

Try setting up java sdk in actions:
https://github.com/microsoft/recommenders/blob/1.1.0/.github/workflows/pr-gate.yml#L140 or https://github.com/microsoft/recommenders/blob/1.1.0/.github/workflows/nightly.yml#L130
Tried - DIDN'T WORK

    - name: Install Spark dependencies
      if: contains(inputs.TEST_GROUP, 'spark')
      uses: actions/[email protected]
      with:
        java-version: 11
        distribution: 'adopt'
    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: >-
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}}
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }}
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}}
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}}
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}}
          --disable-warnings
      env:
        PYSPARK_PYTHON: "$Python_ROOT_DIR/bin/python"
        PYSPARK_DRIVER_PYTHON: "$Python_ROOT_DIR/bin/python"

error: https://github.com/microsoft/recommenders/actions/runs/4449830048/jobs/7814542250

Tried

    - name: Install Spark dependencies
      if: contains(inputs.TEST_GROUP, 'spark')
      uses: actions/[email protected]
      with:
        java-version: 11
        distribution: 'adopt'
    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: |
          export PYSPARK_PYTHON=`which python`
          export PYSPARK_DRIVER_PYTHON=`which python`
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}} \
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }} \
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}} \
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}} \
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}} \
          --disable-warnings

@miguelgfierro
Collaborator Author

miguelgfierro commented Mar 17, 2023

Modifying the file that submits the job to the AzureML machine, submit_groupwise_azureml_pytest.py.

Tried removing openjdk - DIDN'T WORK

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        # conda_dep.add_conda_package(conda_pkg_jdk) # "openjdk=8"
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")

error: https://github.com/microsoft/recommenders/actions/runs/4450276641/jobs/7815538227#step:17:36974

Maybe add the environment variables inside the Docker image that goes to the VM? Tried adding this in submit_groupwise_azureml_pytest.py:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        run_azuremlcompute.environment.environment_variables = {
            "PYSPARK_PYTHON": "python3",
            "PYSPARK_DRIVER_PYTHON": "python3",
        }

Tried:

    elif add_spark_dependencies:
        run_azuremlcompute.environment.spark.enabled = True
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        # run_azuremlcompute.environment.environment_variables = {
        #     "PYSPARK_PYTHON": "python3",
        #     "PYSPARK_DRIVER_PYTHON": "python3",
        # }

error:

  File "tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py", line 209, in create_run_config
    run_azuremlcompute.environment.spark.enabled = True
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/azureml/_base_sdk_common/abstract_run_config_element.py", line 26, in __setattr__
    raise AttributeError("{} has no attribute {}".format(self.__class__, name))
AttributeError: <class 'azureml.core.environment.SparkSection'> has no attribute enabled

Tried:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        run_azuremlcompute.environment.environment_variables = {
            "PYSPARK_PYTHON": "python",
            "PYSPARK_DRIVER_PYTHON": "python",
        }

If we go to the VM that is executing the run and open the Run raw JSON, we can see that the environment variables are being filled in:

        "environment": {
            "name": "default-environment",
            "version": "Autosave_2023-03-23T10:45:41Z_ca8bf1e2",
            "assetId": "azureml://locations/eastus/workspaces/978a92da-a2ad-4447-aae1-b21196dd4a9b/environments/default-environment/versions/Autosave_2023-03-23T10:45:41Z_ca8bf1e2",
            "autoRebuild": true,
            "python": {
                "interpreterPath": "python",
                "userManagedDependencies": false,
                "condaDependencies": {
                    "name": "project_environment",
                    "dependencies": [
                        "python=3.7",
                        {
                            "pip": [
                                "azureml-defaults",
                                "pymanopt@https://github.com/pymanopt/pymanopt/archive/fb36a272cdeecb21992cfd9271eb82baafeb316d.zip",
                                "recommenders[dev,examples,spark]"
                            ]
                        },
                        "openjdk=8"
                    ],
                    "channels": [
                        "anaconda",
                        "conda-forge"
                    ]
                },
                "baseCondaEnvironment": null
            },
            "environmentVariables": {
                "PYSPARK_PYTHON": "python",
                "PYSPARK_DRIVER_PYTHON": "python"
            },

Trying to identify where the Python path is:

In action.yaml:

    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: |
          ls -lha `which python`
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}} \
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }} \
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}} \
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}} \
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}} \
          --disable-warnings

Also in submit_groupwise_azureml_pytest.py:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        run_azuremlcompute.environment.environment_variables = {
            "PYSPARK_PYTHON": "`which python`",
            "PYSPARK_DRIVER_PYTHON": "`which python`",
        }
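None of these literal values can work: the dict is serialized as-is, so the compute target receives the backticks verbatim rather than an expanded path. A minimal sketch of an alternative (an assumption about a possible fix, not what the CI currently does) is to resolve the interpreter from inside the job itself, where sys.executable is always the right path:

```python
import os
import sys

# The values in environment_variables are plain Python strings; no shell
# ever runs over them, so a value like `which python` survives verbatim.
# Resolving the interpreter of the running process sidesteps that:
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)

print(os.environ["PYSPARK_PYTHON"])
```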

Output of the path check:

  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.8.16/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib
    AZURE_HTTP_USER_AGENT: 
    AZUREPS_HOST_ENVIRONMENT: 
lrwxrwxrwx 1 runner runneradmin 9 Mar 13 22:36 /opt/hostedtoolcache/Python/3.8.16/x64/bin/python -> python3.8

Tried:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        run_azuremlcompute.environment.environment_variables = {
            "PYSPARK_PYTHON": "$Python_ROOT_DIR/bin/python",
            "PYSPARK_DRIVER_PYTHON": "$Python_ROOT_DIR/bin/python",
        }

When I echo the variables, they are not there:

Run echo $PYSPARK_PYTHON
  echo $PYSPARK_PYTHON
  echo $PYSPARK_DRIVER_PYTHON
  env
  python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername cpu-cluster \
  --subid *** --reponame "recommenders" --branch refs/heads/miguel/error_spark_benchmark \
  --rg recommenders_project_resources --wsname azureml-test-workspace --expname nightly_tests_group_spark_001 \
  --testlogs "test_logs.log" --add_spark_dependencies --testkind nightly \
  --conda_pkg_python "python=3.7" --testgroup group_spark_001 \
  --disable-warnings
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.8.16/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib
    AZURE_HTTP_USER_AGENT: 
    AZUREPS_HOST_ENVIRONMENT: 


SELENIUM_JAR_PATH=/usr/share/java/selenium-server.jar
GOROOT_1_17_X64=/opt/hostedtoolcache/go/1.17.13/x64
CONDA=/usr/share/miniconda
GITHUB_WORKSPACE=/home/runner/work/recommenders/recommenders
JAVA_HOME_11_X64=/usr/lib/jvm/temurin-11-jdk-amd64
PKG_CONFIG_PATH=/opt/hostedtoolcache/Python/3.8.16/x64/lib/pkgconfig
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_60fabb49-b8e1-4646-a0de-b82d2bcb97a5
GITHUB_ACTION=execute_tests
JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
GITHUB_RUN_NUMBER=250
RUNNER_NAME=GitHub Actions 111
GRADLE_HOME=/usr/share/gradle-8.0.2
GITHUB_REPOSITORY_OWNER_ID=6154722
XDG_CONFIG_HOME=/home/runner/.config
Python_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1
ANT_HOME=/usr/share/ant
JAVA_HOME_8_X64=/usr/lib/jvm/temurin-8-jdk-amd64
GITHUB_TRIGGERING_ACTOR=miguelgfierro
pythonLocation=/opt/hostedtoolcache/Python/3.8.16/x64
GITHUB_REF_TYPE=branch
HOMEBREW_CLEANUP_PERIODIC_FULL_DAYS=3650
ANDROID_NDK=/usr/local/lib/android/sdk/ndk/25.2.9519653
BOOTSTRAP_HASKELL_NONINTERACTIVE=1
***
PIPX_BIN_DIR=/opt/pipx_bin
GOROOT_1_20_X64=/opt/hostedtoolcache/go/1.20.2/x64
GITHUB_REPOSITORY_ID=149430917
DEPLOYMENT_BASEPATH=/opt/runner
GITHUB_ACTIONS=true
ANDROID_NDK_LATEST_HOME=/usr/local/lib/android/sdk/ndk/25.2.9519653
SYSTEMD_EXEC_PID=668
GITHUB_SHA=cb69c9069ea51d911fd7867965e05acfdda96053
GITHUB_WORKFLOW_REF=microsoft/recommenders/.github/workflows/azureml-spark-nightly.yml@refs/heads/miguel/error_spark_benchmark
POWERSHELL_DISTRIBUTION_CHANNEL=GitHub-Actions-ubuntu22
DOTNET_MULTILEVEL_LOOKUP=0
GITHUB_REF=refs/heads/miguel/error_spark_benchmark
RUNNER_OS=Linux
GITHUB_REF_PROTECTED=false
HOME=/home/runner
GITHUB_API_URL=https://api.github.com/
LANG=C.UTF-8
RUNNER_TRACKING_ID=github_524d97bc-750e-460e-a556-8a8f1aa03f57
RUNNER_ARCH=X64
RUNNER_TEMP=/home/runner/work/_temp
GITHUB_STATE=/home/runner/work/_temp/_runner_file_commands/save_state_60fabb49-b8e1-4646-a0de-b82d2bcb97a5
EDGEWEBDRIVER=/usr/local/share/edge_driver
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_60fabb49-b8e1-4646-a0de-b82d2bcb97a5
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
INVOCATION_ID=22fcd496ebde4f6dbd425ef9ad57d2ac
GITHUB_EVENT_NAME=workflow_dispatch
GITHUB_RUN_ID=4499953310
JAVA_HOME_17_X64=/usr/lib/jvm/temurin-17-jdk-amd64
ANDROID_NDK_HOME=/usr/local/lib/android/sdk/ndk/25.2.9519653
GITHUB_STEP_SUMMARY=/home/runner/work/_temp/_runner_file_commands/step_summary_60fabb49-b8e1-4646-a0de-b82d2bcb97a5
HOMEBREW_NO_AUTO_UPDATE=1
GITHUB_ACTOR=miguelgfierro
NVM_DIR=/home/runner/.nvm
SGX_AESM_ADDR=1
GITHUB_RUN_ATTEMPT=1
STATS_RDCL=true
ANDROID_HOME=/usr/local/lib/android/sdk
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
ACCEPT_EULA=Y
RUNNER_USER=runner
USER=runner
GITHUB_ACTION_PATH=/home/runner/work/recommenders/recommenders/./.github/actions/azureml-test
GITHUB_SERVER_URL=https://github.com/
PIPX_HOME=/opt/pipx
GECKOWEBDRIVER=/usr/local/share/gecko_driver
STATS_NM=true
CHROMEWEBDRIVER=/usr/local/share/chrome_driver
SHLVL=1
ANDROID_SDK_ROOT=/usr/local/lib/android/sdk
VCPKG_INSTALLATION_ROOT=/usr/local/share/vcpkg
GITHUB_ACTOR_ID=3491412
RUNNER_TOOL_CACHE=/opt/hostedtoolcache
ImageVersion=20230313.1
AZURE_HTTP_USER_AGENT=
Python3_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
DOTNET_NOLOGO=1
GITHUB_WORKFLOW_SHA=cb69c9069ea51d911fd7867965e05acfdda96053
GITHUB_REF_NAME=miguel/error_spark_benchmark
GRAALVM_11_ROOT=/usr/local/graalvm/graalvm-ce-java11-22.3.1
GITHUB_JOB=execute-tests
LD_LIBRARY_PATH=/opt/hostedtoolcache/Python/3.8.16/x64/lib
XDG_RUNTIME_DIR=/run/user/1001
AZURE_EXTENSION_DIR=/opt/az/azcliextensions
PERFLOG_LOCATION_SETTING=RUNNER_PERFLOG
GITHUB_REPOSITORY=microsoft/recommenders
Python2_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
ANDROID_NDK_ROOT=/usr/local/lib/android/sdk/ndk/25.2.9519653
CHROME_BIN=/usr/bin/google-chrome
GOROOT_1_18_X64=/opt/hostedtoolcache/go/1.18.10/x64
GITHUB_RETENTION_DAYS=90
JOURNAL_STREAM=8:16340
RUNNER_WORKSPACE=/home/runner/work/recommenders
LEIN_HOME=/usr/local/lib/lein
LEIN_JAR=/usr/local/lib/lein/self-installs/leiningen-2.10.0-standalone.jar
GITHUB_ACTION_REPOSITORY=
PATH=/opt/hostedtoolcache/Python/3.8.16/x64/bin:/opt/hostedtoolcache/Python/3.8.16/x64:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/snap/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
RUNNER_PERFLOG=/home/runner/perflog
GITHUB_BASE_REF=
GHCUP_INSTALL_BASE_PREFIX=/usr/local
CI=true
SWIFT_PATH=/usr/share/swift/usr/bin
ImageOS=ubuntu22
GITHUB_REPOSITORY_OWNER=microsoft
GITHUB_HEAD_REF=
GITHUB_ACTION_REF=
GOROOT_1_19_X64=/opt/hostedtoolcache/go/1.19.7/x64
GITHUB_WORKFLOW=azureml-spark-nightly
DEBIAN_FRONTEND=noninteractive
GITHUB_OUTPUT=/home/runner/work/_temp/_runner_file_commands/set_output_60fabb49-b8e1-4646-a0de-b82d2bcb97a5
AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache
AZUREPS_HOST_ENVIRONMENT=
_=/usr/bin/env

New error:

E           : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (localhost executor driver): java.io.IOException: Cannot run program "$Python_ROOT_DIR/bin/python": error=2, No such file or directory
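This error is consistent with Spark launching the Python worker through an exec-style call rather than a shell, so $Python_ROOT_DIR is never expanded and the literal string is used as the program path. The same error=2 (No such file or directory) failure can be reproduced locally with a small sketch, independent of AzureML:

```python
import subprocess

# An exec-style launch (no shell) uses the argv[0] string verbatim, so the
# unexpanded "$Python_ROOT_DIR/..." fails exactly like in the Spark log:
try:
    subprocess.run(["$Python_ROOT_DIR/bin/python", "--version"])
except FileNotFoundError as exc:
    print("FileNotFoundError:", exc)
```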

@miguelgfierro

miguelgfierro commented Mar 23, 2023

In action.yaml:

    - name: Submit PySpark tests to AzureML
      shell: bash
      if: contains(inputs.TEST_GROUP, 'spark')
      run: |
          echo $PYSPARK_PYTHON
          echo $PYSPARK_DRIVER_PYTHON
          export PYSPARK_PYTHON=$Python_ROOT_DIR/bin/python
          export PYSPARK_DRIVER_PYTHON=$Python_ROOT_DIR/bin/python
          echo $PYSPARK_PYTHON
          echo $PYSPARK_DRIVER_PYTHON
          env
          python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername ${{inputs.CPU_CLUSTER_NAME}} \
          --subid ${{inputs.AZUREML_TEST_SUBID}} --reponame "recommenders" --branch ${{ github.ref }} \
          --rg ${{inputs.RG}} --wsname ${{inputs.WS}} --expname ${{inputs.EXP_NAME}}_${{inputs.TEST_GROUP}} \
          --testlogs ${{inputs.TEST_LOGS_PATH}} --add_spark_dependencies --testkind ${{inputs.TEST_KIND}} \
          --conda_pkg_python ${{inputs.PYTHON_VERSION}} --testgroup ${{inputs.TEST_GROUP}} \
          --disable-warnings

In submit_groupwise_azureml_pytest.py:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        run_azuremlcompute.environment_variables = {
            "PYSPARK_PYTHON": "$Python_ROOT_DIR/bin/python",
            "PYSPARK_DRIVER_PYTHON": "$Python_ROOT_DIR/bin/python",
        }

Error https://github.com/microsoft/recommenders/actions/runs/4500157274/jobs/7918919866#step:17:2070, the variables are not being set

echo $PYSPARK_PYTHON
  echo $PYSPARK_DRIVER_PYTHON
  export PYSPARK_PYTHON=$Python_ROOT_DIR/bin/python
  export PYSPARK_DRIVER_PYTHON=$Python_ROOT_DIR/bin/python
  echo $PYSPARK_PYTHON
  echo $PYSPARK_DRIVER_PYTHON
  env
  python tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py --clustername cpu-cluster \
  --subid *** --reponame "recommenders" --branch refs/heads/miguel/error_spark_benchmark \
  --rg recommenders_project_resources --wsname azureml-test-workspace --expname nightly_tests_group_spark_001 \
  --testlogs "test_logs.log" --add_spark_dependencies --testkind nightly \
  --conda_pkg_python "python=3.8" --testgroup group_spark_001 \
  --disable-warnings
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.8.16/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.8.16/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.16/x64/lib
    AZURE_HTTP_USER_AGENT: 
    AZUREPS_HOST_ENVIRONMENT: 


/opt/hostedtoolcache/Python/3.8.16/x64/bin/python
/opt/hostedtoolcache/Python/3.8.16/x64/bin/python
SELENIUM_JAR_PATH=/usr/share/java/selenium-server.jar
GOROOT_1_17_X64=/opt/hostedtoolcache/go/1.17.13/x64
CONDA=/usr/share/miniconda
GITHUB_WORKSPACE=/home/runner/work/recommenders/recommenders
JAVA_HOME_11_X64=/usr/lib/jvm/temurin-11-jdk-amd64
PKG_CONFIG_PATH=/opt/hostedtoolcache/Python/3.8.16/x64/lib/pkgconfig
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_aa640891-66a6-43ce-ac92-ede626bce8a0
GITHUB_ACTION=execute_tests
PYSPARK_DRIVER_PYTHON=/opt/hostedtoolcache/Python/3.8.16/x64/bin/python
JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
GITHUB_RUN_NUMBER=252
RUNNER_NAME=GitHub Actions 35
GRADLE_HOME=/usr/share/gradle-8.0.2
GITHUB_REPOSITORY_OWNER_ID=6154722
XDG_CONFIG_HOME=/home/runner/.config
Python_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1
ANT_HOME=/usr/share/ant
JAVA_HOME_8_X64=/usr/lib/jvm/temurin-8-jdk-amd64
GITHUB_TRIGGERING_ACTOR=miguelgfierro
pythonLocation=/opt/hostedtoolcache/Python/3.8.16/x64
GITHUB_REF_TYPE=branch
HOMEBREW_CLEANUP_PERIODIC_FULL_DAYS=3650
ANDROID_NDK=/usr/local/lib/android/sdk/ndk/25.2.9519653
BOOTSTRAP_HASKELL_NONINTERACTIVE=1
***
PIPX_BIN_DIR=/opt/pipx_bin
GOROOT_1_20_X64=/opt/hostedtoolcache/go/1.20.2/x64
GITHUB_REPOSITORY_ID=149430917
DEPLOYMENT_BASEPATH=/opt/runner
GITHUB_ACTIONS=true
ANDROID_NDK_LATEST_HOME=/usr/local/lib/android/sdk/ndk/25.2.9519653
SYSTEMD_EXEC_PID=663
GITHUB_SHA=593e4d01308ffff87f617e099e24841731f3da8c
GITHUB_WORKFLOW_REF=microsoft/recommenders/.github/workflows/azureml-spark-nightly.yml@refs/heads/miguel/error_spark_benchmark
POWERSHELL_DISTRIBUTION_CHANNEL=GitHub-Actions-ubuntu22
DOTNET_MULTILEVEL_LOOKUP=0
GITHUB_REF=refs/heads/miguel/error_spark_benchmark
RUNNER_OS=Linux
GITHUB_REF_PROTECTED=false
HOME=/home/runner
GITHUB_API_URL=https://api.github.com
LANG=C.UTF-8
RUNNER_TRACKING_ID=github_525a3fde-540d-4c15-ac94-ca6cbfb5f6e1
RUNNER_ARCH=X64
RUNNER_TEMP=/home/runner/work/_temp
GITHUB_STATE=/home/runner/work/_temp/_runner_file_commands/save_state_aa640891-66a6-43ce-ac92-ede626bce8a0
EDGEWEBDRIVER=/usr/local/share/edge_driver
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_aa640891-66a6-43ce-ac92-ede626bce8a0
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
INVOCATION_ID=daf2633e885845d48d7ee646318998ae
GITHUB_EVENT_NAME=workflow_dispatch
GITHUB_RUN_ID=4500157274
JAVA_HOME_17_X64=/usr/lib/jvm/temurin-17-jdk-amd64
ANDROID_NDK_HOME=/usr/local/lib/android/sdk/ndk/25.2.9519653
GITHUB_STEP_SUMMARY=/home/runner/work/_temp/_runner_file_commands/step_summary_aa640891-66a6-43ce-ac92-ede626bce8a0
HOMEBREW_NO_AUTO_UPDATE=1
GITHUB_ACTOR=miguelgfierro
NVM_DIR=/home/runner/.nvm
PYSPARK_PYTHON=/opt/hostedtoolcache/Python/3.8.16/x64/bin/python
SGX_AESM_ADDR=1
GITHUB_RUN_ATTEMPT=1
STATS_RDCL=true
ANDROID_HOME=/usr/local/lib/android/sdk
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
ACCEPT_EULA=Y
RUNNER_USER=runner
USER=runner
GITHUB_ACTION_PATH=/home/runner/work/recommenders/recommenders/./.github/actions/azureml-test
GITHUB_SERVER_URL=https://github.com
PIPX_HOME=/opt/pipx
GECKOWEBDRIVER=/usr/local/share/gecko_driver
STATS_NM=true
CHROMEWEBDRIVER=/usr/local/share/chrome_driver
SHLVL=1
ANDROID_SDK_ROOT=/usr/local/lib/android/sdk
VCPKG_INSTALLATION_ROOT=/usr/local/share/vcpkg
GITHUB_ACTOR_ID=3491412
RUNNER_TOOL_CACHE=/opt/hostedtoolcache
ImageVersion=20230317.1
AZURE_HTTP_USER_AGENT=
Python3_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
DOTNET_NOLOGO=1
GITHUB_WORKFLOW_SHA=593e4d01308ffff87f617e099e24841731f3da8c
GITHUB_REF_NAME=miguel/error_spark_benchmark
GRAALVM_11_ROOT=/usr/local/graalvm/graalvm-ce-java11-22.3.1
GITHUB_JOB=execute-tests
LD_LIBRARY_PATH=/opt/hostedtoolcache/Python/3.8.16/x64/lib
XDG_RUNTIME_DIR=/run/user/1001
AZURE_EXTENSION_DIR=/opt/az/azcliextensions
PERFLOG_LOCATION_SETTING=RUNNER_PERFLOG
GITHUB_REPOSITORY=microsoft/recommenders
Python2_ROOT_DIR=/opt/hostedtoolcache/Python/3.8.16/x64
ANDROID_NDK_ROOT=/usr/local/lib/android/sdk/ndk/25.2.9519653
CHROME_BIN=/usr/bin/google-chrome
GOROOT_1_18_X64=/opt/hostedtoolcache/go/1.18.10/x64
GITHUB_RETENTION_DAYS=90
JOURNAL_STREAM=8:17118
RUNNER_WORKSPACE=/home/runner/work/recommenders
LEIN_HOME=/usr/local/lib/lein
LEIN_JAR=/usr/local/lib/lein/self-installs/leiningen-2.10.0-standalone.jar
GITHUB_ACTION_REPOSITORY=
PATH=/opt/hostedtoolcache/Python/3.8.16/x64/bin:/opt/hostedtoolcache/Python/3.8.16/x64:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/snap/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
RUNNER_PERFLOG=/home/runner/perflog
GITHUB_BASE_REF=
GHCUP_INSTALL_BASE_PREFIX=/usr/local
CI=true
SWIFT_PATH=/usr/share/swift/usr/bin
ImageOS=ubuntu22
GITHUB_REPOSITORY_OWNER=microsoft
GITHUB_HEAD_REF=
GITHUB_ACTION_REF=
GOROOT_1_19_X64=/opt/hostedtoolcache/go/1.19.7/x64
GITHUB_WORKFLOW=azureml-spark-nightly
DEBIAN_FRONTEND=noninteractive
GITHUB_OUTPUT=/home/runner/work/_temp/_runner_file_commands/set_output_aa640891-66a6-43ce-ac92-ede626bce8a0
AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache
AZUREPS_HOST_ENVIRONMENT=
_=/usr/bin/env

But then there is also this error:

E           : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (localhost executor driver): java.io.IOException: Cannot run program "$Python_ROOT_DIR/bin/python": error=2, No such file or directory

If I comment out the env variables:

    elif add_spark_dependencies:
        conda_dep.add_channel("conda-forge")
        conda_dep.add_conda_package(conda_pkg_jdk)
        conda_dep.add_pip_package("recommenders[dev,examples,spark]")
        # run_azuremlcompute.environment_variables = {
        #     "PYSPARK_PYTHON": "$Python_ROOT_DIR/bin/python",
        #     "PYSPARK_DRIVER_PYTHON": "$Python_ROOT_DIR/bin/python",
        # }

Error: https://github.com/microsoft/recommenders/actions/runs/4500303767/jobs/7919262092 -> Connection Refused. This time it didn't say anything about "$Python_ROOT_DIR/bin/python": error=2, No such file or directory, because the env variables weren't set.

Try: recreate the cluster with SSH access and try again

@miguelgfierro

from @loomlike:

This seems to get the Spark executor working, but he still gets a StackOverflowError:

    script_run_config = ScriptRunConfig(
        source_directory=".",
        # script=test,
        run_config=run_config,
        # arguments=arguments,
        # FIXME
        command=f"export PYSPARK_PYTHON=$(which python) && export PYSPARK_DRIVER_PYTHON=$(which python) && unset SPARK_HOME && python {test}".split() + arguments,
        # docker_runtime_config=dc
    )

error:

Py4JJavaError: An error occurred while calling o352.javaToPython.
E           : java.lang.StackOverflowError

E           --> 382     rank_eval = SparkRankingEvaluation(
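A commonly cited mitigation for java.lang.StackOverflowError in iterative Spark jobs is to enlarge the JVM thread stack. This is only a hedged sketch (the -Xss value and script name are hypothetical, and it has not been verified against this run):

```shell
# Hypothetical spark-submit flags; the -Xss value would need tuning,
# and my_test.py stands in for the actual test script.
spark-submit \
  --conf spark.driver.extraJavaOptions=-Xss8m \
  --conf spark.executor.extraJavaOptions=-Xss8m \
  my_test.py
```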

@loomlike loomlike mentioned this issue Mar 30, 2023
miguelgfierro added a commit that referenced this issue Mar 30, 2023