Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43868: [CI][Python] Skip test that requires PARQUET_TEST_DATA env on emscripten #43906

Merged
merged 5 commits into from
Sep 18, 2024

Conversation

raulcd
Copy link
Member

@raulcd raulcd commented Sep 2, 2024

Rationale for this change

The following PR:

Made mandatory for a test the requirement to have PARQUET_TEST_DATA env defined.

This is currently not available from python_test_emscripten.sh as we require to mount the filesystem for both Node and ChromeDriver.

What changes are included in this PR?

Skip the test that requires PARQUET_TEST_DATA for emscripten.

Are these changes tested?

Via archery

Are there any user-facing changes?

No

@raulcd
Copy link
Member Author

raulcd commented Sep 2, 2024

@github-actions crossbow submit test-conda-python-emscripten

Copy link

github-actions bot commented Sep 2, 2024

⚠️ GitHub issue #43905 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Sep 2, 2024
Copy link

github-actions bot commented Sep 2, 2024

Revision: 3091b51

Submitted crossbow builds: ursacomputing/crossbow @ actions-d04eba703d

Task Status
test-conda-python-emscripten GitHub Actions

@raulcd
Copy link
Member Author

raulcd commented Sep 2, 2024

ok, this does not work on emscripten. I have no idea how environment variables work there and / or if they work. We probably should skip this test on that environment. I am also not sure if we should just delete this check, my expectation is that if a test requires this to be set will already fail if it does not find the parquet test data files. @jorisvandenbossche should we just remove this check?

if not result:
raise RuntimeError('Please point the PARQUET_TEST_DATA environment '
'variable to the test data directory')

edit: added link to check

@pitrou
Copy link
Member

pitrou commented Sep 2, 2024

cc @kou for insight.

@kou
Copy link
Member

kou commented Sep 3, 2024

I think that we can't inject Parquet test data directory path by environment variable.
Because environment variable doesn't exist on Web browser.

But... we may be able to use https://github.com/pyodide/pyodide/blob/a448c045cf5640c5050e0c3722bafc0d58659fcc/src/js/pyodide.ts#L151-L160 :

diff --git a/python/scripts/run_emscripten_tests.py b/python/scripts/run_emscripten_tests.py
index 1a4b4a4e05..79996b1f92 100644
--- a/python/scripts/run_emscripten_tests.py
+++ b/python/scripts/run_emscripten_tests.py
@@ -19,6 +19,7 @@
 import argparse
 import contextlib
 import http.server
+import json
 import os
 import queue
 import shutil
@@ -153,7 +154,7 @@ class NodeDriver:
         self.execute_js(
             f"""
         const {{ loadPyodide }} = require('{dist_dir}/pyodide.js');
-        let pyodide = await loadPyodide();
+        let pyodide = await loadPyodide({json.dumps({"env": {"PARQUET_TEST_DATA": os.environ.get("PARQUET_TEST_DATA")}})});
         """
         )
 

@raulcd
Copy link
Member Author

raulcd commented Sep 3, 2024

@github-actions crossbow submit test-conda-python-emscripten

@raulcd raulcd changed the title GH-43905: [CI][Python] Add required PARQUET_TEST_DATA env to python_test_emscripten.sh GH-43868: [CI][Python] Add required PARQUET_TEST_DATA env to python_test_emscripten.sh Sep 3, 2024
Copy link

github-actions bot commented Sep 3, 2024

⚠️ GitHub issue #43868 has been automatically assigned in GitHub to PR creator.

Copy link

github-actions bot commented Sep 3, 2024

Revision: 3d3708a

Submitted crossbow builds: ursacomputing/crossbow @ actions-8e719208e8

Task Status
test-conda-python-emscripten GitHub Actions

@raulcd
Copy link
Member Author

raulcd commented Sep 3, 2024

I was testing locally and it seems that even though the file exists at the location inside the container:

root@4084a63715db:/# ls -lrt /arrow/cpp/submodules/parquet-testing/data/column_chunk_key_value_metadata.parquet 
-rw-rw-r-- 1 1000 1000 400 Sep  2 11:59 /arrow/cpp/submodules/parquet-testing/data/column_chunk_key_value_metadata.parquet

The test fails:

_____________________ test_column_chunk_key_value_metadata _____________________

parquet_test_datadir = PosixPath('/arrow/cpp/submodules/parquet-testing/data')

    def test_column_chunk_key_value_metadata(parquet_test_datadir):
>       metadata = pq.read_metadata(parquet_test_datadir /
                                    'column_chunk_key_value_metadata.parquet')

/lib/python3.12/site-packages/pyarrow/tests/parquet/test_metadata.py:788: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/lib/python3.12/site-packages/pyarrow/parquet/core.py:2297: in read_metadata
    file_ctx = where = filesystem.open_input_file(where)
pyarrow/_fs.pyx:789: in pyarrow._fs.FileSystem.open_input_file
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   FileNotFoundError: [Errno 44] Failed to open local file '/arrow/cpp/submodules/parquet-testing/data/column_chunk_key_value_metadata.parquet'. Detail: [errno 44] No such file or directory

pyarrow/error.pxi:92: FileNotFoundError

@jorisvandenbossche

This comment was marked as outdated.

@jorisvandenbossche
Copy link
Member

cc @joemarshall

@jorisvandenbossche
Copy link
Member

For setting the env variable, could we modify os.environ directly in python code? (eg in _load_pyarrow_in_runner)

@raulcd
Copy link
Member Author

raulcd commented Sep 3, 2024

@github-actions crossbow submit test-conda-python-emscripten

@raulcd
Copy link
Member Author

raulcd commented Sep 3, 2024

Pushing a test skip that works locally but happy if @joemarshall or someone else comes with a solution that does not require us to skip the test.

Copy link

github-actions bot commented Sep 3, 2024

Revision: 6a5c514

Submitted crossbow builds: ursacomputing/crossbow @ actions-e9434c5a2d

Task Status
test-conda-python-emscripten GitHub Actions

@jorisvandenbossche
Copy link
Member

Skipping the test is certainly fine with me

@pitrou
Copy link
Member

pitrou commented Sep 3, 2024

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Sep 3, 2024
@raulcd
Copy link
Member Author

raulcd commented Sep 4, 2024

I've been able to have the test passing locally for the NodeDriver with this (and some other minor changes):

diff --git a/python/scripts/run_emscripten_tests.py b/python/scripts/run_emscripten_tests.py
index 53d3dd5..550e1a0 100644
--- a/python/scripts/run_emscripten_tests.py
+++ b/python/scripts/run_emscripten_tests.py
@@ -19,6 +19,7 @@
 import argparse
 import contextlib
 import http.server
+import json
 import os
 import queue
 import shutil
@@ -153,7 +154,8 @@ class NodeDriver:
         self.execute_js(
             f"""
         const {{ loadPyodide }} = require('{dist_dir}/pyodide.js');
-        let pyodide = await loadPyodide();
+        let pyodide = await loadPyodide({json.dumps({"env": {"PARQUET_TEST_DATA": "/test-data"}})});
+        pyodide.mountNodeFS("/test-data", "{os.environ.get("PARQUET_TEST_DATA")}");
         """
         )

but the test still fails for the ChromeDriver, I tried a couple of things, but they just crash, so I am not sure how to do the mount for the ChromeDriver.

@raulcd
Copy link
Member Author

raulcd commented Sep 17, 2024

@github-actions crossbow submit test-conda-python-emscripten

Copy link

Revision: e918ffa

Submitted crossbow builds: ursacomputing/crossbow @ actions-295c35fb98

Task Status
test-conda-python-emscripten GitHub Actions

@raulcd raulcd marked this pull request as ready for review September 17, 2024 09:35
@raulcd
Copy link
Member Author

raulcd commented Sep 17, 2024

I am marking this as ready for review just skipping the test, I can create an issue as a follow up in order to mount PARQUET_TEST_DATA if someone wants to take that over. @pitrou @jorisvandenbossche are you ok skipping the test to fix CI?

@pitrou
Copy link
Member

pitrou commented Sep 17, 2024

@raulcd Can you fix this PR's title and description?

@raulcd raulcd changed the title GH-43868: [CI][Python] Add required PARQUET_TEST_DATA env to python_test_emscripten.sh GH-43868: [CI][Python] Skip test that requires PARQUET_TEST_DATA env on emscripten Sep 17, 2024
@raulcd
Copy link
Member Author

raulcd commented Sep 17, 2024

@raulcd Can you fix this PR's title and description?

Sorry about that, I forgot to update the PR title + desc after several iterations. Thanks for double checking @pitrou

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kou kou merged commit fe39c8f into apache:main Sep 18, 2024
18 checks passed
@kou kou removed the awaiting change review Awaiting change review label Sep 18, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Sep 18, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fe39c8f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants