`IndexError` in `test_read_parquet_partitioned_filtered[True-files-pfilters1]` #15295

jakirkham · 2024-03-13T20:38:54Z

Seeing this test failure on CI:

=================================== FAILURES ===================================
_________ test_read_parquet_partitioned_filtered[True-files-pfilters1] _________
[gw1] linux -- Python 3.9.18 /opt/conda/envs/test/bin/python3.9

tmpdir = local('/tmp/pytest-of-root/pytest-0/popen-gw1/test_read_parquet_partitioned_3')
pfilters = [('b', '==', 'a'), ('c', '==', 1)], selection = 'files'
use_cat = True

    @pytest.mark.parametrize(
        "pfilters",
        [[("b", "==", "b")], [("b", "==", "a"), ("c", "==", 1)]],
    )
    @pytest.mark.parametrize("selection", ["directory", "files", "row-groups"])
    @pytest.mark.parametrize("use_cat", [True, False])
    def test_read_parquet_partitioned_filtered(
        tmpdir, pfilters, selection, use_cat
    ):
        path = str(tmpdir)
        size = 100
        df = cudf.DataFrame(
            {
                "a": np.arange(0, stop=size, dtype="int64"),
                "b": np.random.choice(list("abcd"), size=size),
                "c": np.random.choice(np.arange(4), size=size),
            }
        )
        df.to_parquet(path, partition_cols=["c", "b"])
    
        if selection == "files":
            # Pass in a list of paths
            fs = get_fs_token_paths(path)[0]
            read_path = fs.find(path)
            row_groups = None
        elif selection == "row-groups":
            # Pass in a list of paths AND row-group ids
            fs = get_fs_token_paths(path)[0]
            read_path = fs.find(path)
            row_groups = [[0] for p in read_path]
        else:
            # Pass in a directory path
            # (row-group selection not allowed in this case)
            read_path = path
            row_groups = None
    
        # Filter on partitioned columns
        expect = pd.read_parquet(read_path, filters=pfilters)
>       got = cudf.read_parquet(
            read_path,
            filters=pfilters,
            row_groups=row_groups,
            categorical_partitions=use_cat,
        )

tests/test_parquet.py:2144: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:577: in read_parquet
    df = _parquet_to_frame(
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:721: in _parquet_to_frame
    return _read_parquet(
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:831: in _read_parquet
    return libparquet.read_parquet(
parquet.pyx:124: in cudf._lib.parquet.read_parquet
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   IndexError: list index out of range

parquet.pyx:275: IndexError
-------- generated xml file: /__w/cudf/cudf/test-results/junit-cudf.xml --------

Edit: Seen recently on an unrelated PR, a Doxygen build fix (#15289)

jakirkham · 2024-03-13T20:39:19Z

Looks like the Spark team ran into this recently as well: #15219 (comment)

xref #15295

Hoping to make this test easier to debug if the input data is deterministic

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15296
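For illustration, here is a minimal sketch of what deterministic input for this test could look like. It is not the actual change from #15296, which this thread does not show. The helper name `make_columns` and the seed value are hypothetical, and only NumPy is used so the snippet runs without a GPU.

```python
import numpy as np


def make_columns(size=100, seed=0):
    """Build the test's three columns from a seeded generator.

    Hypothetical helper: the name and the seed value are assumptions,
    not taken from the cuDF test suite.
    """
    rng = np.random.default_rng(seed)
    return {
        "a": np.arange(0, stop=size, dtype="int64"),
        "b": rng.choice(list("abcd"), size=size),
        "c": rng.choice(np.arange(4), size=size),
    }


# Two calls with the same seed yield identical data, so a failing
# parametrization can be replayed exactly.
first = make_columns()
second = make_columns()
assert all(np.array_equal(first[k], second[k]) for k in first)
```

With a fixed seed, the set of partition directories written by `to_parquet(path, partition_cols=["c", "b"])` would be the same on every run, which is what makes a failure reproducible.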

vyasr · 2024-05-17T20:19:54Z

Seems like this was fixed by #15296. Perhaps we were randomly generating invalid data on occasion. Feel free to reopen if we find a meaningful reproducer again.
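One way unseeded data could occasionally be "invalid" is that no row has `b == "a"` and `c == 1`, so the `pfilters1` filter matches no partition at all. This is an assumption, not something the thread confirms. The sketch below estimates how often that would happen with 100 rows and 4 x 4 equally likely combinations; the integer encoding of `b` is a simplification for the simulation.

```python
import numpy as np

size = 100          # rows in the test DataFrame
n_combos = 4 * 4    # 4 values of "b" times 4 values of "c"

# Exact probability that one specific (b, c) combination never appears
# when both columns are drawn independently and uniformly.
p_missing = (1 - 1 / n_combos) ** size

# Monte Carlo check of the same quantity (b encoded as 0..3, 0 meaning "a").
rng = np.random.default_rng(0)
trials = 20_000
b = rng.integers(0, 4, size=(trials, size))
c = rng.integers(0, 4, size=(trials, size))
present = ((b == 0) & (c == 1)).any(axis=1)
empirical = 1 - present.mean()

print(f"exact: {p_missing:.5f}, simulated: {empirical:.5f}")
```

Under these assumptions the combination is missing in roughly 0.16% of runs, which would be consistent with a failure that only shows up occasionally on CI.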

github-project-automation bot added this to cuDF/Dask/Numba/UCX Mar 13, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Mar 13, 2024

jakirkham mentioned this issue Mar 13, 2024: Fix Doxygen check #15289 (Merged)

mroeschke mentioned this issue Mar 13, 2024: Make test_read_parquet_partitioned_filtered data deterministic #15296 (Merged)

vyasr closed this as completed May 17, 2024

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX May 17, 2024
