
Update the Categorify operator to set the domain max correctly #1641

Merged
merged 3 commits on Aug 15, 2022

Conversation

oliverholworthy
Member

@oliverholworthy commented on Aug 9, 2022

Goal

Reduce the resulting int_domain.max property on a ColumnSchema by one after transforming with Categorify, so that it correctly matches the maximum encoded value in the data.
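
As a minimal sketch of the arithmetic behind this change (the helper below is hypothetical, not part of the Categorify internals): a column with n unique values is encoded as the integers {0, 1, ..., n}, where 0 is reserved for out-of-vocabulary, so the largest encoded id is the cardinality minus one.

def domain_bounds(n_unique_categories: int) -> dict:
    # unique values plus the reserved out-of-vocabulary index 0
    cardinality = n_unique_categories + 1
    # domain.max should be the largest id actually present in the encoded data
    return {"min": 0, "max": cardinality - 1}

print(domain_bounds(2))  # {'min': 0, 'max': 2} -- matches the two-row example below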

Motivation / Context

This PR was motivated by work on NVIDIA-Merlin/Merlin#479

We use domain.max to compute the vocabulary size / cardinality when creating embedding tables in Merlin Models. This off-by-one error causes confusion when trying to build embedding tables of the correct shape from pre-trained embedding data.
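
For context, the sketch below illustrates (it is not Merlin Models' actual implementation) how a consumer might size an embedding table from the schema; int_domain_max stands in for the properties.domain.max value on the encoded column.

import numpy as np

def build_embedding_table(int_domain_max: int, dim: int = 16) -> np.ndarray:
    # valid encoded ids are 0..int_domain_max, so the table needs
    # int_domain_max + 1 rows (i.e. the cardinality)
    vocab_size = int_domain_max + 1
    return np.zeros((vocab_size, dim))

# With the current (pre-fix) value of 3 this allocates 4 rows for a column whose
# ids only go up to 2; with the corrected value of 2 it allocates exactly 3 rows.
assert build_embedding_table(2).shape == (3, 16)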

Example

import uuid
import pandas as pd

from merlin.io import Dataset
from nvtabular import Workflow
from nvtabular.ops import Categorify

df = pd.DataFrame({"id": [str(uuid.uuid4()) for _ in range(2)]})
dataset = Dataset(df)

dataset

                                     id
0  fc5f18c4-919f-4496-9209-1ae34aa4230d
1  738f873f-5fa7-4345-9daa-b1f714c9f1aa

dataset.schema

  name tags   dtype  is_list  is_ragged
0   id   ()  object    False      False

After the Categorify op, these ids are transformed to the integers {1, 2}, with 0 reserved for out-of-vocabulary values. So we have a cardinality of 3 (including the zero).

workflow = Workflow(["id"] >> Categorify())
transformed_dataset = workflow.fit_transform(dataset)

transformed_dataset:

   id
0   2
1   1

transformed_dataset.schema

name:                                    id
tags:                                    (Tags.CATEGORICAL)
dtype:                                   int64
is_list:                                 False
is_ragged:                               False
properties.num_buckets:                  None
properties.freq_threshold:               0
properties.max_size:                     0
properties.start_index:                  0
properties.cat_path:                     .//categories/unique.id.parquet
properties.domain.min:                   0
properties.domain.max:                   3
properties.domain.name:                  id
properties.embedding_sizes.cardinality:  3
properties.embedding_sizes.dimension:    16

With the current implementation, the int_domain.max value after the transform in this example is 3, which is the same as the cardinality. The maximum integer value actually present in the transformed data, however, is one less than the cardinality: 2.
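
As an illustrative check of the behaviour this PR targets (the accessors below assume the nested property layout shown in the schema output above and may differ from the exact schema API):

id_col = transformed_dataset.schema.column_schemas["id"]
domain = id_col.properties["domain"]
cardinality = id_col.properties["embedding_sizes"]["cardinality"]

# The largest encoded id in this example is 2, so after this change domain["max"]
# should equal cardinality - 1 rather than the cardinality itself.
assert domain["max"] == cardinality - 1 == 2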

@oliverholworthy added the bug (Something isn't working) label on Aug 9, 2022
@oliverholworthy added this to the Merlin 22.08 milestone on Aug 9, 2022
@oliverholworthy self-assigned this on Aug 9, 2022
@nvidia-merlin-bot
Contributor

CI Results
GitHub pull request #1641 of commit 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4, no merge conflicts.
Running as SYSTEM
Setting status of 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4615/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4^{commit} # timeout=10
Checking out Revision 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10
Commit message: "Update `Categorify` operator to set the domain max correctly"
 > git rev-list --no-walk c2a5b743c7a0b458be7af4ca96da091887a044b9 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins14816013642204511087.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F..F..........FF.F [ 3%]
...F..............................................................FFF... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py FF [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:30,251 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-1852321342b43b35c1c4d664628b409a', 0)
Function: subgraph_callable-11603efe-dc29-4a19-8a57-01e58196
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:32,258 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-7dbace1faf4aac58bf4b5a9808158f3e', 1)
Function: subgraph_callable-ddf3578b-55e1-4bac-ad05-4a5c5fa7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_______ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1] ________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:38,567 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-60a11de382c15140e4b6043c3fd932a0', 0)
Function: subgraph_callable-5674b3d5-3fa9-4e20-8649-a7fb3f72
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:39,541 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c58c6d8297d1363e166bbae0ba2b7cbc', 0)
Function: subgraph_callable-d0c1022a-5b14-4dfd-96bc-4d734a7f
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_________ test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:41,172 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4bd0270b5198abc9c84600ff73978b63', 0)
Function: subgraph_callable-f60f3e2c-6e85-43cb-b01b-c5c57c37
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:43,498 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4b353e97dab5f3f10917661932e5e9dc', 0)
Function: subgraph_callable-6a1bdeef-16c0-4d80-92c5-33cc15c9
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:19:24,201 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 14)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:19:24,204 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 13)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:24,205 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 15)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:19:25,124 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 13)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,128 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 2)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 12)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 14)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,135 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 0)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,136 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 1)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 11)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 10)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,138 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 15)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:19:25,811 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 22)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,813 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 20)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,815 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 21)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 16)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 18)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,818 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 17)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,824 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 19)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,837 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 26)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,842 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 28)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,845 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 24)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,847 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 23)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,849 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 27)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,852 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 25)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,864 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 30)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,868 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 29)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,870 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 31)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
_stacktrace = <traceback object at 0x7f60989d9c40>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609aa49100>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
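The [Errno 111] Connection refused above means nothing was listening on the mock S3 endpoint when the test tried to create the bucket; botocore then re-raises it as the EndpointConnectionError handled in the chained traceback below. A minimal connectivity check (not part of the test suite), assuming the moto/mock S3 server started by the s3_base fixture is expected at 127.0.0.1:5000, the endpoint_url shown later in this traceback:

import socket

# connect_ex returns 0 when something accepts the connection, otherwise an
# errno such as 111 (connection refused), matching the failure above.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    result = sock.connect_ex(("127.0.0.1", 5000))

print("mock S3 endpoint reachable" if result == 0 else f"connect failed with errno {result}")
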

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/init.py", line 4, in
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/init.py", line 8, in
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: init() got an unexpected keyword argument 'unbound_message'
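
The captured stderr above points at the root cause of the test_s3_dataset failures: the mock moto_server process never started, so nothing was listening on http://127.0.0.1:5000/ and every botocore request was refused. The TypeError looks like a Flask/Werkzeug version mismatch in the CI image (newer Flask passes unbound_message to werkzeug.local.LocalProxy, which older Werkzeug releases do not accept); that interpretation is an assumption based only on this traceback. A minimal sketch for checking the versions actually installed in the image:

# Hedged diagnostic sketch: print the installed versions of the packages
# implicated in the traceback above. The Flask/Werkzeug mismatch is an
# assumption, not something confirmed elsewhere in this log.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("flask", "werkzeug", "moto"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")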
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
_stacktrace = <traceback object at 0x7f609855e040>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f60c050eee0>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
_______________________ test_drop_low_cardinality[True] ________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_True0')
cpu = True

@pytest.mark.parametrize("cpu", _CPU)
def test_drop_low_cardinality(tmpdir, cpu):
    df = pd.DataFrame()
    if not cpu:
        df = cudf.DataFrame(df)

    df["col1"] = ["a", "a", "a", "a", "a"]
    df["col2"] = ["a", "a", "a", "a", "b"]
    df["col3"] = ["a", "a", "b", "b", "c"]

    features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality()

    workflow = nvt.Workflow(features)
    transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
  assert workflow.output_schema.column_names == ["col2", "col3"]

E AssertionError: assert ['col3'] == ['col2', 'col3']
E At index 0 diff: 'col3' != 'col2'
E Right contains one more item: 'col3'
E Full diff:
E - ['col2', 'col3']
E + ['col3']

tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
_______________________ test_drop_low_cardinality[False] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_Fals0')
cpu = False

@pytest.mark.parametrize("cpu", _CPU)
def test_drop_low_cardinality(tmpdir, cpu):
    df = pd.DataFrame()
    if not cpu:
        df = cudf.DataFrame(df)

    df["col1"] = ["a", "a", "a", "a", "a"]
    df["col2"] = ["a", "a", "a", "a", "b"]
    df["col3"] = ["a", "a", "b", "b", "c"]

    features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality()

    workflow = nvt.Workflow(features)
    transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
  assert workflow.output_schema.column_names == ["col2", "col3"]

E AssertionError: assert ['col3'] == ['col2', 'col3']
E At index 0 diff: 'col3' != 'col2'
E Right contains one more item: 'col3'
E Full diff:
E - ['col2', 'col3']
E + ['col3']

tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
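
Unlike the S3 failures, these two assertion errors look like a direct consequence of this PR's change to the Categorify domain max rather than an environment issue. col2 has two observed values plus the reserved out-of-vocabulary index 0, so its int_domain.max drops from 3 to 2 after the fix; if DropLowCardinality thresholds on that max (an assumption about the op's internals based only on this log), col2 now lands at the cutoff and is dropped alongside col1, which is why the output schema contains only col3. A minimal sketch of the before/after bookkeeping, using the unique-value counts from the test data above:

# Hedged sketch of the values behind the assertion failure. The unique-value
# counts come from the test DataFrame above; treating domain max as the
# quantity DropLowCardinality compares against is an assumption for illustration.
unique_values = {"col1": 1, "col2": 2, "col3": 3}

for col, n_unique in unique_values.items():
    cardinality = n_unique + 1       # +1 for the reserved OOV index 0
    max_before_pr = cardinality      # old behaviour: max == cardinality
    max_after_pr = cardinality - 1   # new behaviour: max == cardinality - 1
    print(col, cardinality, max_before_pr, max_after_pr)

Under that reading, either the op should work from the cardinality (max + 1) or the test expectation needs updating to match the corrected schema.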
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f5ff87cc1c0>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
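
The ArrowInvalid failure above (and the similar test_cpu_workflow failures that follow) is a different symptom: pyarrow rejects the files written by workflow.transform(dataset).to_parquet(...) because the Parquet footer magic is missing. The root cause is not visible in this log. A minimal inspection sketch, relying only on the fact that a valid Parquet file begins and ends with the 4-byte magic b"PAR1"; the path is copied from the error message and should be substituted as needed:

# Hedged debugging sketch: check whether a written part file carries the
# Parquet magic bytes at both ends. Use any part_*.parquet path reported
# in the failure; the one below is taken from the message above.
path = "/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet"

with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)   # 2 == os.SEEK_END: jump to 4 bytes before EOF
    tail = f.read(4)

print("header magic:", head, "footer magic:", tail)
print("looks like parquet:", head == b"PAR1" and tail == b"PAR1")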
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f601464dbe0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f60642f3100>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014783a30>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014644790>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014744df0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[True]
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[False]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 19 failed, 1412 passed, 1 skipped, 617 warnings in 747.37s (0:12:27) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins11395841751843227978.sh
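
Note on the recurring failure mode: each ArrowInvalid above reports that the Parquet magic bytes are missing from the footer of a freshly written part_*.parquet file, i.e. the file on disk is not a complete Parquet file. As a minimal diagnostic sketch (not part of NVTabular or the CI scripts; the helper name and the glob pattern are illustrative assumptions), a complete Parquet file begins and ends with the 4-byte marker PAR1, so the written parts can be checked directly:

import glob
import os


def has_parquet_magic(path: str) -> bool:
    """Return True if `path` starts and ends with the Parquet magic bytes b"PAR1"."""
    if os.path.getsize(path) < 8:
        # Too small to hold both the leading magic and the footer magic.
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"


# Example usage against the temporary output directories from the logs above.
for part in sorted(glob.glob("/tmp/pytest-of-jenkins/pytest-*/*/part_*.parquet")):
    print(part, "ok" if has_parquet_magic(part) else "missing PAR1 magic")
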

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 729eb88f3ebd2064c0eea2acb040ed23aa0e5191, no merge conflicts.
Running as SYSTEM
Setting status of 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4616/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 729eb88f3ebd2064c0eea2acb040ed23aa0e5191^{commit} # timeout=10
Checking out Revision 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 # timeout=10
Commit message: "Update `DropLowCardinality` to handle changes to `Categorify` domain"
 > git rev-list --no-walk 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins1109161135988901750.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ............................F..F......F..F.. [ 3%]
..................................................................F.F... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
________ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1] ________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:35,353 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2ff0376c0374f06523b9f25395b72dfc', 1)
Function: subgraph_callable-0d5ad759-7370-49ea-a9f7-33f00b22
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__________ test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:37,385 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a5a644fc8c79cdf9ae2635ed2b300f6c', 1)
Function: subgraph_callable-62f18cdd-3485-404a-8218-65bd48c6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:41,594 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-e664934af21c7d272636a2d73892785d', 0)
Function: subgraph_callable-d16b9c8a-2683-4a79-84d1-534bcf89
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:43,398 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-94ec7dad78e51dd2f6113a2a4ddd9178', 0)
Function: subgraph_callable-30db34bb-e4a3-4f14-af64-4e18b807
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:37:28,440 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 10)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:37:28,445 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 11)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:37:28,450 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 15)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

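As a follow-up sketch, each piece reported in the worker warnings above can be opened directly with pyarrow, which distinguishes a missing footer from a single unreadable row group; the paths are copied from the warnings and would need to exist locally.

import pyarrow.parquet as pq

parts = [
    "/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet",
    "/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet",
]

for part in parts:
    try:
        pf = pq.ParquetFile(part)
        print(part, "row groups:", pf.num_row_groups)
    except Exception as exc:  # a truncated footer raises pyarrow.lib.ArrowInvalid at open time
        print(part, "failed:", exc)
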
________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:37:29,740 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 12)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:37:29,741 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 13)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:37:29,746 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 15)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

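A sketch of a cross-check under the same assumptions as above (output_path is the processed directory written by the test; the path below is a hypothetical placeholder): reading the output back through merlin.io.Dataset instead of dask.dataframe should fail the same way if the part files are genuinely truncated, which rules out a reader-side difference.

from merlin.io import Dataset

output_path = "/path/to/processed"  # hypothetical; the tmpdir written by the test above

ddf = Dataset(output_path, engine="parquet").to_ddf()  # lazy dask (or dask-cuDF) collection
print(ddf.head())
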
___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
_stacktrace = <traceback object at 0x7fe457dc5b00>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe457beb2e0>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in <module>
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
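
The captured setup stderr shows that moto_server never started (the LocalProxy TypeError points to a flask/werkzeug version mismatch), so every request to the mock endpoint at 127.0.0.1:5000 is refused before any S3 logic runs. A small probe like this sketch, using the host and port from the s3_base fixture above, would confirm whether anything is listening before the dataset tests attempt to create buckets.

import socket

def endpoint_is_up(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if something accepts TCP connections on the mock S3 endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("mock S3 endpoint reachable:", endpoint_is_up())
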
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
_stacktrace = <traceback object at 0x7fe457dbb880>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
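
The increment() docstring above describes the behaviour that matters for this failure: botocore calls urlopen(..., retries=Retry(False)), and with total=False any connection error is re-raised immediately instead of being retried by urllib3 (botocore runs its own retry loop, which is why the request headers show amz-sdk-request: attempt=5; max=5). A tiny sketch of that code path using only the urllib3 API quoted above; the pool argument and message are placeholders:

from urllib3.util.retry import Retry
from urllib3.exceptions import NewConnectionError

retries = Retry(False)  # retries disabled: errors are re-raised immediately
try:
    retries.increment(
        method="PUT",
        url="/csv",
        error=NewConnectionError(None, "Failed to establish a new connection"),
    )
except NewConnectionError as exc:
    print("re-raised without retrying:", exc)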


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48d102f70>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
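
Both test_s3_dataset parametrizations fail the same way: nothing is listening on the mock S3 endpoint http://127.0.0.1:5000/, most likely because the moto server could not start given the Flask/Werkzeug import error at the top of this log, so the fixture's very first create_bucket PUT is refused before any NVTabular code runs. A hedged pre-check that separates 'moto server never started' from a genuine dataset problem; the endpoint comes from the s3so fixture above, and the bucket name and credentials are illustrative:

import socket

import boto3

endpoint = "http://127.0.0.1:5000"

# 1) Is anything accepting connections on the moto port at all?
try:
    socket.create_connection(("127.0.0.1", 5000), timeout=2).close()
    print("port 5000 is open")
except OSError as exc:
    print("no server on port 5000:", exc)

# 2) If the port is open, the same call the fixture makes should succeed.
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id="testing",      # moto generally accepts any credentials
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="csv", ACL="public-read-write")
print(s3.list_buckets()["Buckets"])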
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe42818a0a0>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
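
All of the test_cpu_workflow failures below end in the same ArrowInvalid: part_0.parquet exists but has no Parquet footer magic, which usually means the preceding to_parquet write produced an empty or truncated file rather than the path pointing at a different format. A small hedged check for a suspect output file; the path is the one from the failure above, and a valid Parquet file starts and ends with the 4-byte magic b"PAR1":

import os

import pyarrow.parquet as pq

path = "/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet"

size = os.path.getsize(path)
with open(path, "rb") as f:
    head = f.read(4)
    f.seek(max(size - 4, 0))
    tail = f.read(4)

print("size:", size, "head:", head, "tail:", tail)

if head == b"PAR1" and tail == b"PAR1":
    # Footer present: inspect the schema and row groups.
    print(pq.read_metadata(path))
else:
    print("empty/truncated or non-Parquet output from the workflow write")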
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe40c68ee50>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe45514a1c0>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe4890b4160>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe40c2770d0>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe4280583d0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 14 failed, 1417 passed, 1 skipped, 617 warnings in 722.15s (0:12:02) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins2653689168443043854.sh
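
The recurring ArrowInvalid above ("Parquet magic bytes not found in footer") suggests the to_parquet step in these CPU runs produced empty or truncated part files, rather than a schema problem. A valid parquet file starts and ends with the 4-byte magic PAR1, so one quick way to inspect the written files is a check like the following sketch (the glob pattern is hypothetical and only mirrors the tmpdir layout in the log):

import glob
import os

def has_parquet_magic(path: str) -> bool:
    # A well-formed parquet file begins and ends with the 4-byte magic b"PAR1".
    if os.path.getsize(path) < 8:
        return False
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, os.SEEK_END)
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"

# Hypothetical pattern matching the pytest tmpdir layout shown in the traceback.
for part in sorted(glob.glob("/tmp/pytest-of-jenkins/pytest-18/*/part_*.parquet")):
    print(part, "ok" if has_parquet_magic(part) else "missing PAR1 magic")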

@github-actions

github-actions bot commented Aug 9, 2022

Documentation preview

https://nvidia-merlin.github.io/NVTabular/review/pr-1641

@oliverholworthy
Member Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4629/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins5697026500764221364.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py Build timed out (after 60 minutes). Marking the build as failed.
Build was aborted
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins6758279656828584530.sh

@karlhigley
Contributor

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4630/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins11669948025439148038.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
..................................................F [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
___________________________ test_multihot_empty_rows ___________________________

def test_multihot_empty_rows():
    multi_hot = tf.feature_column.categorical_column_with_identity("multihot", 5)
    multi_hot_embedding = tf.feature_column.embedding_column(multi_hot, 8, combiner="sum")

    embedding_layer = layers.DenseFeatures([multi_hot_embedding])
    inputs = {
        "multihot": (
            tf.keras.Input(name="multihot__values", shape=(1,), dtype=tf.int64),
            tf.keras.Input(name="multihot__nnzs", shape=(1,), dtype=tf.int64),
        )
    }
    output = embedding_layer(inputs)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile("sgd", "binary_crossentropy")

    multi_hot_values = np.array([0, 2, 1, 4, 1, 3, 1])
    multi_hot_nnzs = np.array([1, 0, 2, 4, 0])
    x = {"multihot": (multi_hot_values[:, None], multi_hot_nnzs[:, None])}

    multi_hot_embedding_table = embedding_layer.embedding_tables["multihot"].numpy()
    multi_hot_embedding_rows = _compute_expected_multi_hot(
        multi_hot_embedding_table, multi_hot_values, multi_hot_nnzs, "sum"
    )

    y_hat = model(x).numpy()
  np.testing.assert_allclose(y_hat, multi_hot_embedding_rows, rtol=1e-06)

E AssertionError:
E Not equal to tolerance rtol=1e-06, atol=0
E
E Mismatched elements: 1 / 40 (2.5%)
E Max absolute difference: 1.1920929e-07
E Max relative difference: 1.502241e-06
E x: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 ,
E -0.253522, -0.074231],
E [ 0. , 0. , 0. , 0. , 0. , 0. ,...
E y: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 ,
E -0.253522, -0.074231],
E [ 0. , 0. , 0. , 0. , 0. , 0. ,...

tests/unit/framework_utils/test_tf_layers.py:321: AssertionError
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/framework_utils/test_tf_layers.py::test_multihot_empty_rows
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 722.63s (0:12:02) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4753025854975341293.sh
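
The single failure in this run, test_multihot_empty_rows, looks like a tolerance issue rather than a wrong embedding lookup: one of forty elements differs by roughly 1.2e-07 in absolute terms, which at that element's magnitude is a relative difference of about 1.5e-06, just over the test's rtol=1e-06. With atol=0, numpy.testing.assert_allclose requires |actual - desired| <= rtol * |desired|. A small sketch of the same effect, using made-up values rather than the real embedding table:

import numpy as np

desired = np.float32(0.074231)                # magnitude similar to the mismatched element
actual = desired + np.float32(1.1920929e-07)  # a float32 rounding-scale difference

try:
    np.testing.assert_allclose(actual, desired, rtol=1e-06)
except AssertionError:
    print("fails: relative difference of ~1.6e-06 exceeds rtol=1e-06")

np.testing.assert_allclose(actual, desired, rtol=1e-05)  # passes at a slightly looser tolerance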

@karlhigley
Contributor

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4631/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins9182884185066325902.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1429 passed, 2 skipped, 618 warnings in 709.60s (0:11:49) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins5371199130175045927.sh

@karlhigley karlhigley merged commit 934a326 into NVIDIA-Merlin:main Aug 15, 2022