
Update the Categorify operator to set the domain max correctly #1641

Merged
merged 3 commits on Aug 15, 2022

Conversation

oliverholworthy
Member

@oliverholworthy commented on Aug 9, 2022

Goal

Reduce the resulting int_domain.max property on a ColumnSchema by one after transforming with Categorify, so that it correctly matches the maximum encoded value in the data.
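
As a minimal sketch of the arithmetic behind this change (the helper below is hypothetical, not part of the Categorify internals): a column with n unique values is encoded as the integers {0, 1, ..., n}, where 0 is reserved for out-of-vocabulary, so the largest encoded id is the cardinality minus one.

def domain_bounds(n_unique_categories: int) -> dict:
    # unique values plus the reserved out-of-vocabulary index 0
    cardinality = n_unique_categories + 1
    # domain.max should be the largest id actually present in the encoded data
    return {"min": 0, "max": cardinality - 1}

print(domain_bounds(2))  # {'min': 0, 'max': 2} -- matches the two-row example below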

Motivation / Context

This PR was motivated by work on NVIDIA-Merlin/Merlin#479

We use domain.max to compute the vocabulary size / cardinality when creating embedding tables in Merlin Models. This off-by-one error causes confusion when trying to build embedding tables of the correct shape from pre-trained embedding data.
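
For context, the sketch below illustrates (it is not Merlin Models' actual implementation) how a consumer might size an embedding table from the schema; int_domain_max stands in for the properties.domain.max value on the encoded column.

import numpy as np

def build_embedding_table(int_domain_max: int, dim: int = 16) -> np.ndarray:
    # valid encoded ids are 0..int_domain_max, so the table needs
    # int_domain_max + 1 rows (i.e. the cardinality)
    vocab_size = int_domain_max + 1
    return np.zeros((vocab_size, dim))

# With the current (pre-fix) value of 3 this allocates 4 rows for a column whose
# ids only go up to 2; with the corrected value of 2 it allocates exactly 3 rows.
assert build_embedding_table(2).shape == (3, 16)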

Example

import uuid
import pandas as pd

from merlin.io import Dataset
from nvtabular import Workflow
from nvtabular.ops import Categorify

df = pd.DataFrame({"id": [str(uuid.uuid4()) for _ in range(2)]})
dataset = Dataset(df)

dataset

                                     id
0  fc5f18c4-919f-4496-9209-1ae34aa4230d
1  738f873f-5fa7-4345-9daa-b1f714c9f1aa

dataset.schema

  name tags   dtype  is_list  is_ragged
0   id   ()  object    False      False

After the Categorify op, these ids are transformed to the integers {1, 2}, with 0 reserved for out-of-vocabulary values. So we have a cardinality of 3 (including the zero).

workflow = Workflow(["id"] >> Categorify())
transformed_dataset = workflow.fit_transform(dataset)

transformed_dataset:

   id
0   2
1   1

transformed_dataset.schema

name:                                    id
tags:                                    (Tags.CATEGORICAL)
dtype:                                   int64
is_list:                                 False
is_ragged:                               False
properties.num_buckets:                  None
properties.freq_threshold:               0
properties.max_size:                     0
properties.start_index:                  0
properties.cat_path:                     .//categories/unique.id.parquet
properties.domain.min:                   0
properties.domain.max:                   3
properties.domain.name:                  id
properties.embedding_sizes.cardinality:  3
properties.embedding_sizes.dimension:    16

With the current implementation, the int_domain.max value after the transform in this example is 3, which is the same as the cardinality. The maximum integer value actually present in the transformed data, however, is one less than the cardinality: 2.
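
As an illustrative check of the behaviour this PR targets (the accessors below assume the nested property layout shown in the schema output above and may differ from the exact schema API):

id_col = transformed_dataset.schema.column_schemas["id"]
domain = id_col.properties["domain"]
cardinality = id_col.properties["embedding_sizes"]["cardinality"]

# The largest encoded id in this example is 2, so after this change domain["max"]
# should equal cardinality - 1 rather than the cardinality itself.
assert domain["max"] == cardinality - 1 == 2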

@oliverholworthy added the bug (Something isn't working) label on Aug 9, 2022
@oliverholworthy added this to the Merlin 22.08 milestone on Aug 9, 2022
@oliverholworthy self-assigned this on Aug 9, 2022
@nvidia-merlin-bot
Contributor

CI Results
GitHub pull request #1641 of commit 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4, no merge conflicts.
Running as SYSTEM
Setting status of 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4615/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4^{commit} # timeout=10
Checking out Revision 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10
Commit message: "Update `Categorify` operator to set the domain max correctly"
 > git rev-list --no-walk c2a5b743c7a0b458be7af4ca96da091887a044b9 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins14816013642204511087.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F..F..........FF.F [ 3%]
...F..............................................................FFF... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py FF [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:30,251 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-1852321342b43b35c1c4d664628b409a', 0)
Function: subgraph_callable-11603efe-dc29-4a19-8a57-01e58196
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:32,258 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-7dbace1faf4aac58bf4b5a9808158f3e', 1)
Function: subgraph_callable-ddf3578b-55e1-4bac-ad05-4a5c5fa7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_______ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1] ________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:38,567 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-60a11de382c15140e4b6043c3fd932a0', 0)
Function: subgraph_callable-5674b3d5-3fa9-4e20-8649-a7fb3f72
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:39,541 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c58c6d8297d1363e166bbae0ba2b7cbc', 0)
Function: subgraph_callable-d0c1022a-5b14-4dfd-96bc-4d734a7f
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_________ test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:41,172 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4bd0270b5198abc9c84600ff73978b63', 0)
Function: subgraph_callable-f60f3e2c-6e85-43cb-b01b-c5c57c37
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:18:43,498 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4b353e97dab5f3f10917661932e5e9dc', 0)
Function: subgraph_callable-6a1bdeef-16c0-4d80-92c5-33cc15c9
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:19:24,201 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 14)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:19:24,204 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 13)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:24,205 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 15)
Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:19:25,124 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 13)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,128 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 2)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 12)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 14)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,135 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 0)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,136 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 1)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 11)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 10)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,138 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 15)
Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:19:25,811 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 22)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,813 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 20)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,815 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 21)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 16)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 18)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,818 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 17)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,824 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 19)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,837 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 26)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,842 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 28)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,845 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 24)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,847 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 23)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,849 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 27)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,852 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 25)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,864 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 30)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,868 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 29)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:19:25,870 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 31)
Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
_stacktrace = <traceback object at 0x7f60989d9c40>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609aa49100>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
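The [Errno 111] Connection refused above means nothing was listening on the mock S3 endpoint when the test tried to create the bucket; botocore then re-raises it as the EndpointConnectionError handled in the chained traceback below. A minimal connectivity check (not part of the test suite), assuming the moto/mock S3 server started by the s3_base fixture is expected at 127.0.0.1:5000, the endpoint_url shown later in this traceback:

import socket

# connect_ex returns 0 when something accepts the connection, otherwise an
# errno such as 111 (connection refused), matching the failure above.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    result = sock.connect_ex(("127.0.0.1", 5000))

print("mock S3 endpoint reachable" if result == 0 else f"connect failed with errno {result}")
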

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/init.py", line 4, in
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/init.py", line 8, in
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: init() got an unexpected keyword argument 'unbound_message'
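
The captured stderr above points at the root cause of the test_s3_dataset failures: the mock moto_server process never started, so nothing was listening on http://127.0.0.1:5000/ and every botocore request was refused. The TypeError looks like a Flask/Werkzeug version mismatch in the CI image (newer Flask passes unbound_message to werkzeug.local.LocalProxy, which older Werkzeug releases do not accept); that interpretation is an assumption based only on this traceback. A minimal sketch for checking the versions actually installed in the image:

# Hedged diagnostic sketch: print the installed versions of the packages
# implicated in the traceback above. The Flask/Werkzeug mismatch is an
# assumption, not something confirmed elsewhere in this log.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("flask", "werkzeug", "moto"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")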
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
_stacktrace = <traceback object at 0x7f609855e040>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f6186e61220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f60c050eee0>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
_______________________ test_drop_low_cardinality[True] ________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_True0')
cpu = True

@pytest.mark.parametrize("cpu", _CPU)
def test_drop_low_cardinality(tmpdir, cpu):
    df = pd.DataFrame()
    if not cpu:
        df = cudf.DataFrame(df)

    df["col1"] = ["a", "a", "a", "a", "a"]
    df["col2"] = ["a", "a", "a", "a", "b"]
    df["col3"] = ["a", "a", "b", "b", "c"]

    features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality()

    workflow = nvt.Workflow(features)
    transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
  assert workflow.output_schema.column_names == ["col2", "col3"]

E AssertionError: assert ['col3'] == ['col2', 'col3']
E At index 0 diff: 'col3' != 'col2'
E Right contains one more item: 'col3'
E Full diff:
E - ['col2', 'col3']
E + ['col3']

tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
_______________________ test_drop_low_cardinality[False] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_Fals0')
cpu = False

@pytest.mark.parametrize("cpu", _CPU)
def test_drop_low_cardinality(tmpdir, cpu):
    df = pd.DataFrame()
    if not cpu:
        df = cudf.DataFrame(df)

    df["col1"] = ["a", "a", "a", "a", "a"]
    df["col2"] = ["a", "a", "a", "a", "b"]
    df["col3"] = ["a", "a", "b", "b", "c"]

    features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality()

    workflow = nvt.Workflow(features)
    transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
  assert workflow.output_schema.column_names == ["col2", "col3"]

E AssertionError: assert ['col3'] == ['col2', 'col3']
E At index 0 diff: 'col3' != 'col2'
E Right contains one more item: 'col3'
E Full diff:
E - ['col2', 'col3']
E + ['col3']

tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
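
Unlike the S3 failures, these two assertion errors look like a direct consequence of this PR's change to the Categorify domain max rather than an environment issue. col2 has two observed values plus the reserved out-of-vocabulary index 0, so its int_domain.max drops from 3 to 2 after the fix; if DropLowCardinality thresholds on that max (an assumption about the op's internals based only on this log), col2 now lands at the cutoff and is dropped alongside col1, which is why the output schema contains only col3. A minimal sketch of the before/after bookkeeping, using the unique-value counts from the test data above:

# Hedged sketch of the values behind the assertion failure. The unique-value
# counts come from the test DataFrame above; treating domain max as the
# quantity DropLowCardinality compares against is an assumption for illustration.
unique_values = {"col1": 1, "col2": 2, "col3": 3}

for col, n_unique in unique_values.items():
    cardinality = n_unique + 1       # +1 for the reserved OOV index 0
    max_before_pr = cardinality      # old behaviour: max == cardinality
    max_after_pr = cardinality - 1   # new behaviour: max == cardinality - 1
    print(col, cardinality, max_before_pr, max_after_pr)

Under that reading, either the op should work from the cardinality (max + 1) or the test expectation needs updating to match the corrected schema.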
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f5ff87cc1c0>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
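
The ArrowInvalid failure above (and the similar test_cpu_workflow failures that follow) is a different symptom: pyarrow rejects the files written by workflow.transform(dataset).to_parquet(...) because the Parquet footer magic is missing. The root cause is not visible in this log. A minimal inspection sketch, relying only on the fact that a valid Parquet file begins and ends with the 4-byte magic b"PAR1"; the path is copied from the error message and should be substituted as needed:

# Hedged debugging sketch: check whether a written part file carries the
# Parquet magic bytes at both ends. Use any part_*.parquet path reported
# in the failure; the one below is taken from the message above.
path = "/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet"

with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)   # 2 == os.SEEK_END: jump to 4 bytes before EOF
    tail = f.read(4)

print("header magic:", head, "footer magic:", tail)
print("looks like parquet:", head == b"PAR1" and tail == b"PAR1")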
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f601464dbe0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f60642f3100>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Ingrid Hannah 1031 999 -0.076963 0.314008
...la 1062 1029 0.995636 0.555042
4320 Charlie Dan 992 976 -0.958343 0.245327

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014783a30>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014644790>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Hannah 1031 999 -0.076963 0.314008
1 Sarah ... Ursula 1062 1029 0.995636 0.555042
2160 Dan 992 976 -0.958343 0.245327

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f6014744df0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[True]
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[False]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 19 failed, 1412 passed, 1 skipped, 617 warnings in 747.37s (0:12:27) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins11395841751843227978.sh
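
Note on the recurring failure mode: each ArrowInvalid above reports that the Parquet magic bytes are missing from the footer of a freshly written part_*.parquet file, i.e. the file on disk is not a complete Parquet file. As a minimal diagnostic sketch (not part of NVTabular or the CI scripts; the helper name and the glob pattern are illustrative assumptions), a complete Parquet file begins and ends with the 4-byte marker PAR1, so the written parts can be checked directly:

import glob
import os


def has_parquet_magic(path: str) -> bool:
    """Return True if `path` starts and ends with the Parquet magic bytes b"PAR1"."""
    if os.path.getsize(path) < 8:
        # Too small to hold both the leading magic and the footer magic.
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"


# Example usage against the temporary output directories from the logs above.
for part in sorted(glob.glob("/tmp/pytest-of-jenkins/pytest-*/*/part_*.parquet")):
    print(part, "ok" if has_parquet_magic(part) else "missing PAR1 magic")
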

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 729eb88f3ebd2064c0eea2acb040ed23aa0e5191, no merge conflicts.
Running as SYSTEM
Setting status of 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4616/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 729eb88f3ebd2064c0eea2acb040ed23aa0e5191^{commit} # timeout=10
Checking out Revision 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 # timeout=10
Commit message: "Update `DropLowCardinality` to handle changes to `Categorify` domain"
 > git rev-list --no-walk 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins1109161135988901750.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ............................F..F......F..F.. [ 3%]
..................................................................F.F... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
________ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1] ________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:35,353 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2ff0376c0374f06523b9f25395b72dfc', 1)
Function: subgraph_callable-0d5ad759-7370-49ea-a9f7-33f00b22
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__________ test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:37,385 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a5a644fc8c79cdf9ae2635ed2b300f6c', 1)
Function: subgraph_callable-62f18cdd-3485-404a-8218-65bd48c6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:41,594 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-e664934af21c7d272636a2d73892785d', 0)
Function: subgraph_callable-d16b9c8a-2683-4a79-84d1-534bcf89
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:36:43,398 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-94ec7dad78e51dd2f6113a2a4ddd9178', 0)
Function: subgraph_callable-30db34bb-e4a3-4f14-af64-4e18b807
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:37:28,440 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 10)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-09 14:37:28,445 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 11)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:37:28,450 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 15)
Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

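As a follow-up sketch, each piece reported in the worker warnings above can be opened directly with pyarrow, which distinguishes a missing footer from a single unreadable row group; the paths are copied from the warnings and would need to exist locally.

import pyarrow.parquet as pq

parts = [
    "/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet",
    "/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet",
]

for part in parts:
    try:
        pf = pq.ParquetFile(part)
        print(part, "row groups:", pf.num_row_groups)
    except Exception as exc:  # a truncated footer raises pyarrow.lib.ArrowInvalid at open time
        print(part, "failed:", exc)
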
________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:37:29,740 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 12)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:37:29,741 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 13)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-09 14:37:29,746 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 15)
Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

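A sketch of a cross-check under the same assumptions as above (output_path is the processed directory written by the test; the path below is a hypothetical placeholder): reading the output back through merlin.io.Dataset instead of dask.dataframe should fail the same way if the part files are genuinely truncated, which rules out a reader-side difference.

from merlin.io import Dataset

output_path = "/path/to/processed"  # hypothetical; the tmpdir written by the test above

ddf = Dataset(output_path, engine="parquet").to_ddf()  # lazy dask (or dask-cuDF) collection
print(ddf.head())
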
___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
_stacktrace = <traceback object at 0x7fe457dc5b00>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe457beb2e0>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in <module>
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
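
The captured setup stderr shows that moto_server never started (the LocalProxy TypeError points to a flask/werkzeug version mismatch), so every request to the mock endpoint at 127.0.0.1:5000 is refused before any S3 logic runs. A small probe like this sketch, using the host and port from the s3_base fixture above, would confirm whether anything is listening before the dataset tests attempt to create buckets.

import socket

def endpoint_is_up(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if something accepts TCP connections on the mock S3 endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("mock S3 endpoint reachable:", endpoint_is_up())
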
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
_stacktrace = <traceback object at 0x7fe457dbb880>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
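
The increment() docstring above describes the behaviour that matters for this failure: botocore calls urlopen(..., retries=Retry(False)), and with total=False any connection error is re-raised immediately instead of being retried by urllib3 (botocore runs its own retry loop, which is why the request headers show amz-sdk-request: attempt=5; max=5). A tiny sketch of that code path using only the urllib3 API quoted above; the pool argument and message are placeholders:

from urllib3.util.retry import Retry
from urllib3.exceptions import NewConnectionError

retries = Retry(False)  # retries disabled: errors are re-raised immediately
try:
    retries.increment(
        method="PUT",
        url="/csv",
        error=NewConnectionError(None, "Failed to establish a new connection"),
    )
except NewConnectionError as exc:
    print("re-raised without retrying:", exc)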


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe561827220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48d102f70>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
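
Both test_s3_dataset parametrizations fail the same way: nothing is listening on the mock S3 endpoint http://127.0.0.1:5000/, most likely because the moto server could not start given the Flask/Werkzeug import error at the top of this log, so the fixture's very first create_bucket PUT is refused before any NVTabular code runs. A hedged pre-check that separates 'moto server never started' from a genuine dataset problem; the endpoint comes from the s3so fixture above, and the bucket name and credentials are illustrative:

import socket

import boto3

endpoint = "http://127.0.0.1:5000"

# 1) Is anything accepting connections on the moto port at all?
try:
    socket.create_connection(("127.0.0.1", 5000), timeout=2).close()
    print("port 5000 is open")
except OSError as exc:
    print("no server on port 5000:", exc)

# 2) If the port is open, the same call the fixture makes should succeed.
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id="testing",      # moto generally accepts any credentials
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="csv", ACL="public-read-write")
print(s3.list_buckets()["Buckets"])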
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe42818a0a0>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
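
All of the test_cpu_workflow failures below end in the same ArrowInvalid: part_0.parquet exists but has no Parquet footer magic, which usually means the preceding to_parquet write produced an empty or truncated file rather than the path pointing at a different format. A small hedged check for a suspect output file; the path is the one from the failure above, and a valid Parquet file starts and ends with the 4-byte magic b"PAR1":

import os

import pyarrow.parquet as pq

path = "/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet"

size = os.path.getsize(path)
with open(path, "rb") as f:
    head = f.read(4)
    f.seek(max(size - 4, 0))
    tail = f.read(4)

print("size:", size, "head:", head, "tail:", tail)

if head == b"PAR1" and tail == b"PAR1":
    # Footer present: inspect the schema and row groups.
    print(pq.read_metadata(path))
else:
    print("empty/truncated or non-Parquet output from the workflow write")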
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe40c68ee50>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe45514a1c0>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Alice Victor 973 995 -0.613973 -0.434246
...dy 964 1065 -0.263394 -0.013804
4320 Jerry Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe4890b4160>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe40c2770d0>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Victor 973 995 -0.613973 -0.434246
1 Bob ... Wendy 964 1065 -0.263394 -0.013804
2160 Ursula 970 1009 -0.394831 -0.651957

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe4280583d0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 14 failed, 1417 passed, 1 skipped, 617 warnings in 722.15s (0:12:02) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins2653689168443043854.sh
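
The recurring ArrowInvalid above ("Parquet magic bytes not found in footer") suggests the to_parquet step in these CPU runs produced empty or truncated part files, rather than a schema problem. A valid parquet file starts and ends with the 4-byte magic PAR1, so one quick way to inspect the written files is a check like the following sketch (the glob pattern is hypothetical and only mirrors the tmpdir layout in the log):

import glob
import os

def has_parquet_magic(path: str) -> bool:
    # A well-formed parquet file begins and ends with the 4-byte magic b"PAR1".
    if os.path.getsize(path) < 8:
        return False
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, os.SEEK_END)
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"

# Hypothetical pattern matching the pytest tmpdir layout shown in the traceback.
for part in sorted(glob.glob("/tmp/pytest-of-jenkins/pytest-18/*/part_*.parquet")):
    print(part, "ok" if has_parquet_magic(part) else "missing PAR1 magic")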

@github-actions

github-actions bot commented Aug 9, 2022

Documentation preview

https://nvidia-merlin.github.io/NVTabular/review/pr-1641

@oliverholworthy
Member Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4629/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins5697026500764221364.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py Build timed out (after 60 minutes). Marking the build as failed.
Build was aborted
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins6758279656828584530.sh

@karlhigley
Contributor

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4630/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins11669948025439148038.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
..................................................F [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
___________________________ test_multihot_empty_rows ___________________________

def test_multihot_empty_rows():
    multi_hot = tf.feature_column.categorical_column_with_identity("multihot", 5)
    multi_hot_embedding = tf.feature_column.embedding_column(multi_hot, 8, combiner="sum")

    embedding_layer = layers.DenseFeatures([multi_hot_embedding])
    inputs = {
        "multihot": (
            tf.keras.Input(name="multihot__values", shape=(1,), dtype=tf.int64),
            tf.keras.Input(name="multihot__nnzs", shape=(1,), dtype=tf.int64),
        )
    }
    output = embedding_layer(inputs)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile("sgd", "binary_crossentropy")

    multi_hot_values = np.array([0, 2, 1, 4, 1, 3, 1])
    multi_hot_nnzs = np.array([1, 0, 2, 4, 0])
    x = {"multihot": (multi_hot_values[:, None], multi_hot_nnzs[:, None])}

    multi_hot_embedding_table = embedding_layer.embedding_tables["multihot"].numpy()
    multi_hot_embedding_rows = _compute_expected_multi_hot(
        multi_hot_embedding_table, multi_hot_values, multi_hot_nnzs, "sum"
    )

    y_hat = model(x).numpy()
  np.testing.assert_allclose(y_hat, multi_hot_embedding_rows, rtol=1e-06)

E AssertionError:
E Not equal to tolerance rtol=1e-06, atol=0
E
E Mismatched elements: 1 / 40 (2.5%)
E Max absolute difference: 1.1920929e-07
E Max relative difference: 1.502241e-06
E x: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 ,
E -0.253522, -0.074231],
E [ 0. , 0. , 0. , 0. , 0. , 0. ,...
E y: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 ,
E -0.253522, -0.074231],
E [ 0. , 0. , 0. , 0. , 0. , 0. ,...

tests/unit/framework_utils/test_tf_layers.py:321: AssertionError
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/framework_utils/test_tf_layers.py::test_multihot_empty_rows
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 722.63s (0:12:02) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4753025854975341293.sh
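
The single failure in this run, test_multihot_empty_rows, looks like a tolerance issue rather than a wrong embedding lookup: one of forty elements differs by roughly 1.2e-07 in absolute terms, which at that element's magnitude is a relative difference of about 1.5e-06, just over the test's rtol=1e-06. With atol=0, numpy.testing.assert_allclose requires |actual - desired| <= rtol * |desired|. A small sketch of the same effect, using made-up values rather than the real embedding table:

import numpy as np

desired = np.float32(0.074231)                # magnitude similar to the mismatched element
actual = desired + np.float32(1.1920929e-07)  # a float32 rounding-scale difference

try:
    np.testing.assert_allclose(actual, desired, rtol=1e-06)
except AssertionError:
    print("fails: relative difference of ~1.6e-06 exceeds rtol=1e-06")

np.testing.assert_allclose(actual, desired, rtol=1e-05)  # passes at a slightly looser tolerance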

@karlhigley
Contributor

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts.
Running as SYSTEM
Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4631/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10
 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10
Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
Commit message: "Merge branch 'main' into categorify-domain-max"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins9182884185066325902.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1429 passed, 2 skipped, 618 warnings in 709.60s (0:11:49) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins5371199130175045927.sh

@karlhigley karlhigley merged commit 934a326 into NVIDIA-Merlin:main Aug 15, 2022