
Fix groupby on lists with cudf 22.06+ #1654

Merged

merged 3 commits into main from fix_groupby on Aug 23, 2022
Conversation

@benfred (Member) commented Aug 22, 2022

Groupby unittests are failing on cudf 22.06+ with an error like

```
FAILED tests/unit/ops/test_groupyby.py::test_groupby_op[id-True-False] - TypeError: 'NumericalColumn' object is not subscriptable
```

Fix.
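
For context on the fix: the `TypeError` above comes from the groupby op taking the first/last element of each aggregated list by subscripting the underlying cudf column, which cudf 22.06+ no longer supports. Below is a minimal sketch of the kind of version-tolerant helper involved, assuming a cudf list Series and offsets-based indexing; the helper name and internals are illustrative, not this PR's actual diff.

```python
# Sketch only (not this PR's actual diff): take the first element of every
# list in a cudf list Series without subscripting the child column, which
# raises "TypeError: 'NumericalColumn' object is not subscriptable" on 22.06+.
import cudf

def first_of_each_list(x: cudf.Series) -> cudf.Series:
    # Assumption about cudf internals: x._column.offsets marks where each
    # row's list begins in the flattened elements, so dropping the final
    # offset leaves one start index per row.
    starts = x._column.offsets.values[:-1]
    # x.list.leaves is the flattened element Series; fancy-indexing it with
    # the start offsets yields the per-row first element.
    return x.list.leaves.iloc[starts].reset_index(drop=True)

# e.g. first_of_each_list(cudf.Series([[1, 2], [3, 4, 5]])) -> [1, 3]
```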
@benfred benfred added the bug Something isn't working label Aug 22, 2022
@benfred benfred added this to the Merlin 22.08 milestone Aug 22, 2022
@nvidia-merlin-bot (Contributor)

CI Results
GitHub pull request #1654 of commit 07a9a2b80411d197d0c715b23a2aa7601859c267, no merge conflicts.
Running as SYSTEM
Setting status of 07a9a2b80411d197d0c715b23a2aa7601859c267 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4639/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1654/*:refs/remotes/origin/pr/1654/* # timeout=10
 > git rev-parse 07a9a2b80411d197d0c715b23a2aa7601859c267^{commit} # timeout=10
Checking out Revision 07a9a2b80411d197d0c715b23a2aa7601859c267 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 07a9a2b80411d197d0c715b23a2aa7601859c267 # timeout=10
Commit message: "Fix groupby on lists with cudf 22.06+"
 > git rev-list --no-walk 02a93eebfca6a825c00bf8c2d0b91863ec0150e4 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins17991720946688651629.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py F [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
_________________________________ test_tf4rec __________________________________

def test_tf4rec():
    inputs = {
        "user_session": np.random.randint(1, 10000, NUM_ROWS),
        "product_id": np.random.randint(1, 51996, NUM_ROWS),
        "category_id": np.random.randint(0, 332, NUM_ROWS),
        "event_time_ts": np.random.randint(1570373000, 1670373390, NUM_ROWS),
        "prod_first_event_time_ts": np.random.randint(1570373000, 1570373382, NUM_ROWS),
        "price": np.random.uniform(0, 2750, NUM_ROWS),
    }
    df = make_df(inputs)

    # categorify features

    cat_feats = (
        ["user_session", "product_id", "category_id"]
        >> nvt.ops.Categorify()
        >> nvt.ops.LambdaOp(lambda col: col + 1)
    )

    # create time features
    sessionTs = ["event_time_ts"]

    sessionTime = (
        sessionTs
        >> nvt.ops.LambdaOp(lambda col: to_datetime(col, unit="s"))
        >> nvt.ops.Rename(name="event_time_dt")
    )

    sessionTime_weekday = (
        sessionTime
        >> nvt.ops.LambdaOp(lambda col: col.dt.weekday)
        >> nvt.ops.Rename(name="et_dayofweek")
    )

    def get_cycled_feature_value_sin(col, max_value):
        value_scaled = (col + 0.000001) / max_value
        value_sin = np.sin(2 * np.pi * value_scaled)
        return value_sin

    def get_cycled_feature_value_cos(col, max_value):
        value_scaled = (col + 0.000001) / max_value
        value_cos = np.cos(2 * np.pi * value_scaled)
        return value_cos

    weekday_sin = (
        sessionTime_weekday
        >> (lambda col: get_cycled_feature_value_sin(col + 1, 7))
        >> nvt.ops.Rename(name="et_dayofweek_sin")
    )
    weekday_cos = (
        sessionTime_weekday
        >> (lambda col: get_cycled_feature_value_cos(col + 1, 7))
        >> nvt.ops.Rename(name="et_dayofweek_cos")
    )
    from nvtabular.ops import Operator

    # custom op for item recency
    class ItemRecency(Operator):
        def transform(self, columns, gdf):
            for column in columns.names:
                col = gdf[column]
                item_first_timestamp = gdf["prod_first_event_time_ts"]
                delta_days = (col - item_first_timestamp) / (60 * 60 * 24)
                gdf[column + "_age_days"] = delta_days * (delta_days >= 0)
            return gdf

        def compute_selector(
            self,
            input_schema: Schema,
            selector: ColumnSelector,
            parents_selector: ColumnSelector,
            dependencies_selector: ColumnSelector,
        ) -> ColumnSelector:
            self._validate_matching_cols(input_schema, parents_selector, "computing input selector")
            return parents_selector

        def column_mapping(self, col_selector):
            column_mapping = {}
            for col_name in col_selector.names:
                column_mapping[col_name + "_age_days"] = [col_name]
            return column_mapping

        @property
        def dependencies(self):
            return ["prod_first_event_time_ts"]

        @property
        def output_dtype(self):
            return np.float64

    recency_features = ["event_time_ts"] >> ItemRecency()
    recency_features_norm = (
        recency_features
        >> nvt.ops.LogOp()
        >> nvt.ops.Normalize()
        >> nvt.ops.Rename(name="product_recency_days_log_norm")
    )

    time_features = (
        sessionTime + sessionTime_weekday + weekday_sin + weekday_cos + recency_features_norm
    )

    # Smoothing price long-tailed distribution
    price_log = (
        ["price"] >> nvt.ops.LogOp() >> nvt.ops.Normalize() >> nvt.ops.Rename(name="price_log_norm")
    )

    # Relative Price to the average price for the category_id
    def relative_price_to_avg_categ(col, gdf):
        epsilon = 1e-5
        col = ((gdf["price"] - col) / (col + epsilon)) * (col > 0).astype(int)
        return col

    avg_category_id_pr = (
        ["category_id"]
        >> nvt.ops.JoinGroupby(cont_cols=["price"], stats=["mean"])
        >> nvt.ops.Rename(name="avg_category_id_price")
    )
    relative_price_to_avg_category = (
        avg_category_id_pr
        >> nvt.ops.LambdaOp(relative_price_to_avg_categ, dependency=["price"])
        >> nvt.ops.Rename(name="relative_price_to_avg_categ_id")
    )

    groupby_feats = (
        ["event_time_ts"] + cat_feats + time_features + price_log + relative_price_to_avg_category
    )

    # Define Groupby Workflow
    groupby_features = groupby_feats >> nvt.ops.Groupby(
        groupby_cols=["user_session"],
        sort_cols=["event_time_ts"],
        aggs={
            "product_id": ["list", "count"],
            "category_id": ["list"],
            "event_time_dt": ["first"],
            "et_dayofweek_sin": ["list"],
            "et_dayofweek_cos": ["list"],
            "price_log_norm": ["list"],
            "relative_price_to_avg_categ_id": ["list"],
            "product_recency_days_log_norm": ["list"],
        },
        name_sep="-",
    )

    SESSIONS_MAX_LENGTH = 20
    MINIMUM_SESSION_LENGTH = 2

    groupby_features_nonlist = groupby_features["user_session", "product_id-count"]

    groupby_features_list = groupby_features[
        "price_log_norm-list",
        "product_recency_days_log_norm-list",
        "et_dayofweek_sin-list",
        "et_dayofweek_cos-list",
        "product_id-list",
        "category_id-list",
        "relative_price_to_avg_categ_id-list",
    ]

    groupby_features_trim = (
        groupby_features_list
        >> nvt.ops.ListSlice(0, SESSIONS_MAX_LENGTH)
        >> nvt.ops.Rename(postfix="_seq")
    )

    # calculate session day index based on 'event_time_dt-first' column
    day_index = (
        (groupby_features["event_time_dt-first"])
        >> nvt.ops.LambdaOp(lambda col: (col - col.min()).dt.days + 1)
        >> nvt.ops.Rename(f=lambda col: "day_index")
    )

    selected_features = groupby_features_nonlist + groupby_features_trim + day_index

    filtered_sessions = selected_features >> nvt.ops.Filter(
        f=lambda df: df["product_id-count"] >= MINIMUM_SESSION_LENGTH
    )

    dataset = nvt.Dataset(df)

    workflow = nvt.Workflow(filtered_sessions)
>       workflow.fit(dataset)

tests/unit/test_tf4rec.py:198:


nvtabular/workflow/workflow.py:209: in fit
self._transform_impl(dataset, capture_dtypes=True).sample_dtypes()
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:1147: in sample_dtypes
_real_meta = self.engine.sample_data(n=n)
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset_engine.py:71: in sample_data
_head = _ddf.partitions[partition_index].head(n)
/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py:1140: in head
return self._head(n=n, npartitions=npartitions, compute=compute, safe=safe)
/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py:1174: in _head
result = result.compute()
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/utils.py:40: in apply
return func(*args, **kwargs)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:85: in apply
output_df = node.op.transform(selection, input_df)
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
nvtabular/ops/groupby.py:132: in transform
new_df = _apply_aggs(
nvtabular/ops/groupby.py:245: in _apply_aggs
df[f"{col}{name_sep}{_agg}"] = _first_or_last(
nvtabular/ops/groupby.py:289: in _first_or_last
return _first(x)
nvtabular/ops/groupby.py:302: in _first
elements = x.list._column.elements.values


self = <cudf.core.column.datetime.DatetimeColumn object at 0x7f0e9286adc0>
[
2020-04-13 03:23:42,
2020-07-07 09:28:00,
...3:02,
2022-11-15 19:12:56,
2021-11-22 15:34:01,
2020-04-14 20:22:53,
2020-03-15 06:11:35
]
dtype: datetime64[s]

@property
def values(self):
    """
    Return a CuPy representation of the DateTimeColumn.
    """
>       raise NotImplementedError(
        "DateTime Arrays is not yet implemented in cudf"
    )

E NotImplementedError: DateTime Arrays is not yet implemented in cudf

/usr/local/lib/python3.8/dist-packages/cudf/core/column/datetime.py:210: NotImplementedError
------------------------------ Captured log call -------------------------------
ERROR nvtabular:executor.py:111 Failed to transform operator <nvtabular.ops.groupby.Groupby object at 0x7f0e9280a490>
Traceback (most recent call last):
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/workflow/executor.py", line 85, in apply
output_df = node.op.transform(selection, input_df)
File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 132, in transform
new_df = _apply_aggs(
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 245, in _apply_aggs
df[f"{col}{name_sep}{_agg}"] = _first_or_last(
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 289, in _first_or_last
return _first(x)
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 302, in _first
elements = x.list._column.elements.values
File "/usr/local/lib/python3.8/dist-packages/cudf/core/column/datetime.py", line 210, in values
raise NotImplementedError(
NotImplementedError: DateTime Arrays is not yet implemented in cudf
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_tf4rec.py::test_tf4rec - NotImplementedError: DateTime...
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 709.98s (0:11:49) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins16832703381548470182.sh
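
The failure above originates in `_first` (nvtabular/ops/groupby.py:302), which asks for `.values` on a cudf `DatetimeColumn`; cudf cannot build a CuPy array from datetimes, hence the `NotImplementedError`. A hedged sketch of one possible workaround, reusing the offsets idea from the sketch above and going through an integer view of the datetimes (the helper name is illustrative, not the commit's actual code):

```python
# Sketch only: per-row first elements from a list-of-datetimes cudf Series.
# cudf cannot materialize a CuPy array of datetimes, but the same data viewed
# as int64 (epoch ticks) converts fine and can be cast back afterwards.
import cudf

def first_datetime(x: cudf.Series) -> cudf.Series:
    leaves = x.list.leaves                    # flattened datetime elements
    element_dtype = leaves.dtype              # e.g. datetime64[s]
    as_ints = leaves.astype("int64")          # datetimes -> integer ticks
    starts = x._column.offsets.values[:-1]    # start index of each row's list
    firsts = as_ints.iloc[starts].reset_index(drop=True)
    return firsts.astype(element_dtype)       # restore the datetime dtype
```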

@nvidia-merlin-bot (Contributor)

CI Results
GitHub pull request #1654 of commit fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc, no merge conflicts.
Running as SYSTEM
Setting status of fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4643/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1654/*:refs/remotes/origin/pr/1654/* # timeout=10
 > git rev-parse fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc^{commit} # timeout=10
Checking out Revision fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc # timeout=10
Commit message: "Merge branch 'main' into fix_groupby"
 > git rev-list --no-walk 852ee8f53df6c4c5aa10a0d98e293cbd30e1bdef # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins8251550900149403624.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py F [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
_________________________________ test_tf4rec __________________________________

def test_tf4rec():
    inputs = {
        "user_session": np.random.randint(1, 10000, NUM_ROWS),
        "product_id": np.random.randint(1, 51996, NUM_ROWS),
        "category_id": np.random.randint(0, 332, NUM_ROWS),
        "event_time_ts": np.random.randint(1570373000, 1670373390, NUM_ROWS),
        "prod_first_event_time_ts": np.random.randint(1570373000, 1570373382, NUM_ROWS),
        "price": np.random.uniform(0, 2750, NUM_ROWS),
    }
    df = make_df(inputs)

    # categorify features

    cat_feats = (
        ["user_session", "product_id", "category_id"]
        >> nvt.ops.Categorify()
        >> nvt.ops.LambdaOp(lambda col: col + 1)
    )

    # create time features
    sessionTs = ["event_time_ts"]

    sessionTime = (
        sessionTs
        >> nvt.ops.LambdaOp(lambda col: to_datetime(col, unit="s"))
        >> nvt.ops.Rename(name="event_time_dt")
    )

    sessionTime_weekday = (
        sessionTime
        >> nvt.ops.LambdaOp(lambda col: col.dt.weekday)
        >> nvt.ops.Rename(name="et_dayofweek")
    )

    def get_cycled_feature_value_sin(col, max_value):
        value_scaled = (col + 0.000001) / max_value
        value_sin = np.sin(2 * np.pi * value_scaled)
        return value_sin

    def get_cycled_feature_value_cos(col, max_value):
        value_scaled = (col + 0.000001) / max_value
        value_cos = np.cos(2 * np.pi * value_scaled)
        return value_cos

    weekday_sin = (
        sessionTime_weekday
        >> (lambda col: get_cycled_feature_value_sin(col + 1, 7))
        >> nvt.ops.Rename(name="et_dayofweek_sin")
    )
    weekday_cos = (
        sessionTime_weekday
        >> (lambda col: get_cycled_feature_value_cos(col + 1, 7))
        >> nvt.ops.Rename(name="et_dayofweek_cos")
    )
    from nvtabular.ops import Operator

    # custom op for item recency
    class ItemRecency(Operator):
        def transform(self, columns, gdf):
            for column in columns.names:
                col = gdf[column]
                item_first_timestamp = gdf["prod_first_event_time_ts"]
                delta_days = (col - item_first_timestamp) / (60 * 60 * 24)
                gdf[column + "_age_days"] = delta_days * (delta_days >= 0)
            return gdf

        def compute_selector(
            self,
            input_schema: Schema,
            selector: ColumnSelector,
            parents_selector: ColumnSelector,
            dependencies_selector: ColumnSelector,
        ) -> ColumnSelector:
            self._validate_matching_cols(input_schema, parents_selector, "computing input selector")
            return parents_selector

        def column_mapping(self, col_selector):
            column_mapping = {}
            for col_name in col_selector.names:
                column_mapping[col_name + "_age_days"] = [col_name]
            return column_mapping

        @property
        def dependencies(self):
            return ["prod_first_event_time_ts"]

        @property
        def output_dtype(self):
            return np.float64

    recency_features = ["event_time_ts"] >> ItemRecency()
    recency_features_norm = (
        recency_features
        >> nvt.ops.LogOp()
        >> nvt.ops.Normalize()
        >> nvt.ops.Rename(name="product_recency_days_log_norm")
    )

    time_features = (
        sessionTime + sessionTime_weekday + weekday_sin + weekday_cos + recency_features_norm
    )

    # Smoothing price long-tailed distribution
    price_log = (
        ["price"] >> nvt.ops.LogOp() >> nvt.ops.Normalize() >> nvt.ops.Rename(name="price_log_norm")
    )

    # Relative Price to the average price for the category_id
    def relative_price_to_avg_categ(col, gdf):
        epsilon = 1e-5
        col = ((gdf["price"] - col) / (col + epsilon)) * (col > 0).astype(int)
        return col

    avg_category_id_pr = (
        ["category_id"]
        >> nvt.ops.JoinGroupby(cont_cols=["price"], stats=["mean"])
        >> nvt.ops.Rename(name="avg_category_id_price")
    )
    relative_price_to_avg_category = (
        avg_category_id_pr
        >> nvt.ops.LambdaOp(relative_price_to_avg_categ, dependency=["price"])
        >> nvt.ops.Rename(name="relative_price_to_avg_categ_id")
    )

    groupby_feats = (
        ["event_time_ts"] + cat_feats + time_features + price_log + relative_price_to_avg_category
    )

    # Define Groupby Workflow
    groupby_features = groupby_feats >> nvt.ops.Groupby(
        groupby_cols=["user_session"],
        sort_cols=["event_time_ts"],
        aggs={
            "product_id": ["list", "count"],
            "category_id": ["list"],
            "event_time_dt": ["first"],
            "et_dayofweek_sin": ["list"],
            "et_dayofweek_cos": ["list"],
            "price_log_norm": ["list"],
            "relative_price_to_avg_categ_id": ["list"],
            "product_recency_days_log_norm": ["list"],
        },
        name_sep="-",
    )

    SESSIONS_MAX_LENGTH = 20
    MINIMUM_SESSION_LENGTH = 2

    groupby_features_nonlist = groupby_features["user_session", "product_id-count"]

    groupby_features_list = groupby_features[
        "price_log_norm-list",
        "product_recency_days_log_norm-list",
        "et_dayofweek_sin-list",
        "et_dayofweek_cos-list",
        "product_id-list",
        "category_id-list",
        "relative_price_to_avg_categ_id-list",
    ]

    groupby_features_trim = (
        groupby_features_list
        >> nvt.ops.ListSlice(0, SESSIONS_MAX_LENGTH)
        >> nvt.ops.Rename(postfix="_seq")
    )

    # calculate session day index based on 'event_time_dt-first' column
    day_index = (
        (groupby_features["event_time_dt-first"])
        >> nvt.ops.LambdaOp(lambda col: (col - col.min()).dt.days + 1)
        >> nvt.ops.Rename(f=lambda col: "day_index")
    )

    selected_features = groupby_features_nonlist + groupby_features_trim + day_index

    filtered_sessions = selected_features >> nvt.ops.Filter(
        f=lambda df: df["product_id-count"] >= MINIMUM_SESSION_LENGTH
    )

    dataset = nvt.Dataset(df)

    workflow = nvt.Workflow(filtered_sessions)
>       workflow.fit(dataset)

tests/unit/test_tf4rec.py:198:


nvtabular/workflow/workflow.py:209: in fit
self._transform_impl(dataset, capture_dtypes=True).sample_dtypes()
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:1147: in sample_dtypes
_real_meta = self.engine.sample_data(n=n)
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset_engine.py:71: in sample_data
_head = _ddf.partitions[partition_index].head(n)
/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py:1140: in head
return self._head(n=n, npartitions=npartitions, compute=compute, safe=safe)
/usr/local/lib/python3.8/dist-packages/dask/dataframe/core.py:1174: in _head
result = result.compute()
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/utils.py:40: in apply
return func(*args, **kwargs)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:56: in apply
parent_df = self.apply(df, [parent], capture_dtypes=capture_dtypes)
nvtabular/workflow/executor.py:85: in apply
output_df = node.op.transform(selection, input_df)
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
nvtabular/ops/groupby.py:132: in transform
new_df = _apply_aggs(
nvtabular/ops/groupby.py:245: in _apply_aggs
df[f"{col}{name_sep}{_agg}"] = _first_or_last(
nvtabular/ops/groupby.py:289: in _first_or_last
return _first(x)
nvtabular/ops/groupby.py:302: in _first
elements = x.list._column.elements.values


self = <cudf.core.column.datetime.DatetimeColumn object at 0x7f01081eb3c0>
[
2019-12-18 10:54:26,
2020-01-16 05:25:57,
...7:31,
2020-05-22 12:09:32,
2020-11-11 23:04:27,
2020-09-12 05:35:17,
2020-12-26 04:10:41
]
dtype: datetime64[s]

@property
def values(self):
    """
    Return a CuPy representation of the DateTimeColumn.
    """
>       raise NotImplementedError(
        "DateTime Arrays is not yet implemented in cudf"
    )

E NotImplementedError: DateTime Arrays is not yet implemented in cudf

/usr/local/lib/python3.8/dist-packages/cudf/core/column/datetime.py:210: NotImplementedError
------------------------------ Captured log call -------------------------------
ERROR nvtabular:executor.py:111 Failed to transform operator <nvtabular.ops.groupby.Groupby object at 0x7f01381c2f40>
Traceback (most recent call last):
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/workflow/executor.py", line 85, in apply
output_df = node.op.transform(selection, input_df)
File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 132, in transform
new_df = _apply_aggs(
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 245, in _apply_aggs
df[f"{col}{name_sep}{_agg}"] = _first_or_last(
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 289, in _first_or_last
return _first(x)
File "/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/ops/groupby.py", line 302, in _first
elements = x.list._column.elements.values
File "/usr/local/lib/python3.8/dist-packages/cudf/core/column/datetime.py", line 210, in values
raise NotImplementedError(
NotImplementedError: DateTime Arrays is not yet implemented in cudf
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_tf4rec.py::test_tf4rec - NotImplementedError: DateTime...
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 706.13s (0:11:46) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins9948493744242059586.sh

@github-actions

Documentation preview

https://nvidia-merlin.github.io/NVTabular/review/pr-1654

@nvidia-merlin-bot (Contributor)

CI Results
GitHub pull request #1654 of commit 7ff98a4f2b978dcc5c1c3dcad001122eea28e52d, no merge conflicts.
Running as SYSTEM
Setting status of 7ff98a4f2b978dcc5c1c3dcad001122eea28e52d to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4644/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1654/*:refs/remotes/origin/pr/1654/* # timeout=10
 > git rev-parse 7ff98a4f2b978dcc5c1c3dcad001122eea28e52d^{commit} # timeout=10
Checking out Revision 7ff98a4f2b978dcc5c1c3dcad001122eea28e52d (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7ff98a4f2b978dcc5c1c3dcad001122eea28e52d # timeout=10
Commit message: "fix tf4rec unittest"
 > git rev-list --no-walk fcf24b4d7c36a29975a5db431e9ac7ebe25a6acc # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins15873753808315941829.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1429 passed, 2 skipped, 618 warnings in 694.75s (0:11:34) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins13380384454792578178.sh
