Clean up the way shapes are computed and specified #1760

karlhigley · 2023-02-13T16:06:32Z

This replaces shapes specified implicitly via the is_list/is_ragged flag or via the value_count property with more explicitly computed and specified shapes.

Depends on NVIDIA-Merlin/core#215
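To make the change concrete, here is a minimal sketch of how the older implicit flags could map onto explicit shapes. It does not use the Merlin API. The function names and the tuple representation (`None` for an unknown dimension, a `(min, max)` pair for a bounded one) are assumptions made purely for illustration.

```python
from typing import Optional, Tuple, Union

# Assumption: a dimension is None (unknown) or a (min, max) pair of bounds.
Dim = Union[None, Tuple[int, int]]
Shape = Tuple[Dim, ...]


def legacy_to_shape(is_list: bool, value_count: Optional[dict] = None) -> Shape:
    """Translate implicit flags into an explicit (rows, values_per_row) shape."""
    if not is_list:
        return (None,)  # scalar column: only the row dimension exists
    if value_count and "max" in value_count:
        return (None, (value_count.get("min", 0), value_count["max"]))
    return (None, None)  # list column with unknown, possibly varying length


def is_ragged(shape: Shape) -> bool:
    """A list shape is ragged unless its inner dimension has equal bounds."""
    return len(shape) > 1 and any(d is None or d[0] != d[1] for d in shape[1:])


print(legacy_to_shape(is_list=False))
print(legacy_to_shape(is_list=True))
print(legacy_to_shape(is_list=True, value_count={"min": 3, "max": 3}))
```

With an explicit shape, raggedness no longer needs its own flag: in this sketch it can be derived from the bounds of the inner dimension.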

oliverholworthy · 2023-02-13T19:32:16Z

nvtabular/ops/groupby.py

@@ -18,6 +18,7 @@
 from dask.dataframe.utils import meta_nonempty

 from merlin.core.dispatch import DataFrameType, annotate
+from merlin.dtypes.shape import DefaultShapes


Is this going to be a new feature of dtypes in core?

Shapes are implemented as a subfield of dtypes in Core, but this isn't really intended to be a feature of dtypes in particular. We might want to hide that implementation detail a bit more thoroughly by adjusting the imports.

As far as the defaults go, we just thought it was easier to read and understand than needing to remember which shapes mean what.
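The readability argument can be illustrated with a small stand-in. This is not the `DefaultShapes` implementation from Core; the `NamedShapes` class and the tuple values below are hypothetical and only show why a name is easier to read than a bare shape literal.

```python
# Hypothetical stand-in for named default shapes (not the Merlin implementation).
class NamedShapes:
    SCALAR = (None,)     # one value per row, unknown number of rows
    LIST = (None, None)  # one variable-length list per row


def describe(shape):
    """Return a human-readable label for a shape tuple."""
    if shape == NamedShapes.SCALAR:
        return "scalar"
    if shape == NamedShapes.LIST:
        return "list"
    return "custom"


# The named constant states the intent; the bare tuple has to be decoded.
print(describe(NamedShapes.LIST))
print(describe((None, None)))
```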

The DefaultShapes.LIST object seems like it could be useful, but the tests fail to find it in core: ImportError: cannot import name 'DefaultShapes' from 'merlin.dtypes.shape'

That's expected, since there is (or was) an outstanding Core PR that adds it.

Looks like it's still open: NVIDIA-Merlin/core#215

oliverholworthy · 2023-02-13T19:36:00Z

nvtabular/ops/list_slice.py

@@ -129,7 +129,7 @@ def transform(self, col_selector: ColumnSelector, df: DataFrameType) -> DataFram

    def _compute_dtype(self, col_schema, input_schema):


Does this method still need to be overridden with this change, or could it be removed?

It could be removed, but that breaks some of the tests even though the functionality works without it. (There are actually two places like this.)

oliverholworthy · 2023-02-14T12:57:52Z

nvtabular/loader/backend.py

-    sparse_names=None,
-    sparse_max=None,
-    sparse_as_dense=False,
+    padded_cols=None,


Could renaming these arguments be out of scope for this PR? It may be a breaking change for anything that uses this function, so it might be clearer to split it into a separate PR.

We couldn't really figure out what this function was supposed to do without the renames, so we went for it. The function name starts with an _, so we've appropriately signaled that external code shouldn't depend on its stability. The two places in NVTabular that use it call it via argument order instead of specifying the names, so I understand the caution, but I think we're okay.

Looks like it will be a safe change, and the names make things clearer :). We may consider removing this function along with the other loader code, to follow up on the promise here, in one of the upcoming releases.

And by that point we'll hopefully have updated the dataloader API to a point where this augment_schema function is no longer required either here or the copies in Transformers4Rec and in Merlin Models. The existence of this function suggests to me that there's something missing from the dataloader API as it currently exists.

I think the something that's missing is "transforms implemented as operators that provide schema tracking" 😺

oliverholworthy · 2023-02-14T13:06:18Z

nvtabular/ops/groupby.py

+        agg_is_lists = {"list": True}
+
+        agg = self._find_agg(col_schema, input_schema)
+        is_list = agg_is_lists.get(agg, col_schema.is_list)


In the case where we fall back to the second argument of .get here and check the col_schema.is_list attribute: if we have a fixed-length list, would we like to preserve that information in the shape later on, instead of turning it into a ragged list, which is presumably the default shape corresponding to DefaultShapes.LIST?

I'm not actually sure. Are there GroupBy aggregations that don't change the shape of list columns?

From looking at the list of aggregations, I think everything changes the shape, either from list to scalar or from scalar to list.

I wonder if the default there should actually be False 🤔
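The two candidate defaults can be compared in isolation. The sketch below mirrors the `agg_is_lists.get(agg, ...)` lookup from the diff, but `output_is_list` and its `fallback_to_input` switch are assumptions added for illustration, not part of the operator.

```python
# Same lookup table as in the diff: only the "list" agg is known to emit lists.
AGG_IS_LIST = {"list": True}


def output_is_list(agg, col_is_list, fallback_to_input=True):
    """Decide whether an aggregated column is a list column.

    fallback_to_input=True mimics the col_schema.is_list default;
    fallback_to_input=False mimics a hard-coded False default.
    """
    default = col_is_list if fallback_to_input else False
    return AGG_IS_LIST.get(agg, default)


# The two defaults only disagree for a non-"list" agg on a list input column.
print(output_is_list("sum", col_is_list=True, fallback_to_input=True))
print(output_is_list("sum", col_is_list=True, fallback_to_input=False))
print(output_is_list("list", col_is_list=False, fallback_to_input=False))
```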

In practice, it seems that it may not matter whether the default is col_schema.is_list or False. I tried running Groupby with an agg that is not "list" on a dataframe with list features, and we get errors in both cudf and pandas.

    from merlin.io import Dataset
    import nvtabular as nvt
    import cudf

    df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
    workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["sum"]))
    workflow.fit_transform(Dataset(df)).compute()
    # => Raises DataError: All requested aggregations are unsupported.

Some of these aggs, like sum or mean, could in theory work on list features if we wanted them to.

Pandas for example, handles sum across lists as concatenation.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
    df.groupby("a").sum()
    # =>
    #           b
    # a
    # 1  [10, 20]
    # 2      [20]

Or, if the elements are numpy arrays, as an element-wise sum:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2], "b": [np.array([10]), np.array([20]), np.array([20])]})
    df.groupby("a").sum()
    # =>
    #       b
    # a
    # 1  [30]
    # 2  [20]

var/std/mean/median also work, and in this example they return scalars. If the elements were arrays of more than one dimension, then mean could start returning a list type too.

Since cudf doesn't appear to support aggregating across list columns, this is probably something we don't need to be concerned about for now.

    import cudf

    df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
    df.groupby("a").sum()
    # => Raises DataError: All requested aggregations are unsupported.
    df["b"].sum()
    # => Raises TypeError: cannot perform sum with type list

So I think False could be fine as the default here, and as far as this PR goes, it's no less clear than it was before.

If we need groupby aggregations for list columns as a future feature of NVTabular this will need to be revisited. I suppose even if cudf and pandas don't natively support this we could implement this ourselves by extracting the cupy/numpy arrays from the series in our own agg function to handle list column aggregations.

Currently, it seems that the only agg supported for list columns is list, which adds one dimension to the shape of the original list:

    from merlin.io import Dataset
    import nvtabular as nvt
    import cudf

    df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
    workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["list"]))
    workflow.fit_transform(Dataset(df)).compute()
    # =>
    #    a        b_list
    # 0  1  [[10], [20]]
    # 1  2        [[20]]
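The extra dimension can be reproduced without cudf or NVTabular. The following is a pure-Python sketch of a list aggregation; `groupby_list` and `depth` are hypothetical helpers and do not reflect how the Groupby operator is implemented.

```python
from collections import defaultdict


def groupby_list(keys, values):
    """Collect the values belonging to each key into a list (a "list" agg)."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


def depth(value):
    """Count nesting levels by following the first element of each list."""
    levels = 0
    while isinstance(value, list):
        levels += 1
        value = value[0] if value else None
    return levels


rows = [[10], [20], [20]]
result = groupby_list([1, 1, 2], rows)
print(result)
print(depth(rows[0]), depth(result[1]))
```

Each input element has one level of nesting, while each aggregated value has two, which is the shape change that a schema for the output column would need to reflect.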

That's great to know; I really appreciate your thoroughness in testing this out. This probably warrants a further update to the shapes here, I'll open a separate issue.

Tracked here: #1763

oliverholworthy · 2023-02-15T13:28:38Z

nvtabular/loader/backend.py

-            properties["value_count"] = {"max": sparse_max[col]}
-        if sparse_as_dense:
-            properties["value_count"]["min"] = properties["value_count"]["max"]
+        dims = Shape(((1, batch_size), None))


What is the batch_size required for?

Not required for anything specific, just following the principle that the schema should always accurately reflect the data to the greatest extent possible. Here we have shape information since we know the batch size, so we fill it in in case that helps something downstream. I don't know if it actually will, but it seemed like the right thing to do.
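As a rough illustration of what a shape like `((1, batch_size), None)` expresses, the sketch below checks concrete array shapes against per-dimension bounds. The `conforms` helper and the plain-tuple representation are assumptions for this example; they are not the `Shape` class from Core.

```python
def conforms(array_shape, dims):
    """Check a concrete shape against bounds; None means "any size"."""
    if len(array_shape) != len(dims):
        return False
    for size, dim in zip(array_shape, dims):
        if dim is None:
            continue  # unknown dimension: nothing to check
        lower, upper = dim
        if not lower <= size <= upper:
            return False
    return True


batch_size = 4
dims = ((1, batch_size), None)  # between 1 and batch_size rows, any row length

print(conforms((4, 7), dims))  # a full batch
print(conforms((3, 2), dims))  # a smaller final batch
print(conforms((5, 2), dims))  # more rows than the batch size allows
```

The lower bound of 1 leaves room for a final batch that is smaller than `batch_size`, which is one reason a range is a better fit here than a single fixed value.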

karlhigley added 6 commits February 9, 2023 13:35

Separate shape and dtype computation in GroupBy operator

29b4790

Compute shape instead of properties in ValueCount operator

1fd0f6f

Compute shape instead of properties in ListSlice operator

8b4ed58

Compute shape instead of properties in JoinGroupBy operator

00f1820

Compute shapes instead of is_list in dataloaders

60367e8

Merge remote-tracking branch 'origin/refactor/shape-computation' into refactor/shape-computation

37bf594

karlhigley added the clean up and chore labels Feb 13, 2023

karlhigley added this to the Merlin 23.03 milestone Feb 13, 2023

karlhigley requested a review from oliverholworthy February 13, 2023 16:06

karlhigley assigned karlhigley and jperez999 Feb 13, 2023

Apply auto-formatting to appease the linter

ff01a15

oliverholworthy reviewed Feb 13, 2023

oliverholworthy reviewed Feb 14, 2023

Merge branch 'main' into refactor/shape-computation

4b0e26b

karlhigley modified the milestones: Merlin 23.03, Merlin 23.02 Feb 14, 2023

karlhigley added 4 commits February 14, 2023 14:35

Separate dtype and shape computation in JoinGroupBy

f6efae2

Merge remote-tracking branch 'origin/refactor/shape-computation' into refactor/shape-computation

4d63c85

Appease the linter by using a generator

acc9256

Merge branch 'main' into refactor/shape-computation

0aed473

karlhigley requested a review from oliverholworthy February 15, 2023 00:54

oliverholworthy reviewed Feb 15, 2023

oliverholworthy approved these changes Feb 15, 2023

karlhigley merged commit 7e1b198 into NVIDIA-Merlin:main Feb 15, 2023

karlhigley mentioned this pull request Feb 15, 2023

[BUG] GroupBy schema shapes are incorrect when using list agg on list column #1763

Open

karlhigley mentioned this pull request Feb 28, 2023

[BUG] ListSlice(pad=True) does not generate valuecount.min and max properties in the schema #1749

Closed
