-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up the way shapes are computed and specified #1760
Clean up the way shapes are computed and specified #1760
Conversation
… refactor/shape-computation
@@ -18,6 +18,7 @@ | |||
from dask.dataframe.utils import meta_nonempty | |||
|
|||
from merlin.core.dispatch import DataFrameType, annotate | |||
from merlin.dtypes.shape import DefaultShapes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be a new feature of dtypes in core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shapes are implemented as a subfield of dtypes in Core, but this isn't really intended to be a feature of dtypes in particular. We might want to hide that implementation detail a bit more thoroughly by adjusting the imports.
As far as the defaults go, we just thought it was easier to read and understand than needing to remember which shapes mean what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DefaultShapes.LIST
object seems like it could be useful, but the tests are failing to find this in core ImportError: cannot import name 'DefaultShapes' from 'merlin.dtypes.shape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expected since there is (or was) an outstanding Core PR that adds it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's still open: NVIDIA-Merlin/core#215
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged now
@@ -129,7 +129,7 @@ def transform(self, col_selector: ColumnSelector, df: DataFrameType) -> DataFram | |||
|
|||
def _compute_dtype(self, col_schema, input_schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method required to be overriden with this change? or could it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be removed, but that breaks some of the tests even though the functionality works without it. (There are actually two places like this.)
sparse_names=None, | ||
sparse_max=None, | ||
sparse_as_dense=False, | ||
padded_cols=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renaming these arguments could be out-of-scope for this PR? Since it may be a breaking change for something that uses this function. It may be clearer to separate this into a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We couldn't really figure out what this function was supposed to do without the renames, so we went for it. The function name starts with an _
so we've appropriately signaled that external code shouldn't depend on its stability. The two places in the NVTabular that use it call it via argument order instead of specifying the names, so I understand the caution but I think we're okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it will be a safe change, and the names make things clearer :). We may consider removing this functon along with the other loader code to follow-up on the promise here in one of the upcoming releases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And by that point we'll hopefully have updated the dataloader API to a point where this augment_schema function is no longer required either here or the copies in Transformers4Rec and in Merlin Models. The existence of this function suggests to me that there's something missing from the dataloader API as it currently exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the something that's missing is "transforms implemented as operators that provide schema tracking" 😺
agg_is_lists = {"list": True} | ||
|
||
agg = self._find_agg(col_schema, input_schema) | ||
is_list = agg_is_lists.get(agg, col_schema.is_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where we have fallback to the second argument of the .get
here and check the col_schema.is_list attribute. If we have a fixed list, would we like to preserve that info in the shape later on instead of turning in into a ragged list which is presumably default shape corresponding to DefaultShapes.LIST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure. Are there GroupBy
aggregations that don't change the shape of list columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From looking at the list of aggregations, I think everything changes the shape, either from list to scalar or from scalar to list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the default there should actually be False
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice it seems that it may not matter whether it's col_schema.is_list
or False
. I tried running Groupby with a an agg that is not "list"
on a dataframe with list feautres and we get errors in both cudf and pandas
from merlin.io import Dataset
import nvtabular as nvt
import cudf
df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["sum"]))
workflow.fit_transform(Dataset(df)).compute()
# => Raises DataError: All requested aggregations are unsupported.
Some of these aggs, like sum
or mean
could in theory work on list features if we wanted them to.
Pandas for example, handles sum across lists as concatenation.
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
df.groupby("a").sum()
# =>
b
a
1 [10, 20]
2 [20]
or if numpy arrays, then as an element-wise sum
df = pd.DataFrame({"a": [1, 1, 2], "b": [np.array([10]), np.array([20]), np.array([20])]})
df.groupby("a").sum()
# =>
b
a
1 [30]
2 [20]
var/std/mean/median
also work and in this example return scalars. If the element type contained an array of more then 1 dimension, then mean
could start returning a list type too
since cudf
doesn't appear to support aggregating across list columns this is probably something we don't need to be concerned about for now.
import cudf
df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
df.groupby("a").sum()
# => Raises DataError: All requested aggregations are unsupported.
df["b"].sum()
# => Raises TypeError: cannot perform sum with type list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think False
could be fine as the default here, and as far as this PR goes, it's no less clear than it was before.
If we need groupby aggregations for list columns as a future feature of NVTabular this will need to be revisited. I suppose even if cudf and pandas don't natively support this we could implement this ourselves by extracting the cupy/numpy arrays from the series in our own agg function to handle list column aggregations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently it seems that the only agg that is supported for list columns is list
which will result in adding one additional dimension to the shape of the original list
from merlin.io import Dataset
import nvtabular as nvt
import cudf
df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["list"]))
workflow.fit_transform(Dataset(df)).compute()
# =>
a b_list
0 1 [[10], [20]]
1 2 [[20]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great to know; I really appreciate your thoroughness in testing this out. This probably warrants a further update to the shapes here, I'll open a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked here: #1763
properties["value_count"] = {"max": sparse_max[col]} | ||
if sparse_as_dense: | ||
properties["value_count"]["min"] = properties["value_count"]["max"] | ||
dims = Shape(((1, batch_size), None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the batch_size
required for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required for anything specific, just following the principle that the schema should always accurately reflect the data to the greatest extent possible. Here we have shape information since we know the batch size, so we fill it in in case that helps something downstream. I don't know if it actually will, but it seemed like the right thing to do.
This replaces shapes specified implicitly via the
is_list
/is_ragged
flag or via thevalue_count
property with more explicitly computed and specified shapes.Depends on NVIDIA-Merlin/core#215