
More flexibility in PyTorch train input and outputs #4448

Open · hkvision opened this issue Apr 19, 2022 · 3 comments
@hkvision (Contributor)

From studying the vz-recommenders code:

  1. Currently we only support features that are ndarrays: a single feature of shape (batch,) or several features together of shape (batch, n). But if users define the model to take a list of tensors as input, we can't support this. E.g.

def forward(self, f1, f2, f3):  # f1 is a list of torch tensors; f2 and f3 are each a single tensor

For this case, train_batch in training_operator.py would need to be modified to:

output = self.model(features[0:10], features[10], features[11])

Basically, the user's definition of the model forward and their own dataset can be quite flexible (even when their code is not well written), but wherever there is such flexibility we will have trouble detecting the correct behavior.
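The regrouping described above can be sketched as follows. ListInputModel, the 12-element feature list, and the exact split are hypothetical stand-ins (plain Python lists instead of torch tensors), not the actual vz-recommenders or training_operator.py code:

```python
# Hypothetical sketch: a model whose forward takes a list of "tensors"
# plus two single "tensors", and the slicing a train_batch would need
# to do to dispatch a flat feature list into those arguments.

class ListInputModel:
    def forward(self, f1, f2, f3):
        # f1 is a list of 10 features; f2 and f3 are single features.
        return sum(f1) + f2 + f3

def train_batch(model, features):
    # A flat batch of 12 features must be regrouped to match forward():
    # the first 10 go in as a list, the last two as separate arguments.
    return model.forward(features[0:10], features[10], features[11])

model = ListInputModel()
features = list(range(12))          # 12 stand-in feature values
out = train_batch(model, features)  # sum(0..9) + 10 + 11 = 66
```

The point of the sketch is that the split (10 / 1 / 1) is only known from the user's forward signature, which a generic operator cannot infer from the flat feature list alone.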

  2. Also for the output, there may be postprocessing steps in user code, e.g.

y1_pred = out[0].squeeze()

If the output is a single list and the user takes its first element in their own train loop, then our code will treat the list as multiple outputs, which is wrong.
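A minimal sketch of that ambiguity, with plain Python lists standing in for torch tensors (the model and data here are made up for illustration):

```python
# Hypothetical sketch: the model returns its predictions wrapped in one
# list, which the user's own train loop then indexes into and
# post-processes.

def model(batch):
    return [[p * 2 for p in batch], [p * 3 for p in batch]]

out = model([1, 2, 3])

# User's own train loop: take the first element and post-process it.
y1_pred = out[0]                 # [2, 4, 6]

# A generic training operator only sees "a list of length 2" and would
# treat the result as two separate outputs; it cannot tell that apart
# from "one output that the user intends to index into".
assumed_num_outputs = len(out)   # 2
```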

@jason-dai (Contributor)

I think we should take a list of ndarray as input (e.g., for xshards)? @yushan111 @sgwhat

@hkvision (Contributor, Author)

The behavior of PyTorch Dataset and DataLoader is as follows:

  • If __getitem__ in the Dataset returns a list of single features, then the DataLoader will return a list of 1D torch tensors.
  • If __getitem__ in the Dataset directly returns a 1D torch tensor representing that list of single features, then the DataLoader will return one 2D torch tensor.
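The two collation behaviours above can be demonstrated with a minimal example (the two Dataset classes are made up for illustration; this relies on DataLoader's default collate function):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ListDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        # Returns a list of single (scalar) features.
        return [float(idx), float(idx) + 1.0, float(idx) + 2.0]

class TensorDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        # Returns the same features as one 1D tensor.
        return torch.tensor([float(idx), float(idx) + 1.0, float(idx) + 2.0])

list_batch = next(iter(DataLoader(ListDataset(), batch_size=2)))
tensor_batch = next(iter(DataLoader(TensorDataset(), batch_size=2)))

# list_batch is a list of three 1D tensors of shape (2,);
# tensor_batch is a single 2D tensor of shape (2, 3).
```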

Probably the most straightforward way to simulate this behavior is to support nested lists in feature_cols: if an entry of feature_cols is itself a list, we return a list of ndarrays for it.
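One possible shape for this nested-feature_cols idea, as a sketch; gather_features and the column names are hypothetical, not an existing API:

```python
import numpy as np

def gather_features(columns, feature_cols):
    # columns: dict mapping column name -> per-sample values.
    # A flat entry in feature_cols maps to one ndarray; a nested list
    # entry maps to a list of ndarrays, one per inner column.
    features = []
    for entry in feature_cols:
        if isinstance(entry, list):
            features.append([np.asarray(columns[c]) for c in entry])
        else:
            features.append(np.asarray(columns[entry]))
    return features

columns = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
feats = gather_features(columns, [["a", "b"], "c"])
# feats[0] is a list of two ndarrays; feats[1] is one ndarray.
```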

@shanyu-sys (Contributor)

> I think we should take a list of ndarray as input (e.g., for xshards)? @yushan111 @sgwhat

Sorry, I may not be following. xshards already supports a list of ndarrays as input; e.g., estimators take a dictionary of xshards as input: {'x': features, 'y': labels}, where features/labels can be a numpy array or a list of numpy arrays. In the above example, features could be [f1, f2, f3], where f1 is an ndarray of shape (batch, 10), and f2 and f3 are of shape (batch,).
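The input layout described here can be written out concretely; this is just the dictionary shape, with made-up array contents, not a call to a real Estimator:

```python
import numpy as np

batch = 32
f1 = np.random.rand(batch, 10)              # shape (batch, 10)
f2 = np.random.rand(batch)                  # shape (batch,)
f3 = np.random.rand(batch)                  # shape (batch,)
labels = np.random.randint(0, 2, size=batch)

# 'x' holds a list of feature arrays, 'y' holds the labels.
data = {"x": [f1, f2, f3], "y": labels}
```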


3 participants