
More flexibility in PyTorch train input and outputs #4448

Open · hkvision opened this issue Apr 19, 2022 · 3 comments
@hkvision (Contributor)

From studying the vz-recommenders code:

  1. Currently we only support features that are ndarrays: a single feature of shape (batch,) or several features together of shape (batch, n). But if users define the model to take a list of tensors as input, we can't support this. E.g.

def forward(self, f1, f2, f3):  # f1 is a list of torch tensors; f2 and f3 are each a single tensor

For this case, train_batch in training_operator.py would need to be modified to:

output = self.model(features[0:10], features[10], features[11])

Basically, the user's definition of the model forward and their own dataset can be quite flexible (even when their code is not well written), but wherever there is such flexibility we will have trouble detecting the correct behavior.
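The regrouping described above can be sketched as follows. ListInputModel, the 12-element feature list, and the exact split are hypothetical stand-ins (plain Python lists instead of torch tensors), not the actual vz-recommenders or training_operator.py code:

```python
# Hypothetical sketch: a model whose forward takes a list of "tensors"
# plus two single "tensors", and the slicing a train_batch would need
# to do to dispatch a flat feature list into those arguments.

class ListInputModel:
    def forward(self, f1, f2, f3):
        # f1 is a list of 10 features; f2 and f3 are single features.
        return sum(f1) + f2 + f3

def train_batch(model, features):
    # A flat batch of 12 features must be regrouped to match forward():
    # the first 10 go in as a list, the last two as separate arguments.
    return model.forward(features[0:10], features[10], features[11])

model = ListInputModel()
features = list(range(12))          # 12 stand-in feature values
out = train_batch(model, features)  # sum(0..9) + 10 + 11 = 66
```

The point of the sketch is that the split (10 / 1 / 1) is only known from the user's forward signature, which a generic operator cannot infer from the flat feature list alone.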

  2. Also for the output, there may be postprocessing steps in user code, e.g.

y1_pred = out[0].squeeze()

If the output is a single list and the user takes its first element in their own train loop, then our code will treat the list as multiple outputs, which is wrong.
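A minimal sketch of that ambiguity, with plain Python lists standing in for torch tensors (the model and data here are made up for illustration):

```python
# Hypothetical sketch: the model returns its predictions wrapped in one
# list, which the user's own train loop then indexes into and
# post-processes.

def model(batch):
    return [[p * 2 for p in batch], [p * 3 for p in batch]]

out = model([1, 2, 3])

# User's own train loop: take the first element and post-process it.
y1_pred = out[0]                 # [2, 4, 6]

# A generic training operator only sees "a list of length 2" and would
# treat the result as two separate outputs; it cannot tell that apart
# from "one output that the user intends to index into".
assumed_num_outputs = len(out)   # 2
```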

@jason-dai (Contributor)

I think we should take a list of ndarray as input (e.g., for xshards)? @yushan111 @sgwhat

@hkvision (Contributor, Author)

The behavior of PyTorch Dataset and DataLoader is as follows:

  • If __getitem__ in the Dataset returns a list of single features, then the DataLoader will return a list of 1D torch tensors.
  • If __getitem__ in the Dataset directly returns a 1D torch tensor representing that list of single features, then the DataLoader will return one 2D torch tensor.
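The two collation behaviours above can be demonstrated with a minimal example (the two Dataset classes are made up for illustration; this relies on DataLoader's default collate function):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ListDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        # Returns a list of single (scalar) features.
        return [float(idx), float(idx) + 1.0, float(idx) + 2.0]

class TensorDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        # Returns the same features as one 1D tensor.
        return torch.tensor([float(idx), float(idx) + 1.0, float(idx) + 2.0])

list_batch = next(iter(DataLoader(ListDataset(), batch_size=2)))
tensor_batch = next(iter(DataLoader(TensorDataset(), batch_size=2)))

# list_batch is a list of three 1D tensors of shape (2,);
# tensor_batch is a single 2D tensor of shape (2, 3).
```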

Probably the most straightforward way to simulate this behavior is to support nested lists in feature_cols: if an entry of feature_cols is itself a list, we return a list of ndarrays for it.
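One possible shape for this nested-feature_cols idea, as a sketch; gather_features and the column names are hypothetical, not an existing API:

```python
import numpy as np

def gather_features(columns, feature_cols):
    # columns: dict mapping column name -> per-sample values.
    # A flat entry in feature_cols maps to one ndarray; a nested list
    # entry maps to a list of ndarrays, one per inner column.
    features = []
    for entry in feature_cols:
        if isinstance(entry, list):
            features.append([np.asarray(columns[c]) for c in entry])
        else:
            features.append(np.asarray(columns[entry]))
    return features

columns = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
feats = gather_features(columns, [["a", "b"], "c"])
# feats[0] is a list of two ndarrays; feats[1] is one ndarray.
```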

@shanyu-sys (Contributor)

> I think we should take a list of ndarray as input (e.g., for xshards)? @yushan111 @sgwhat

Sorry, I may not be following. xshards already supports a list of ndarrays as input; e.g., estimators take a dictionary of xshards as input: {'x': features, 'y': labels}, where features/labels can be a numpy array or a list of numpy arrays. In the above example, features could be [f1, f2, f3], where f1 is an ndarray of shape (batch, 10), and f2 and f3 are of shape (batch,).
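The input layout described here can be written out concretely; this is just the dictionary shape, with made-up array contents, not a call to a real Estimator:

```python
import numpy as np

batch = 32
f1 = np.random.rand(batch, 10)              # shape (batch, 10)
f2 = np.random.rand(batch)                  # shape (batch,)
f3 = np.random.rand(batch)                  # shape (batch,)
labels = np.random.randint(0, 2, size=batch)

# 'x' holds a list of feature arrays, 'y' holds the labels.
data = {"x": [f1, f2, f3], "y": labels}
```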


3 participants