Orca: Add 2 NCF PyTorch examples with data_loader or XShards as inputs. #5691
Conversation
#5738 can remove model_dir after this PR merges.
# Step 0: Parameters And Configuration

Config={
Can we use command line option and arguments instead of config dict?
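A minimal sketch of the suggested argparse-based alternative, covering the settings that appear in this diff (backend, model_dir, batch_size); the defaults here are illustrative:

```python
import argparse

# Hypothetical replacement for the Config dict: expose the settings as command-line options.
parser = argparse.ArgumentParser(description="Orca NCF PyTorch example")
parser.add_argument("--backend", type=str, default="spark",
                    help='The Orca Estimator backend, either "ray" or "spark".')
parser.add_argument("--model_dir", type=str, default="./model_dir/",
                    help="Directory for saving and loading the model.")
parser.add_argument("--batch_size", type=int, default=256,
                    help="Batch size passed to Estimator.fit.")
args = parser.parse_args()
```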
"model_dir": "./model_dir/", | ||
} | ||
|
||
Config["train_rating"]=Config["main_path"]+ Config["dataset"]+".train.rating" |
Pls check code style. (space between operators.)
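For reference, the same assignment with spaces around the operators:

```python
Config["train_rating"] = Config["main_path"] + Config["dataset"] + ".train.rating"
```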
invalidInputError(isinstance(right, SparkXShards), "right should be a SparkXShards")

from bigdl.orca.data.utils import spark_df_to_pd_sparkxshards
left_df, right_df=left.to_spark_df(), right.to_spark_df()
Pls check code style.
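For reference, the same line with spaces around the assignment operator:

```python
left_df, right_df = left.to_spark_df(), right.to_spark_df()
```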
Can we merge the three train_*.py files?
To demonstrate the different inputs, it is clearer to use separate scripts.
# transform dataset into dict
#train_data = train_data.to_numpy()
#test_data = test_data.to_numpy()
#train_data = {"x": train_data[:, : -1].astype(np.int64),
#              "y": train_data[:, -1].astype(np.float)}
#test_data = {"x": test_data[:, : -1].astype(np.int64),
#             "y": test_data[:, -1].astype(np.float)}
remove these comments?
def forward(self, *args):
    user, item = args[0], args[1]
put user, item in the args directly?
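A sketch of the suggested signature; the rest of the forward pass is elided:

```python
def forward(self, user, item):
    # user and item arrive as named parameters instead of being unpacked from *args.
    ...
```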
import numpy as np
import pandas as pd
import scipy.sparse as sp
move import scipy to local?
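A sketch of moving the scipy import into the function that builds the sparse training matrix; the function name and body below are illustrative, not the PR's actual code:

```python
import numpy as np

def build_train_mat(train_pairs, user_num, item_num):
    # Import scipy.sparse only where the sparse matrix is needed,
    # so it is no longer a module-level dependency.
    import scipy.sparse as sp

    train_mat = sp.dok_matrix((user_num, item_num), dtype=np.float32)
    for user, item in train_pairs:
        train_mat[user, item] = 1.0
    return train_mat
```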
train_data, _ = train_test_split(data_X, test_size=0.1, random_state=100)

train_dataset = NCFData(train_data, item_num=item_num, train_mat=train_mat, num_ng=4, is_training=True)
train_loader = data.DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=0)
num_workers=4 in the original code?
batch_size=batch_size, and put 256 in fit
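A sketch combining both suggestions: restore num_workers=4 from the original NCF script and take the batch size from the fit call instead of hard-coding 256. The (config, batch_size) creator signature is an assumption here:

```python
def train_loader_func(config, batch_size):
    train_dataset = NCFData(train_data, item_num=item_num, train_mat=train_mat,
                            num_ng=4, is_training=True)
    # batch_size is supplied by est.fit(..., batch_size=256) rather than hard-coded here.
    return data.DataLoader(train_dataset, batch_size=batch_size,
                           shuffle=True, num_workers=4)
```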
_, test_data = train_test_split(data_X, test_size=0.1, random_state=100)

test_dataset = NCFData(test_data)
test_loader = data.DataLoader(test_dataset, shuffle=False, num_workers=0)
missing batch_size
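For reference, a sketch of the test loader with an explicit batch size (same assumed creator signature as above); without it, DataLoader falls back to its default of 1:

```python
def test_loader_func(config, batch_size):
    test_dataset = NCFData(test_data)
    return data.DataLoader(test_dataset, batch_size=batch_size,
                           shuffle=False, num_workers=0)
```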
                           loss=loss_function, metrics=[Accuracy()], backend=backend)

# Fit the estimator
est.fit(data=train_loader_func, epochs=1)
the original script trains for 20 epochs?
batch_size=256
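A sketch of the fit call with both suggestions applied, using the values mentioned in the comments:

```python
# Train for the 20 epochs used by the original NCF script and pass an explicit batch size.
est.fit(data=train_loader_func, epochs=20, batch_size=256)
```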
# Step 5: Save and Load the Model

# Evaluate the model
result = est.evaluate(data=test_loader_func)
Add one more print to say it is evaluation results?
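A sketch of the suggested print, assuming evaluate returns a dict of metric values:

```python
result = est.evaluate(data=test_loader_func)
print("Evaluation results:")
for name, value in result.items():
    print(f"{name}: {value}")
```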
import numpy as np
import pandas as pd
import scipy.sparse as sp
same as above
# Step 2: Define Dataset

from bigdl.orca.data import XShards
is this import necessary?
    return data_XY


def transform_to_dict(data):
rename this func
data_XY["y"] = labels_fill
data_XY["y"] = data_XY["y"].astype(np.float)
use label as the column name?
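A sketch with the suggested column name; the downstream feature_cols/label_cols arguments would change to match. np.float32 is used because the bare np.float alias is deprecated in recent NumPy versions:

```python
# Hypothetical: use "label" instead of "y" as the label column name.
data_XY["label"] = labels_fill
data_XY["label"] = data_XY["label"].astype(np.float32)
```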
Add NCF PyTorch examples `train_data_loader.py` and `train_xshards.py` to the `NCF` directory, with a shared NCF model in `model.py`.

1. The `train_data_loader.py` example takes a `data_loader` as the input of the model, supporting fitting the estimator with the `ray` or `spark` backend:

       # create the estimator
       est = Estimator.from_torch(model=model_creator, optimizer=optimizer_creator,
                                  loss=loss_function, metrics=[Accuracy()],
                                  backend=Config["backend"])  # backend="ray" or "spark"

       # fit the estimator
       est.fit(data=train_loader_func, epochs=1)

2. The `train_xshards.py` example takes `XShards` as the input of the model, supporting fitting the estimator with the `ray` or `spark` backend:

       # create the estimator
       est = Estimator.from_torch(model=model_creator, optimizer=optimizer_creator,
                                  loss=loss_function, metrics=[Accuracy()],
                                  backend=Config["backend"])  # backend="ray" or "spark"

       # fit the estimator
       est.fit(data=train_shards, epochs=1, batch_size=Config["batch_size"],
               feature_cols=["x"], label_cols=["y"])