Xshard max, split, and select on Ray #2300
@@ -74,18 +74,113 @@ def test_repartition(self):

    def test_apply(self):
Rename this to test_transform_shard.
            else:
                raise Exception("Unsupported file type")
            df_list.append(df)
        self.data = pd.concat(df_list)
        return 0

    def apply(self, func, *args):
        self.data = func(self.data, *args)
    def transform_shard(self, func, *args, **kwargs):
Change this to transform.
        self.rows = rows
        self.columns = columns

    def transform_shard(self, func, *args, **kwargs):
Change this to transform.
This PR contains too many changes, which makes it difficult to review and verify. Each PR should contain only a minimal feature set. One fundamental issue with the changes is how to maintain multiple DataShards that share one common set of underlying Actors. For instance,
And
I suggest we close this PR and open several smaller ones (e.g., fixing https://github.com/analytics-zoo/orca/issues/20 first).
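To make the shared-Actor concern concrete, here is a minimal sketch in plain Python. It is not the reviewer's original example and does not use Ray: `PartitionHolder` and `SharedShards` are hypothetical stand-ins for an Actor and a DataShards handle, and the in-place assignment mirrors the `self.data = func(self.data, *args)` line in the diff.

```python
class PartitionHolder:
    """Hypothetical stand-in for an actor that owns one partition."""

    def __init__(self, data):
        self.data = data

    def apply(self, func):
        # Overwrites the held partition, as the apply method in the diff does.
        self.data = func(self.data)

    def get(self):
        return self.data


class SharedShards:
    """Hypothetical handle; several handles may point at the same holders."""

    def __init__(self, holders):
        self.holders = holders

    def transform_shard(self, func):
        for holder in self.holders:
            holder.apply(func)
        # A new handle is returned, but it still references the same holders.
        return SharedShards(self.holders)

    def collect(self):
        return [holder.get() for holder in self.holders]


original = SharedShards([PartitionHolder([1, 2]), PartitionHolder([3])])
doubled = original.transform_shard(lambda xs: [x * 2 for x in xs])

# Both handles now see the doubled data because the holders are shared.
print(original.collect())
print(doubled.collect())
```

Under these assumptions, the handle created before the transform no longer sees its original data, which is the kind of aliasing that makes several DataShards over one set of Actors hard to reason about.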
The current transform would modify the data in s. We need to change the implementation to return a new DataFrame or Series: either change the Actor implementation, or use object IDs without Actors.
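A non-mutating transform could look roughly like the sketch below. It is only an illustration in plain Python: `LocalShards` is a hypothetical class, lists of dicts stand in for pandas DataFrames, and no Ray Actors or object IDs are involved. The point is only that each call produces new partitions and leaves the source untouched.

```python
import copy


class LocalShards:
    """Hypothetical stand-in for DataShards holding in-memory partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def transform_shard(self, func, *args, **kwargs):
        # Run func on a deep copy of each partition and wrap the results in a
        # new object, so the source partitions are never modified.
        new_partitions = [
            func(copy.deepcopy(part), *args, **kwargs) for part in self.partitions
        ]
        return LocalShards(new_partitions)

    def collect(self):
        return list(self.partitions)


def add_offset(rows, offset=0):
    # This function mutates its input, which is safe here because
    # transform_shard hands it a copy.
    for row in rows:
        row["value"] += offset
    return rows


shards = LocalShards([[{"value": 1}, {"value": 2}], [{"value": 3}]])
shifted = shards.transform_shard(add_offset, offset=10)

print(shards.collect())   # source data is unchanged
print(shifted.collect())  # new object holds the transformed data
```

Copying every partition is the simplest way to guarantee isolation in a sketch; a real implementation would more likely rely on the transform function returning a new object rather than copying eagerly.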
Add min/max operation. Related issue: https://github.com/analytics-zoo/orca/issues/6
Add train_test_split operation. Related issue: https://github.com/analytics-zoo/orca/issues/4
Add selection with column name/slice range. Related issue: https://github.com/analytics-zoo/orca/issues/5
Add **kwargs support for read_csv and read_json. Related issue: https://github.com/analytics-zoo/orca/issues/20
Change DataShards.apply to DataShards.transform_shard.
Add unit tests for all new APIs.
Fix xshard spark issues.
Refactor code.
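The min/max, train_test_split, and selection operations listed above can be pictured with the following sketch. The function names and signatures here are hypothetical and are not the API added by this PR; partitions are plain lists of dict rows rather than pandas DataFrames held by Ray.

```python
import random


def shard_max(partitions, column):
    # Two-step reduce: a local max per non-empty partition, then a global max.
    local = [max(row[column] for row in part) for part in partitions if part]
    return max(local)


def shard_min(partitions, column):
    local = [min(row[column] for row in part) for part in partitions if part]
    return min(local)


def shard_select(partitions, columns):
    # Project every row onto the requested columns, returning new partitions.
    return [[{c: row[c] for c in columns} for row in part] for part in partitions]


def shard_train_test_split(partitions, test_fraction=0.25, seed=0):
    # Split each partition independently so no rows move between partitions.
    rng = random.Random(seed)
    train, test = [], []
    for part in partitions:
        rows = list(part)
        rng.shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        test.append(rows[:n_test])
        train.append(rows[n_test:])
    return train, test


partitions = [
    [{"a": 1, "b": 5}, {"a": 4, "b": 2}],
    [{"a": 3, "b": 9}],
    [],
]

print(shard_max(partitions, "a"))
print(shard_min(partitions, "b"))
print(shard_select(partitions, ["a"]))
train, test = shard_train_test_split(partitions, test_fraction=0.5)
print(sum(len(p) for p in train), sum(len(p) for p in test))
```

Splitting per partition keeps the operation local to each shard, at the cost of the overall test fraction being only approximate when partitions are small.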