Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xshard max, split, and select on Ray #2300

Closed
wants to merge 4 commits into from
Closed

Conversation

jenniew
Copy link
Contributor

@jenniew jenniew commented May 6, 2020

Add min/max operation. Related issue: https://github.com/analytics-zoo/orca/issues/6
Add train_test_split operation. Related issue: https://github.com/analytics-zoo/orca/issues/4
Add selection with column name/slice range. Related issue: https://github.com/analytics-zoo/orca/issues/5
Add **kwargs support for read_csv, read_json. Related issue:https://github.com/analytics-zoo/orca/issues/20
Change DataShards.apply to DataShards.transform_shard.
Add unit tests for all new APIs.
Fix xshard spark issues.
Refactor code.

@@ -74,18 +74,113 @@ def test_repartition(self):

def test_apply(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_transform_shard

else:
raise Exception("Unsupported file type")
df_list.append(df)
self.data = pd.concat(df_list)
return 0

def apply(self, func, *args):
self.data = func(self.data, *args)
def transform_shard(self, func, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to transform

self.rows = rows
self.columns = columns

def transform_shard(self, func, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to transform

@jason-dai
Copy link
Contributor

jason-dai commented May 6, 2020

There are too many changes in this PR, which makes it difficult to review and verify. We should make sure each PR only contains the minimum feature sets. One fundamental issue with the changes is that how to maintain multiple DataShards with one common set of underlying Actors. For instance,

s = xshard.pands.read_csv(...)
s1 = s['user_id']
s1.tranform_shard(func) #what is the expected input of func? will it also change s?

And

s = xshard.pands.read_csv(...)
s1, s2 = train_test_split(s)
s.transform_shard(func) #will it change s1 and s2?

I suggest we close this PR and open several smaller ones (e.g., fixing https://github.com/analytics-zoo/orca/issues/20 first)

@jenniew
Copy link
Contributor Author

jenniew commented May 6, 2020

There are too many changes in this PR, which makes it difficult to review and verify. We should make sure each PR only contains the minimum feature sets. One fundamental issue with the changes is that how to maintain multiple DataShards with one common set of underlying Actors. For instance,

s = xshard.pands.read_csv(...)
s1 = s['user_id']
s1.tranform_shard(func) #what is the expected input of func? will it also change s?

And

s = xshard.pands.read_csv(...)
s1, s2 = train_test_split(s)
s.transform_shard(func) #will it change s1 and s2?

I suggest we close this PR and open several smaller ones (e.g., fixing analytics-zoo/orca#20 first)

The current transform would change data in s. We need to change the implementation to return new DataFrame or Series. Change Actor implementation or just use object id no Actor.
Already create PR: https://github.com/intel-analytics/analytics-zoo/pull/2305 for (https://github.com/analytics-zoo/orca/issues/20).
Close this PR.

@jenniew jenniew closed this May 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants