Xshard max, split, and select on Ray #2300

jenniew · 2020-05-06T05:35:08Z

Add min/max operation. Related issue: https://github.com/analytics-zoo/orca/issues/6
Add train_test_split operation. Related issue: https://github.com/analytics-zoo/orca/issues/4
Add selection with column name/slice range. Related issue: https://github.com/analytics-zoo/orca/issues/5
Add **kwargs support for read_csv and read_json. Related issue: https://github.com/analytics-zoo/orca/issues/20
Change DataShards.apply to DataShards.transform_shard.
Add unit tests for all new APIs.
Fix xshard spark issues.
Refactor code.
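
As a rough illustration of the min/max operation listed above, a maximum over sharded data can be computed by reducing each shard locally and then combining the partial results. The sketch below uses plain Python lists of dicts as stand-ins for shards; `shard_max` and `global_max` are hypothetical names chosen for illustration, not the API added by this PR.

```python
from typing import Dict, List, Optional

Shard = List[Dict[str, float]]  # stand-in for one partition of rows


def shard_max(shard: Shard, column: str) -> Optional[float]:
    """Local reduction: maximum of `column` within one shard (None if empty)."""
    values = [row[column] for row in shard]
    return max(values) if values else None


def global_max(shards: List[Shard], column: str) -> float:
    """Combine the per-shard maxima into a single global maximum."""
    partials = [m for m in (shard_max(s, column) for s in shards) if m is not None]
    if not partials:
        raise ValueError("no rows in any shard")
    return max(partials)


shards = [
    [{"user_id": 1, "rating": 3.0}, {"user_id": 2, "rating": 4.5}],
    [],  # empty shards contribute nothing and are skipped
    [{"user_id": 3, "rating": 2.0}],
]
print(global_max(shards, "rating"))
```

The same two-step shape (local reduce, then combine) applies to min by swapping `max` for `min`.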

jason-dai · 2020-05-06T08:57:28Z

pyzoo/test/zoo/xshard/test_ray_pandas.py

@@ -74,18 +74,113 @@ def test_repartition(self):

    def test_apply(self):


Rename this to test_transform_shard.

jason-dai · 2020-05-06T09:54:30Z

pyzoo/zoo/xshard/pandas/preprocessing.py

                else:
                    raise Exception("Unsupported file type")
                df_list.append(df)
        self.data = pd.concat(df_list)
        return 0

-    def apply(self, func, *args):
-        self.data = func(self.data, *args)
+    def transform_shard(self, func, *args, **kwargs):


change to transform

jason-dai · 2020-05-06T09:54:48Z

pyzoo/zoo/xshard/pandas/preprocessing.py

+        self.rows = rows
+        self.columns = columns
+
+    def transform_shard(self, func, *args, **kwargs):


change to transform

jason-dai · 2020-05-06T12:59:53Z

There are too many changes in this PR, which makes it difficult to review and verify. We should make sure each PR contains only a minimal feature set. One fundamental issue with the changes is how to maintain multiple DataShards on top of one common set of underlying Actors. For instance,

s = xshard.pandas.read_csv(...)
s1 = s['user_id']
s1.transform_shard(func)  # what is the expected input of func? will it also change s?

And

s = xshard.pandas.read_csv(...)
s1, s2 = train_test_split(s)
s.transform_shard(func)  # will it change s1 and s2?

I suggest we close this PR and open several smaller ones (e.g., fixing https://github.com/analytics-zoo/orca/issues/20 first).
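
The aliasing concern above can be reproduced without Ray. In the sketch below, `Store` is a hypothetical stand-in for the set of Actors that own the partitions and `Shards` is a hypothetical handle class; neither is taken from this PR. Two handles share one store, so an in-place transform through one handle is visible through the other.

```python
class Store:
    """Hypothetical stand-in for the actors that own the partitions."""

    def __init__(self, partitions):
        self.partitions = partitions


class Shards:
    """Hypothetical handle that points at a shared Store."""

    def __init__(self, store):
        self.store = store

    def transform_shard(self, func):
        # In-place update: every handle sharing this store sees the change.
        self.store.partitions = [func(p) for p in self.store.partitions]

    def collect(self):
        # Flatten all partitions into one list of rows.
        return [row for p in self.store.partitions for row in p]


s = Shards(Store([[1, 2], [3]]))
s1 = Shards(s.store)  # a "derived" handle that reuses the same store
s.transform_shard(lambda p: [x * 10 for x in p])
print(s1.collect())  # s1 reflects the change although only s was transformed
```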

jenniew · 2020-05-06T23:44:06Z

The current transform_shard changes the data in s in place. We need to change the implementation so that it returns a new DataFrame or Series, either by changing the Actor implementation or by using plain object IDs without Actors.
I have already created a PR, https://github.com/intel-analytics/analytics-zoo/pull/2305, for https://github.com/analytics-zoo/orca/issues/20.
Closing this PR.
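
One way to read the proposal above is that each transform returns a new handle and leaves its input untouched. The sketch below is only an illustration under that assumption: `ImmutableShards` is a hypothetical class, and tuples stand in for immutable stored partitions; it is not the implementation adopted in the follow-up work.

```python
from typing import Callable, List, Tuple

Partition = Tuple[int, ...]  # immutable stand-in for one stored partition


class ImmutableShards:
    """Hypothetical handle whose transforms never modify existing data."""

    def __init__(self, partitions: List[Partition]):
        self._partitions = tuple(partitions)

    def transform_shard(
        self, func: Callable[[Partition], Partition]
    ) -> "ImmutableShards":
        # Build a new handle from transformed partitions; self is unchanged.
        return ImmutableShards([func(p) for p in self._partitions])

    def collect(self) -> List[int]:
        # Flatten all partitions into one list of values.
        return [x for p in self._partitions for x in p]


s = ImmutableShards([(1, 2), (3,)])
s2 = s.transform_shard(lambda p: tuple(x * 10 for x in p))
print(s.collect(), s2.collect())
```

With this shape, handles derived from s (a selected column, a train/test split) keep their own view of the data regardless of later transforms on s.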

jenniew added 4 commits May 1, 2020 18:05

add min/max

1bf36e8

merge

0d83d60

add split, select, test

0b136fb

clean code

eef9a71

jason-dai reviewed May 6, 2020

jenniew closed this May 6, 2020
