Complete NumPy universal functions compat for DataFrames #1127

Merged: 1 commit, Dec 16, 2019

HyukjinKwon
Member

This PR proposes to complete support for NumPy's universal functions (ufuncs) on Koalas DataFrames.

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
         id
0       NaN
1  0.000000
2  0.693147
3  1.098612
4  1.386294
5  1.609438
6  1.791759
7  1.945910
8  2.079442
9  2.197225
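
NumPy lets non-array objects intercept calls such as `np.log(obj)` through the `__array_ufunc__` protocol. The sketch below illustrates that general mechanism with a toy container. `ToyFrame`, `UNARY_FUNCS`, and `BINARY_FUNCS` are hypothetical names made up for this illustration; this is not the code from this PR, and the real implementation may be structured quite differently.

```python
import math
import operator

import numpy as np

# Hypothetical lookup tables: ufunc name -> per-element implementation.
UNARY_FUNCS = {"log": math.log, "sqrt": math.sqrt}
BINARY_FUNCS = {"add": operator.add, "right_shift": operator.rshift}


class ToyFrame:
    """A toy column container (illustration only, not a Koalas DataFrame)."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls like np.log(frame) are handled in this sketch.
        if method != "__call__" or kwargs:
            return NotImplemented
        name = ufunc.__name__
        if len(inputs) == 1 and name in UNARY_FUNCS:
            func = UNARY_FUNCS[name]
            return ToyFrame(
                {col: [func(v) for v in vals] for col, vals in self.columns.items()}
            )
        if len(inputs) == 2 and name in BINARY_FUNCS:
            func = BINARY_FUNCS[name]
            left, right = inputs
            result = {}
            for col, vals in self.columns.items():
                # Each operand is either another ToyFrame or a scalar to repeat.
                lhs = left.columns[col] if isinstance(left, ToyFrame) else [left] * len(vals)
                rhs = right.columns[col] if isinstance(right, ToyFrame) else [right] * len(vals)
                result[col] = [func(a, b) for a, b in zip(lhs, rhs)]
            return ToyFrame(result)
        # Anything else is reported as unsupported, so NumPy raises TypeError.
        return NotImplemented


frame = ToyFrame({"id": [1.0, 2.0, 3.0]})
logged = np.log(frame)                                  # dispatched to UNARY_FUNCS["log"]
shifted = np.right_shift(ToyFrame({"a": [8, -8]}), 1)   # dispatched to BINARY_FUNCS
print(logged.columns, shifted.columns)
```

Keying the lookup on the ufunc's name keeps the dispatch table easy to extend, at the cost of silently depending on NumPy's naming.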

@HyukjinKwon HyukjinKwon force-pushed the numpy-compay-frame branch 3 times, most recently from 742e285 to 67cd75c Compare December 13, 2019 04:42
codecov-io commented Dec 13, 2019

Codecov Report

Merging #1127 into master will increase coverage by <.01%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master    #1127      +/-   ##
==========================================
+ Coverage   95.15%   95.15%   +<.01%     
==========================================
  Files          35       35              
  Lines        7017     7039      +22     
==========================================
+ Hits         6677     6698      +21     
- Misses        340      341       +1
Impacted Files Coverage Δ
databricks/koalas/frame.py 96.81% <95.65%> (-0.02%) ⬇️

Last update eb763ea...1737eb0.

# Test only top 5 for now. 'compute.ops_on_diff_frames' option increases too much time.
try:
    set_option('compute.ops_on_diff_frames', True)
    for np_name, spark_func in list(binary_np_spark_mappings.items())[:5]:
Member Author

I checked that all the tests pass for the complete list.
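
The test excerpt above is truncated, so its cleanup step is not visible here. The following is only a sketch of the general pattern of enabling an option for a block and restoring it afterwards, while exercising just the first five entries of a mapping. The in-memory option store, `temporary_option`, and the contents of `binary_mappings` are all hypothetical stand-ins, not the Koalas API or the PR's actual mapping.

```python
from contextlib import contextmanager

# Hypothetical in-memory option store standing in for a real configuration API.
_OPTIONS = {"compute.ops_on_diff_frames": False}


def set_option(key, value):
    _OPTIONS[key] = value


def get_option(key):
    return _OPTIONS[key]


@contextmanager
def temporary_option(key, value):
    """Set an option for the duration of a block, then restore the old value."""
    previous = get_option(key)
    set_option(key, value)
    try:
        yield
    finally:
        # Runs even if the block raises, so later tests see the original setting.
        set_option(key, previous)


# Hypothetical mapping of ufunc names to implementations (values unused here).
binary_mappings = {
    name: None
    for name in ["add", "subtract", "multiply", "divide", "power", "mod", "maximum"]
}

tested = []
with temporary_option("compute.ops_on_diff_frames", True):
    # Dicts preserve insertion order, so this takes the first five entries.
    for np_name, _func in list(binary_mappings.items())[:5]:
        tested.append(np_name)

print(tested, get_option("compute.ops_on_diff_frames"))
```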

Collaborator

Hmm, in my local environment, the right_shift function always fails.

>>> pdf
     a   b
0   65  83
1   50 -14
2  -95  35
3   98  97
4  -19  52
5  -39  53
6  -60  45
7   37  51
8  -25   7
9  -11  24
10  34   8
11 -53   8
12  -6 -12
13 -40 -71
14 -14 -36
15 -62 -98
16  93 -38
17 -81  56
18  45  27
19  84  97
20  38  46
21 -30 -76
22 -29 -24
23 -69 -67
>>> np.right_shift(pdf, pdf)
    a  b
0   0  0
1   0  0
2   0  0
3   0  0
4   0  0
5   0  0
6   0  0
7   0  0
8   0  0
9   0  0
10  0  0
11  0  0
12  0  0
13  0  0
14  0  0
15  0  0
16  0  0
17  0  0
18  0  0
19  0  0
20  0  0
21  0  0
22  0  0
23  0  0

whereas

>>> np.right_shift(kdf, kdf)
     a  b
0   32  0
1    0 -1
2   -1  0
3    0  0
4   -1  0
5   -1  0
6   -4  0
7    0  0
8   -1  0
9   -1  0
10   0  0
11  -1  0
12  -1 -1
13  -1 -1
14  -1 -1
15 -16 -1
16   0 -1
17  -1  0
18   0  0
19   0  0
20   0  0
21  -1 -1
22  -1 -1
23  -1 -1

Collaborator

np.right_shift(kdf, 1)

seems fine.

Member Author

Hm, they are the same in my local environment:

>>> np.right_shift(pdf, pdf)
     a   b
0    0   0
1    0   0
2   -1   0
3    0  -1
4   -1   0
5    0   0
6   -1  -4
7   -1  -1
8    0  -1
9   16  -1
10  16  -1
11   0  -1
12  -1  -1
13  -1  -1
14   0  -2
15   1  -1
16  -1   0
17   0  -1
18   0  -1
19   0  -1
20   0  -1
21   0   0
22  -1   0
23   0   0
24  -1  -1
25   0   0
26  -1   0
27  -1   0
28   0  -1
29  -8   0
30   0  -1
31   0  -1
32  -1   0
33   0   0
34   0   0
35  -1   0
36  -1  -1
37   0  -1
38  -1  -1
39   0  -2
40  -1   0
41   0   0
42   0   0
43   0  -1
44   0   0
45  -1   0
46  -1  -1
47   0  -1
48   0  -1
49   0  -1
50  -1  -1
51   0 -32
52   0   0
>>> np.right_shift(kdf, kdf)
     a   b
0    0   0
1    0   0
2   -1   0
3    0  -1
4   -1   0
5    0   0
6   -1  -4
7   -1  -1
8    0  -1
9   16  -1
10  16  -1
11   0  -1
12  -1  -1
13  -1  -1
14   0  -2
15   1  -1
16  -1   0
17   0  -1
18   0  -1
19   0  -1
20   0  -1
21   0   0
22  -1   0
23   0   0
24  -1  -1
25   0   0
26  -1   0
27  -1   0
28   0  -1
29  -8   0
30   0  -1
31   0  -1
32  -1   0
33   0   0
34   0   0
35  -1   0
36  -1  -1
37   0  -1
38  -1  -1
39   0  -2
40  -1   0
41   0   0
42   0   0
43   0  -1
44   0   0
45  -1   0
46  -1  -1
47   0  -1
48   0  -1
49   0  -1
50  -1  -1
51   0 -32
52   0   0

Member Author

Seems versions matter ... (?)
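
One observation that may help the follow-up investigation: in the 24-row example above, every frame is shifted by itself, so the shift counts are negative or at least 64 in many rows. The Koalas values shown there are consistent with a 64-bit arithmetic shift that uses only the low 6 bits of the shift count (for example `65 >> 65` giving `32`). Results for such out-of-range counts may differ between environments. The sketch below reproduces a few of those values in plain Python; `masked_shift_right` is a hypothetical helper, and this is a reading of the numbers, not a confirmed diagnosis.

```python
def masked_shift_right(value: int, shift: int) -> int:
    """Arithmetic right shift that keeps only the low 6 bits of the shift count.

    Illustrative assumption: this mimics a 64-bit shift whose count is taken
    modulo 64. It is not claimed to be what NumPy, pandas, or Koalas does.
    """
    return value >> (shift & 63)


# (value, shift) pairs taken from the 24-row example above, where each
# column is shifted by itself.
pairs = [(65, 65), (-14, -14), (-60, -60), (-62, -62), (98, 98)]
for value, shift in pairs:
    print(value, shift & 63, masked_shift_right(value, shift))
```

A shift count that is already in range, such as `np.right_shift(kdf, 1)`, is unaffected by the masking, which fits the report that the scalar case looks fine.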

Member Author

I will investigate it separately in another PR.

@HyukjinKwon HyukjinKwon changed the title from "Complete NumPy universal functions for DataFrames" to "Complete NumPy universal functions compat for DataFrames" on Dec 13, 2019
@softagram-bot

Softagram Impact Report for pull/1127 (head commit: 1737eb0)

⚠️ Copy paste found

ℹ️ test_numpy_compat.py: Copy paste fragment on line 50 shared with ../test_dataframe.py, ../test_indexes.py:

    def pdf(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])...(truncated 105 chars)

ℹ️ test_numpy_compat.py: Copy paste fragment inside the same file on lines 85, 131:

        # Use randomly generated dataFrame
        pdf = pd.DataFrame(
            np.random.randint(-100, 100, size=(np.random.randint(100), 2)), columns=['...(truncated 498 chars)

ℹ️ frame.py: Copy paste fragment on line 5771 shared with ../namespace.py:

              on: Union[str, List[str], Tuple[str, ...], List[Tuple[str, ...]]] = None,
              left_on: Union[str, List[str], Tuple[s...(truncated 273 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 7269, 7352:


        # TODO: there is a similar logic to transpose in, for instance,
        #  DataFrame.any, Series.quantile. Maybe ...(truncated 1065 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 4900, 4921:

            sdf = self._sdf.select(
                self._internal.index_scols +
                [self._internal.scol_for(idx...(truncated 466 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

@HyukjinKwon HyukjinKwon merged commit 9343a1d into databricks:master Dec 16, 2019
@HyukjinKwon HyukjinKwon deleted the numpy-compay-frame branch September 11, 2020 07:52
4 participants