[WIP][DO-NOT-MERGE] Reference PR for flatMapGroupsWithState in PySpark #37863

HeartSaVioR
Contributor

DO-NOT-MERGE! This PR may need to be split into multiple PRs of reviewable size. Use it only for reference.

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

HeartSaVioR added a commit that referenced this pull request Sep 14, 2022
…PythonArrowOutput

### What changes were proposed in this pull request?

This PR proposes to make PythonArrowInput and PythonArrowOutput more generic so that they cover complex data types on both input and output. This is groundwork for #37863.

### Why are the changes needed?

The traits PythonArrowInput and PythonArrowOutput can be further generalized to cover complex data types on both input and output. For example, not all operators pass a simple InternalRow as input data to the Python worker, and the same holds for output data.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #37864 from HeartSaVioR/SPARK-40414.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
HeartSaVioR force-pushed the WIP-applyinpandaswithstate-jungtaek-pipeline-binpack branch from ce85c20 to d22d7db on September 14, 2022 05:52
HeartSaVioR force-pushed the WIP-applyinpandaswithstate-jungtaek-pipeline-binpack branch from de9636d to e60408f on September 14, 2022 05:56
HeartSaVioR added a commit that referenced this pull request Sep 15, 2022
…ickled PySpark Row to JVM Row

### What changes were proposed in this pull request?

This PR adds toJVMRow in PythonSQLUtils to convert a pickled PySpark Row to a JVM Row.

Co-authored with HyukjinKwon.

This is a breakdown PR of #37863.

### Why are the changes needed?

This change will be leveraged in [SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A. End-to-end test suites will be added under [SPARK-40431](https://issues.apache.org/jira/browse/SPARK-40431).

Closes #37891 from HeartSaVioR/SPARK-40433.

Lead-authored-by: Jungtaek Lim <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
HeartSaVioR added a commit that referenced this pull request Sep 15, 2022
…out in PySpark

### What changes were proposed in this pull request?

This PR introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the Scala codebase to support convenient conversion between the PySpark and Scala implementations.
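The per-key state semantics that a group state handle provides can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyGroupState`, `count_events`, and `run_batch` are hypothetical names invented for this sketch, and none of them is the GroupStateImpl added by this PR or any PySpark API. The sketch keeps one state object per key across micro-batches and lets a user function read and update it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class ToyGroupState:
    """Hypothetical stand-in for a per-key state handle (not the PySpark class)."""

    _value: Optional[int] = None

    @property
    def exists(self) -> bool:
        # True once a value has been stored and not removed.
        return self._value is not None

    def get(self) -> int:
        if self._value is None:
            raise ValueError("state is not set for this key")
        return self._value

    def update(self, value: int) -> None:
        self._value = value

    def remove(self) -> None:
        self._value = None


def count_events(
    key: str, values: Iterable[str], state: ToyGroupState
) -> Iterator[Tuple[str, int]]:
    """User function: add this batch's row count to the running count for `key`."""
    running = state.get() if state.exists else 0
    running += sum(1 for _ in values)
    state.update(running)
    yield key, running


def run_batch(
    batch: List[Tuple[str, str]], states: Dict[str, ToyGroupState]
) -> List[Tuple[str, int]]:
    """Group one micro-batch by key and call the user function once per key."""
    grouped: Dict[str, List[str]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    out: List[Tuple[str, int]] = []
    for key, values in grouped.items():
        # Reuse the state stored for this key, or create an empty one.
        state = states.setdefault(key, ToyGroupState())
        out.extend(count_events(key, values, state))
    return out


# State survives between batches because `states` is shared across calls.
states: Dict[str, ToyGroupState] = {}
first = run_batch([("a", "x"), ("b", "y"), ("a", "z")], states)
second = run_batch([("a", "w")], states)
```

The second batch continues the count for key `a` from where the first batch left off, which is the behavior a stateful grouped operator needs. Timeouts are deliberately left out of the sketch.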

Co-authored with HyukjinKwon.

This is a breakdown PR of #37863.

### Why are the changes needed?

This change will be leveraged in SPARK-40434.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A. End-to-end test suites will be added under SPARK-40431.

Closes #37889 from HeartSaVioR/SPARK-40432.

Lead-authored-by: Jungtaek Lim <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…ickled PySpark Row to JVM Row

### What changes were proposed in this pull request?

This PR adds toJVMRow in PythonSQLUtils to convert a pickled PySpark Row to a JVM Row.

Co-authored with HyukjinKwon.

This is a breakdown PR of apache#37863.

### Why are the changes needed?

This change will be leveraged in [SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A. End-to-end test suites will be added under [SPARK-40431](https://issues.apache.org/jira/browse/SPARK-40431).

Closes apache#37891 from HeartSaVioR/SPARK-40433.

Lead-authored-by: Jungtaek Lim <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…out in PySpark

### What changes were proposed in this pull request?

This PR introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the Scala codebase to support convenient conversion between the PySpark and Scala implementations.

Co-authored with HyukjinKwon.

This is a breakdown PR of apache#37863.

### Why are the changes needed?

This change will be leveraged in SPARK-40434.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A. End-to-end test suites will be added under SPARK-40431.

Closes apache#37889 from HeartSaVioR/SPARK-40432.

Lead-authored-by: Jungtaek Lim <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
Contributor Author

HeartSaVioR commented Sep 22, 2022

Closing this, since the test suite PR is now the only one remaining.

dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2023
### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data ([GH-37863](apache/arrow#37863)).

The full release notes are as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>