GH-34216: [Python] Support for reading JSON Datasets With Python #34586

Merged
merged 21 commits into from
Apr 12, 2023

@R-JunmingChen R-JunmingChen commented Mar 16, 2023

This PR adds support for reading JSON datasets with Python. As mentioned in #34216, only reading is supported.

Please compare my implementation of _json.pyx and _json.pxd with _csv.pyx and _csv.pxd: _csv.pxd uses a pointer to the C++ class, and my implementation doesn't.

What changes are included in this PR?

C++: include file_json.h
Python: reference the C++ code and support reading JSON datasets

Are these changes tested?
Yes.
Six test cases were added in tests/test_dataset.py.

@R-JunmingChen R-JunmingChen changed the title GH-34216: [Python][C++][Docs] Support for reading JSON Datasets With Python GH-34216: [Python][C++]Support for reading JSON Datasets With Python Mar 17, 2023
@westonpace westonpace left a comment

Thanks for working on this! I have a few questions / suggestions. Also, we will need to add some unit tests for this feature.

python/pyarrow/_dataset.pyx (outdated)
python/pyarrow/_dataset.pyx (outdated)
python/pyarrow/_dataset.pyx
python/pyarrow/_json.pxd
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 21, 2023
@jorisvandenbossche jorisvandenbossche changed the title GH-34216: [Python][C++]Support for reading JSON Datasets With Python GH-34216: [Python] Support for reading JSON Datasets With Python Mar 21, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 22, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 24, 2023
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Mar 25, 2023
@github-actions github-actions bot added the awaiting change review Awaiting change review label Mar 25, 2023
@R-JunmingChen

> Thanks for working on this! I have a few questions / suggestions. Also, we will need to add some unit tests for this feature.

Hi, I have added some tests, which reference the CSV tests. Please review them when you are free.

@R-JunmingChen R-JunmingChen requested a review from westonpace April 4, 2023 14:04
@westonpace westonpace left a comment

These tests look good, thank you for adding them. I have one small suggestion to improve them.

python/pyarrow/tests/test_dataset.py
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting change review Awaiting change review labels Apr 10, 2023
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Apr 11, 2023
@github-actions github-actions bot removed the awaiting committer review Awaiting committer review label Apr 12, 2023
@westonpace westonpace left a comment

I've pushed some lint changes. Assuming CI passes I think this is good to go.

@github-actions github-actions bot added awaiting review Awaiting review awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 12, 2023
@westonpace westonpace merged commit 4963105 into apache:main Apr 12, 2023

ursabot commented Apr 15, 2023

Benchmark runs are scheduled for baseline = 0434ab6 and contender = 4963105. 4963105 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️11.99% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.55% ⬆️0.12%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 49631057 ec2-t3-xlarge-us-east-2
[Failed] 49631057 test-mac-arm
[Finished] 49631057 ursa-i9-9960x
[Finished] 49631057 ursa-thinkcentre-m75q
[Finished] 0434ab65 ec2-t3-xlarge-us-east-2
[Failed] 0434ab65 test-mac-arm
[Finished] 0434ab65 ursa-i9-9960x
[Finished] 0434ab65 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Apr 15, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
apache#34586)

* Closes: apache#34216

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
apache#34586)

rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
apache#34586)

Development

Successfully merging this pull request may close these issues.

[Python] Add bindings for JSON format in Dataset
3 participants