
df.to_parquet doesn't create partitions #1666

Closed
FredericJames opened this issue Jul 20, 2020 · 3 comments · Fixed by #1667
Labels
bug Something isn't working

Comments

@FredericJames

My environment: Databricks platform, runtime 7.0 ML
Koalas: 1.0.1

I'm trying to write Parquet files from a Koalas DataFrame to S3 with partitions, but the partitions are not created (I tried with both a single partition column and multiple columns).
If I use the PySpark API instead, the partitions are created.

Code:
No partitions created:
df.to_parquet(path='s3://{bucket}/{Path_to_data}', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])

Partitions created:
df.to_spark().write.mode('overwrite').partitionBy('year', 'month', 'day').parquet('s3://{bucket}/{Path_to_data}')
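
For reference, a successful partitioned write like the Spark call above produces Hive-style key=value directories under the target path (the date values and file name here are only illustrative):

s3://{bucket}/{Path_to_data}/year=2020/month=7/day=20/part-00000-<id>.gz.parquet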

I would like to be able to use partitions in Koalas the same way I do in Spark.

@ueshin ueshin added the bug Something isn't working label Jul 20, 2020
ueshin (Collaborator) commented Jul 20, 2020

Seems like it's a bug.
Could you use DataFrame.spark.to_spark_io as a workaround?

df.spark.to_spark_io(path='s3://{bucket}/{Path_to_data}', format='parquet', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])
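
As a quick sanity check, reading the data back and filtering on a partition column should prune to the matching year=... directories (a sketch; assumes the write above succeeded and the placeholder path from the report):

import databricks.koalas as ks
kdf = ks.read_parquet('s3://{bucket}/{Path_to_data}')
kdf[kdf['year'] == 2020].head()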

I'll work on fixing this.

ueshin (Collaborator) commented Jul 21, 2020

I submitted PR #1667.

FredericJames (Author) commented

Thank you, I'll wait for the fix. It's mainly for my data scientists, and I'd like to keep things simple for them.
I'll use the PySpark API until then.

HyukjinKwon pushed a commit that referenced this issue Jul 21, 2020
Refine Spark I/O to:

- Set `partitionBy` explicitly in `to_parquet`.
- Add `mode` and `partition_cols` to `to_csv` and `to_json`.
- Fix type hints to use `Optional`.

Resolves #1666.
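
For reference, once #1667 lands, the failing call from the report should work as written, and per the commit message to_csv and to_json gain the same mode and partition_cols keywords (a sketch; the path and column names are the placeholders from the report):

df.to_parquet(path='s3://{bucket}/{Path_to_data}', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])
df.to_csv(path='s3://{bucket}/{Path_to_data}', mode='overwrite', partition_cols=['year', 'month', 'day'])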