
df.to_parquet doesn't create partitions #1666

Closed
FredericJames opened this issue Jul 20, 2020 · 3 comments · Fixed by #1667
Labels
bug Something isn't working

Comments

@FredericJames

My environment: Databricks platform, runtime 7.0 ML
Koalas: 1.0.1

I'm trying to write Parquet files from a Koalas DataFrame to S3 with partitions, but the partitions are not created (I tried with both a single partition column and multiple columns).
If I use the PySpark API instead, the partitions are created.

Code:
No partitions created:
df.to_parquet(path='s3://{bucket}/{Path_to_data}', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])

Partitions created:
df.to_spark().write.mode('overwrite').partitionBy('year', 'month', 'day').parquet('s3://{bucket}/{Path_to_data}')
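
For reference, a successful partitioned write like the Spark call above produces Hive-style key=value directories under the target path (the date values and file name here are only illustrative):

s3://{bucket}/{Path_to_data}/year=2020/month=7/day=20/part-00000-<id>.gz.parquet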

I would like to be able to use partitions in Koalas the same way I do in Spark.

@ueshin ueshin added the bug Something isn't working label Jul 20, 2020
ueshin (Collaborator) commented Jul 20, 2020

Seems like it's a bug.
Could you use DataFrame.spark.to_spark_io as a workaround?

df.spark.to_spark_io(path='s3://{bucket}/{Path_to_data}', format='parquet', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])
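
As a quick sanity check, reading the data back and filtering on a partition column should prune to the matching year=... directories (a sketch; assumes the write above succeeded and the placeholder path from the report):

import databricks.koalas as ks
kdf = ks.read_parquet('s3://{bucket}/{Path_to_data}')
kdf[kdf['year'] == 2020].head()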

I'll work on fixing this.

ueshin (Collaborator) commented Jul 21, 2020

I submitted PR #1667.

FredericJames (Author) commented

Thank you, I'll wait for the fix. It's mainly for my data scientists, and I'd like to keep things simple for them.
I'll use the PySpark API until then.

HyukjinKwon pushed a commit that referenced this issue Jul 21, 2020
Refine Spark I/O to:

- Set `partitionBy` explicitly in `to_parquet`.
- Add `mode` and `partition_cols` to `to_csv` and `to_json`.
- Fix type hints to use `Optional`.

Resolves #1666.
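
For reference, once #1667 lands, the failing call from the report should work as written, and per the commit message to_csv and to_json gain the same mode and partition_cols keywords (a sketch; the path and column names are the placeholders from the report):

df.to_parquet(path='s3://{bucket}/{Path_to_data}', mode='overwrite', compression='gzip', partition_cols=['year', 'month', 'day'])
df.to_csv(path='s3://{bucket}/{Path_to_data}', mode='overwrite', partition_cols=['year', 'month', 'day'])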