
SqlToS3Operator not able to write data with partition_cols provided. #30382

Closed
2 tasks done
amolsr opened this issue Mar 31, 2023 · 8 comments · Fixed by #30460
Assignees
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@amolsr
Contributor

amolsr commented Mar 31, 2023

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

I am using the standard operator version which comes with apache/airflow:2.5.2.

Apache Airflow version

2.5.2

Operating System

Ubuntu 22.04.2 LTS

Deployment

Official Apache Airflow Helm Chart

Deployment details

I have used a simple Docker Compose setup and am using the same locally.

What happened

I am using SqlToS3Operator in my DAG and need the output partitioned by a column. The operator writes the data to a single temporary file, but when partition_cols is passed, pandas/pyarrow produce a directory tree, so the target path needs to be a directory. I am getting the error below.

[2023-03-31, 03:47:57 UTC] {sql_to_s3.py:175} INFO - Writing data to temp file
[2023-03-31, 03:47:57 UTC] {taskinstance.py:1775} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/transfers/sql_to_s3.py", line 176, in execute
    getattr(data_df, file_options.function)(tmp_file.name, **self.pd_kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 2685, in to_parquet
    **kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 423, in to_parquet
    **kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 190, in write
    **kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/pyarrow/parquet/__init__.py", line 3244, in write_to_dataset
    max_rows_per_group=row_group_size)
  File "/home/airflow/.local/lib/python3.7/site-packages/pyarrow/dataset.py", line 989, in write_dataset
    min_rows_per_group, max_rows_per_group, create_dir
  File "pyarrow/_dataset.pyx", line 2775, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
NotADirectoryError: [Errno 20] Cannot create directory '/tmp/tmp3z4dpv_p.parquet/application_createdAt=2020-06-05 11:47:44.000000000'. Detail: [errno 20] Not a directory
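
The same failure can be reproduced with pandas/pyarrow alone, outside Airflow. A minimal standalone sketch (the sample data here is made up for illustration):

    import tempfile

    import pandas as pd

    df = pd.DataFrame(
        {"application_createdAt": ["2020-06-05", "2020-06-06"], "value": [1, 2]}
    )

    # What the operator effectively does: NamedTemporaryFile yields a *file* path,
    # and pyarrow then tries to create partition subdirectories inside that file.
    try:
        with tempfile.NamedTemporaryFile(suffix=".parquet") as tmp_file:
            df.to_parquet(tmp_file.name, partition_cols=["application_createdAt"])
    except NotADirectoryError as exc:
        print(f"fails as in the traceback above: {exc}")

    # Partitioned output is a directory tree, so a directory path works:
    with tempfile.TemporaryDirectory() as tmp_dir:
        df.to_parquet(tmp_dir, partition_cols=["application_createdAt"])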

What you think should happen instead

The operator should support partition_cols as well, for example by writing to a temporary directory when partitioning is requested (see the sketch below).
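
One possible direction, as a hedged sketch only (not necessarily what the linked PR #30460 implements): when partition_cols is requested, write to a temporary directory instead of a temporary file, then upload every file in the resulting tree. The helper name upload_partitioned_parquet below is hypothetical:

    import os
    import tempfile

    import pandas as pd

    from airflow.providers.amazon.aws.hooks.s3 import S3Hook


    def upload_partitioned_parquet(
        df: pd.DataFrame, s3_hook: S3Hook, bucket: str, key_prefix: str, **pd_kwargs
    ) -> None:
        """Write df as a partitioned parquet tree and upload each file to S3."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            # pandas/pyarrow produce a tree like application_createdAt=.../part-0.parquet
            df.to_parquet(tmp_dir, **pd_kwargs)
            for root, _, files in os.walk(tmp_dir):
                for name in files:
                    local_path = os.path.join(root, name)
                    rel_path = os.path.relpath(local_path, tmp_dir)
                    # Preserve the partition path under the key prefix.
                    s3_hook.load_file(
                        filename=local_path,
                        key=f"{key_prefix}/{rel_path}",
                        bucket_name=bucket,
                        replace=True,
                    )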

How to reproduce

I am using the snippet below to reproduce the issue.

    from airflow.models import Variable
    from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator

    sql_to_s3_task = SqlToS3Operator(
        task_id="sql_to_s3_task",
        sql_conn_id="mysql_con",
        query=sql,  # any SELECT returning an application_createdAt column
        s3_bucket=Variable.get("AWS_S3_BUCKET"),
        aws_conn_id="aws_con",
        file_format="parquet",
        s3_key="Fact_applications",
        pd_kwargs={
            "partition_cols": ["application_createdAt"],
        },
        replace=True,
    )

Running this task against any MySQL connection should reproduce the error above.

Anything else

I believe this logic in SqlToS3Operator.execute, which writes the DataFrame to a single temporary file, should be updated to handle partitioned output.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@amolsr amolsr added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Mar 31, 2023
@boring-cyborg

boring-cyborg bot commented Mar 31, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@kaxil
Member

kaxil commented Apr 1, 2023

@utkarsharma2 Want to take this one? :)

@hussein-awala hussein-awala added provider:amazon-aws AWS/Amazon - related issues good first issue and removed area:providers needs-triage label for new issues that we didn't triage yet labels Apr 2, 2023
@hussein-awala
Member

@amolsr Feel free to open a PR to fix it

@utkarsharma2
Contributor

@amolsr Would you like to pair on this one?

@amolsr
Contributor Author

amolsr commented Apr 3, 2023

Yeah sure.

@PasunuriSrinidhi

I would like to work on this issue, can you please assign it to me?

To fix this error, I want to try the following steps (a quick sketch of checks 1-3 is below):

1. Check if the /tmp directory exists and is writable.
2. Check if the tmp3z4dpv_p.parquet directory exists in /tmp. If it exists, ensure that it is a directory and is writable. If it does not exist, create it with the necessary permissions.
3. Check if the application_createdAt=2020-06-05 11:47:44.000000000 directory exists in tmp3z4dpv_p.parquet. If it exists, ensure that it is a directory and is writable. If it does not exist, create it with the necessary permissions.
4. If the above steps do not work, try changing the destination directory for the output file to a different location and see if the error persists.
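
    import os

    # Illustrative checks for steps 1-3; the temp path is taken from the
    # traceback above and will differ on each run.
    for path in ("/tmp", "/tmp/tmp3z4dpv_p.parquet"):
        print(
            path,
            "exists:", os.path.exists(path),
            "is dir:", os.path.isdir(path),
            "writable:", os.access(path, os.W_OK),
        )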

@hussein-awala
Member

@PasunuriSrinidhi there is an open (and active) PR linked to this issue; if you want to help close it, you can review and test PR #30460.

@PasunuriSrinidhi

@hussein-awala I will review and test the created PR.
