Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BaseSQLToGCSOperator approx_max_file_size_bytes #25469

Conversation

dclandau
Copy link
Contributor

@dclandau dclandau commented Aug 2, 2022

When using the parquet file_format, using tmp_file_handle.tell()
always points to the beginning of the file after the data has been saved
and therefore is not a good indicator for the files current size.

closes: #25313

When using the parquet file_format, using `tmp_file_handle.tell()`
always points to the beginning of the file after the data has been saved
and therefore is not a good indicator for the files current size.
@dclandau dclandau requested a review from turbaszek as a code owner August 2, 2022 12:37
@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Aug 2, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Aug 2, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented Aug 2, 2022

Nice.

@eladkal
Copy link
Contributor

eladkal commented Aug 2, 2022

Tests are failing :(

@dclandau
Copy link
Contributor Author

dclandau commented Aug 2, 2022

I'll have a look at it.

Save the current file pointer position and set the file pointer position
to `os.SEEK_END`. file_size is set to the new position, and the file
pointer's position goes back to the saved position.

Currently, after a parquet write operation the pointer is set to 0,
and therefore, simply executing `tmp_file_handle.tell()` is not
sufficient to determine the current size. This sequence is added to
allow file splitting when the export format is set to parquet.
@dclandau
Copy link
Contributor Author

dclandau commented Aug 2, 2022

The tests were failing because when writing bytes into the file, python buffers the data until it is flushed via the method fp.flush().
Because the file was not being flushed prior to the if statement, os.stat(fp.name).st_size was only indicating the size of the file as is in the system, without including the bytes still to be flushed.

As a quick sanity test you can run:

with open('test', 'wb') as f:
    f.write(b'hello')
    print(f.tell())
    import os
    print(os.stat(f.name).st_size)
    f.seek(0, os.SEEK_END)
    print(f.tell())
    f.flush()
    print(os.stat(f.name).st_size)

@potiuk
Copy link
Member

potiuk commented Aug 3, 2022

Yeah. Tests are useful it seems :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BaseSQLToGCSOperator parquet export format not limiting file size bug
3 participants