Fix BaseSQLToGCSOperator approx_max_file_size_bytes #25469
When using the parquet file_format, `tmp_file_handle.tell()` always points to the beginning of the file after the data has been saved, and is therefore not a good indicator of the file's current size.
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Nice.
Tests are failing :(
I'll have a look at it.
Save the current file pointer position, seek to the end of the file (`os.SEEK_END`), set file_size to the resulting position, and then move the file pointer back to the saved position. Currently, after a parquet write operation the pointer is at 0, so simply calling `tmp_file_handle.tell()` is not sufficient to determine the current size. This sequence is added to allow file splitting when the export format is set to parquet.
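The sequence described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the helper name `current_file_size` is hypothetical, and the `seek(0)` call only mimics a writer that leaves the pointer at the start of the file.

```python
import os
import tempfile


def current_file_size(file_handle):
    """Return the size of an open file and leave its pointer where it was.

    Hypothetical helper, shown for illustration only.
    """
    saved_position = file_handle.tell()             # remember the current pointer
    file_handle.seek(0, os.SEEK_END)                # jump to the end of the file
    size = file_handle.tell()                       # offset of the end == size in bytes
    file_handle.seek(saved_position, os.SEEK_SET)   # restore the saved pointer
    return size


with tempfile.TemporaryFile() as tmp_file_handle:
    tmp_file_handle.write(b"x" * 1024)
    tmp_file_handle.seek(0)                         # mimic a writer that rewinds to 0
    pointer_before = tmp_file_handle.tell()
    measured_size = current_file_size(tmp_file_handle)
    pointer_after = tmp_file_handle.tell()
```

Here `tell()` alone would report 0 even though the file is not empty, while the seek-based measurement reports the real size and leaves the pointer untouched.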
The tests were failing because, when writing bytes into a file, Python buffers the data until it is flushed. As a quick sanity test you can run:

    import os

    with open('test', 'wb') as f:
        f.write(b'hello')
        print(f.tell())                 # logical position, counts buffered bytes
        print(os.stat(f.name).st_size)  # on-disk size, before any flush
        f.seek(0, os.SEEK_END)
        print(f.tell())
        f.flush()                       # push buffered bytes to the OS
        print(os.stat(f.name).st_size)  # on-disk size after the flush
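For a test that asserts on the size seen by the operating system, one option is to flush before measuring. The snippet below is only a sketch of that idea under the assumption that the test reads the size via `os.fstat`; the variable names are illustrative and not taken from the Airflow test suite.

```python
import os
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"hello")
    f.flush()                                       # push buffered bytes to the OS
    size_after_flush = os.fstat(f.fileno()).st_size  # size as seen by the OS
    logical_position = f.tell()                     # position reported by the handle
```

After the flush, the size reported by the operating system and the position reported by the file handle agree.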
Yeah. Tests are useful it seems :)
closes: #25313