feat(flink): add read_***() for Flink backend #7777
Conversation
Force-pushed from ca1ec35 to 76ee200
Let's avoid adding register()
Force-pushed from 76ee200 to 4a125d9
Thanks for the note, removed register()
Force-pushed from 4a125d9 to 8cf8ed5

Force-pushed from 8cf8ed5 to 601cd13
```python
        ir.Table
            The just-registered table
        """
        obj = self._get_dataframe_from_path(path)
```
Does Flink support natively loading directly from files? If so, I think we should use that.

If it doesn't, I'm not sure how I feel about automatically loading files using pandas and forwarding them that way. We do something similar for local in-memory backends like duckdb/pandas, where file paths are unambiguously local. But for a potentially distributed system like Flink, automatically using a local file reader may have unexpected behavior with path names, and may also be inefficient.

If we do decide to handle file reading manually using an in-memory reader here, we should use the readers provided by pyarrow directly, not pandas. These have built-in support for directory datasets, and better match the behavior of other backends.
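For illustration, a minimal sketch of what the pyarrow-based path could look like, assuming a locally readable path; the path and format here are placeholders, not anything from this PR:

```python
# Sketch only: reading with pyarrow instead of pandas.
# pyarrow.dataset accepts both a single file and a directory of files.
import pyarrow.dataset as ds

dataset = ds.dataset("/data/events", format="parquet")
arrow_table = dataset.to_table()  # materialize as a pyarrow.Table
```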
Thanks, I agree that reading the file while creating the table is not ideal. Flink actually supports creating tables with the filesystem connector. This, however, requires specifying the schema, which create_table() also requires:
```python
def create_table(
    self,
    name: str,
    obj: pd.DataFrame | pa.Table | ir.Table | None = None,
    *,
    schema: sch.Schema | None = None,
    database: str | None = None,
    catalog: str | None = None,
    tbl_properties: dict | None = None,
    watermark: Watermark | None = None,
    temp: bool = False,
    overwrite: bool = False,
) -> ir.Table:
    ...
    if obj is None and schema is None:
        raise exc.IbisError("`schema` or `obj` is required")
    ...
```
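To make the filesystem-connector route concrete, here is a hedged sketch (not this PR's actual code) of creating a file-backed table through `create_table()`, assuming the schema is already known. `con`, the table name, and the path are assumptions for illustration; the property keys are standard Flink connector options:

```python
# Sketch only: a file-backed table via Flink's filesystem connector.
# `con` is assumed to be an existing Flink backend connection.
import ibis

schema = ibis.schema({"id": "int64", "name": "string"})
t = con.create_table(
    "events",
    schema=schema,
    tbl_properties={
        "connector": "filesystem",
        "path": "file:///data/events",
        "format": "parquet",
    },
)
```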
So I think we have two options:

- Add a required `schema` argument to `read_***()`. This would deviate from the interface that exists for the other backends.
- "Read" the schema from the file with `pyarrow` and feed it into `create_table()` with the `filesystem` connector. This has the same issue you raised about accessing files in a distributed system.
Which one do you think makes more sense? We could also implement both: let the user specify the schema, and if it is not specified, construct the schema from the file, raising an error if the file cannot be accessed.
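A rough sketch of that hybrid behavior, under the assumption that a pyarrow schema can be converted to an ibis schema; `_read_parquet` is a hypothetical helper for illustration, not this PR's implementation:

```python
# Hypothetical helper sketching the hybrid approach: use the supplied
# schema if given, otherwise infer one locally with pyarrow and fail
# loudly when the file cannot be reached.
import ibis
import pyarrow.dataset as ds


def _read_parquet(con, path, table_name, schema=None):
    if schema is None:
        try:
            pa_schema = ds.dataset(path, format="parquet").schema
        except OSError as err:
            raise ValueError(
                f"cannot infer a schema from {path!r}; pass `schema` explicitly"
            ) from err
        # assumed conversion from a pyarrow schema to an ibis schema
        schema = ibis.Schema.from_pyarrow(pa_schema)
    return con.create_table(
        table_name,
        schema=schema,
        tbl_properties={
            "connector": "filesystem",
            "path": path,
            "format": "parquet",
        },
    )
```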
Force-pushed from 601cd13 to 0e69c18
@mfatihaktas Apologies again for the churn, can you reopen this PR against […]?
Reopened: #7908
Adds the following functions in `ibis/backends/flink/__init__.py`:

- `register()`
- `read_file()`
- `read_parquet()`
- `read_csv()`
- `read_json()`

Addition of these functions clears several test functions in `test_param.py` and `test_register.py`.

Edit: Removed `register()` per this comment.
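For context, a hedged usage sketch of the new readers; the exact signature is my assumption, modeled on the `read_parquet(path, table_name=None, **kwargs)` pattern used by the other backends, and the paths are placeholders:

```python
# Usage sketch; the Flink backend connects through a pyflink
# TableEnvironment rather than a URL.
import ibis
from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
con = ibis.flink.connect(table_env)

t = con.read_parquet("/data/events.parquet", table_name="events")
t.head()
```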