Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix #8904] Improved output of datafuson-cli with empty results #8947

Closed
wants to merge 1 commit into from

Conversation

Tangruilin
Copy link
Contributor

Which issue does this PR close?

Closes #8904 .

Rationale for this change

What changes are included in this PR?

Improved output of datafuson-cli with empty results

Are these changes tested?

There is alreay tests

Are there any user-facing changes?

@Tangruilin
Copy link
Contributor Author

image

@Tangruilin
Copy link
Contributor Author

@alamb maybe you can look at it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Tangruilin -- this is a good start

I am worried about this approach as it seems to require that a plan produce an empty RecordBatch when it makes now rows, but I don't think they all do so (some will continue to not make any)

What would you think about changing the display logic in datafusion-cli to detect the case when there have been no record batches printed and in that case add / print an empty batch?

Perhaps changing this code:

https://github.com/apache/arrow-datafusion/blob/795c71f3e8249847593a58dafe578ffcbcd3e012/datafusion-cli/src/exec.rs#L203-L259

@Tangruilin
Copy link
Contributor Author

Thank you @Tangruilin -- this is a good start

I am worried about this approach as it seems to require that a plan produce an empty RecordBatch when it makes now rows, but I don't think they all do so (some will continue to not make any)

What would you think about changing the display logic in datafusion-cli to detect the case when there have been no record batches printed and in that case add / print an empty batch?

Perhaps changing this code:

https://github.com/apache/arrow-datafusion/blob/795c71f3e8249847593a58dafe578ffcbcd3e012/datafusion-cli/src/exec.rs#L203-L259

but how can i get the rows that is selected if I change exec_and_print . Do you have some suggestion?

@berkaysynnada
Copy link
Contributor

I also believe we should not overfit when propagating empty batches during execution. In print_batches, there is a counter for printed rows. Perhaps we can make use of it.

@Tangruilin
Copy link
Contributor Author

I also believe we should not overfit when propagating empty batches during execution. In print_batches, there is a counter for printed rows. Perhaps we can make use of it.

could you show me the code.
It seems that I can not get counter while printing rows

@berkaysynnada
Copy link
Contributor

I also believe we should not overfit when propagating empty batches during execution. In print_batches, there is a counter for printed rows. Perhaps we can make use of it.

could you show me the code. It seems that I can not get counter while printing rows

https://github.com/apache/arrow-datafusion/blob/827668ab3d4659329ff30e6672f398c934b803be/datafusion-cli/src/print_options.rs#L108

We could perhaps insert some logic here, such that if no row has been printed, an empty batch is printed.

@Tangruilin
Copy link
Contributor Author

I also believe we should not overfit when propagating empty batches during execution. In print_batches, there is a counter for printed rows. Perhaps we can make use of it.

could you show me the code. It seems that I can not get counter while printing rows

https://github.com/apache/arrow-datafusion/blob/827668ab3d4659329ff30e6672f398c934b803be/datafusion-cli/src/print_options.rs#L108

We could perhaps insert some logic here, such that if no row has been printed, an empty batch is printed.

This is a problem that I can not get column data here.

From example:
when select * from (values (1)) where column1 = 0, I can not know the column name is column1 here.

@berkaysynnada
Copy link
Contributor

This is a problem that I can not get column data here.

From example: when select * from (values (1)) where column1 = 0, I can not know the column name is column1 here.

I understand the problem now. When we don't have a record batch, we cannot infer the header columns. Nothing comes to my mind yet, but I believe we can find a better way than propagating empty batches.

@alamb
Copy link
Contributor

alamb commented Jan 23, 2024

I understand the problem now. When we don't have a record batch,

What about using execute_stream instead of collect? The stream that is returned has the schema even if there are no record batches

You could the call https://docs.rs/datafusion/latest/datafusion/physical_plan/common/fn.collect.html if you still needed a Vec of batches

However, it would be great if we could figure out how to print record batches for every plan (even ones that are not "streaming") as they arrive rather than neeting to collect them all together - this would take a bigger refactoring (and need some finagling for the table output

@jonahgao
Copy link
Member

This is a problem that I can not get column data here.

From example: when select * from (values (1)) where column1 = 0, I can not know the column name is column1 here.

Perhaps we can retrieve the column names from the schema of the physical plan.
https://github.com/apache/arrow-datafusion/blob/31b9b48b08592b7d293f46e75707aad7dadd7cbc/datafusion-cli/src/exec.rs#L240
And then pass the schema to print_batches, using it when batches are empty.

let schema = physical_plan.schema();
print_options.print_batches(&results, &schema, now)?;

@Tangruilin
Copy link
Contributor Author

This is a problem that I can not get column data here.
From example: when select * from (values (1)) where column1 = 0, I can not know the column name is column1 here.

Perhaps we can retrieve the column names from the schema of the physical plan.

https://github.com/apache/arrow-datafusion/blob/31b9b48b08592b7d293f46e75707aad7dadd7cbc/datafusion-cli/src/exec.rs#L240

And then pass the schema to print_batches, using it when batches are empty.

let schema = physical_plan.schema();
print_options.print_batches(&results, &schema, now)?;

I may try the way this week

@alamb alamb marked this pull request as draft January 25, 2024 12:01
@alamb
Copy link
Contributor

alamb commented Jan 25, 2024

Marking as draft to note this PR is not waiting on feedback anymore -- please mark it ready for review when next ready.

Thanks again @Tangruilin

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

How is this PR coming along?

@Tangruilin
Copy link
Contributor Author

How is this PR coming along?

Yes. I will submit a PR on Sunday~~~

@Tangruilin
Copy link
Contributor Author

@alamb These days are Chinese New Year.

I will finish my issues after it.

@alamb
Copy link
Contributor

alamb commented Feb 11, 2024

@alamb These days are Chinese New Year.

I will finish my issues after it.

Thank you for the update and I hope your new year is everything you hope it will be 🎉

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 13, 2024
@github-actions github-actions bot closed this Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improved output of datafuson-cli with empty results
4 participants