
Accessing Execution Metrics #6809

Closed
bubbajoe opened this issue Jun 30, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@bubbajoe

Is your feature request related to a problem or challenge?

https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/query-aws-s3.rs

In the above example, how would I get information like download/upload bytes, execution time (per partition), number of partitions, etc.?

I would like to track query metrics like these to measure quotas.

There doesn't seem to be a simple way of doing this, especially when using the higher-level `sql` function.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@bubbajoe bubbajoe added the enhancement New feature or request label Jun 30, 2023
@alamb
Contributor

alamb commented Jun 30, 2023

Have you looked at the output of EXPLAIN ANALYZE for your query? That will show you which metrics are available.

I don't think we have upload/download bytes, but there are a bunch of Parquet-level metrics, for example:

```
ParquetExec: file_groups={1 group: [[data.parquet]]}, projection=[<cols>], limit=1, metrics=[output_rows=1, elapsed_compute=1ns, bytes_scanned=4763, predicate_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0, file_open_errors=0, row_groups_pruned=0, file_scan_errors=0, num_predicate_creation_errors=0, time_elapsed_processing=7.350966ms, time_elapsed_opening=1.363919ms, pushdown_eval_time=2ns, time_elapsed_scanning_until_data=6.981077ms, page_index_eval_time=2ns, time_elapsed_scanning_total=6.981147ms]
```

We can probably add more
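As a rough sketch (not from the thread), running EXPLAIN ANALYZE through DataFusion's Rust SQL interface might look like the following; the table name `t` and the file `data.parquet` are placeholders:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Placeholder table/file; in practice this could also be an S3 path
    // via a registered object store, as in the query-aws-s3 example.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    // EXPLAIN ANALYZE executes the query and annotates each operator
    // in the printed plan with its recorded metrics.
    let df = ctx.sql("EXPLAIN ANALYZE SELECT * FROM t LIMIT 1").await?;
    df.show().await?;
    Ok(())
}
```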

@bubbajoe
Author

bubbajoe commented Jul 1, 2023

@alamb Thanks. Would bytes_scanned not be equal/close to the download amount?

How can I get this information using the Rust API?

@bubbajoe
Author

bubbajoe commented Jul 1, 2023

Additionally, I would like to run queries and return their results while recording the metrics. Is this currently possible?

@alamb
Contributor

alamb commented Jul 2, 2023

Would bytes_scanned not be equal/close to the download amount?

I think so

How can i get this information using the Rust API?

You can access the metrics using https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.metrics

You can walk the ExecutionPlan (either before/during/after execution) and find the relevant ParquetExecNode

Here is some code in IOx that walks these metrics and does something with them (in this case, converts them into "tracing spans" format). Perhaps that is helpful: https://github.com/influxdata/influxdb_iox/blob/4a1f8db2546d867c759f76ab2a2b2b7c8f3dac9c/iox_query/src/exec/query_tracing.rs#L93-L214
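A minimal sketch of that approach (assumptions: `print_metrics` is a hypothetical helper, not DataFusion API; `t`/`data.parquet` are placeholders; exact signatures may differ between DataFusion versions):

```rust
use std::sync::Arc;

use datafusion::physical_plan::{collect, displayable, ExecutionPlan};
use datafusion::prelude::*;

/// Hypothetical helper: recursively walk a physical plan and print the
/// metrics each operator recorded (if any).
fn print_metrics(plan: &Arc<dyn ExecutionPlan>, depth: usize) {
    let name = displayable(plan.as_ref()).one_line().to_string();
    match plan.metrics() {
        Some(metrics) => println!("{:indent$}{}: {}", "", name, metrics, indent = depth * 2),
        None => println!("{:indent$}{}: (no metrics)", "", name, indent = depth * 2),
    }
    for child in plan.children() {
        print_metrics(&child, depth + 1);
    }
}

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT count(*) FROM t").await?;
    // Build the physical plan explicitly so we keep a handle to it;
    // metrics are populated while the plan executes.
    let plan = df.create_physical_plan().await?;
    let _batches = collect(Arc::clone(&plan), ctx.task_ctx()).await?;
    print_metrics(&plan, 0);
    Ok(())
}
```

This also answers the earlier question about getting results and metrics from the same run: keep the `Arc<dyn ExecutionPlan>`, execute it with `collect`, use the returned batches as the query result, and then read the metrics off the retained plan.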

@alamb
Contributor

alamb commented Mar 1, 2024

I think #9415 covers a more general way to get distributed access.

I don't think this ticket is tracking anything actionable now, so closing. Please reopen if you disagree.

@alamb alamb closed this as completed Mar 1, 2024