Poor reported performance of DataFusion against DuckDB and Hyper #5942
There was a question on Slack asking if the Python bindings enable parallel scans. The Python SessionContext constructor is here. It uses https://github.com/apache/arrow-datafusion-python/blob/main/src/context.rs#L216-L233
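A minimal sketch of checking the relevant setting from the Python bindings, assuming the datafusion package; the `SHOW` statement requires the information schema to be enabled:

```python
# Sketch: inspect the setting that governs scan parallelism from Python.
from datafusion import SessionConfig, SessionContext

config = SessionConfig().with_information_schema(True)
ctx = SessionContext(config)
ctx.sql("SHOW datafusion.execution.target_partitions").show()
```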
Looking at https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/19.0.0.md, it should have been enabled since 19.0.0, while the article mentions it is using 20.0.0.
FYI @djfrancesco we are looking into your article.
@Dandandan thanks for your interest. I just created a public repo with the code used for the post: https://github.com/aetperf/sql_olap_bench. The benchmark is run with `python tpch_bench.py -d path_to_the_folder_tpch_100`. It uses a function called
No worries -- clearly if you can't get good performance out of DataFusion, we have some things to improve (even if it is only UX). Thank you for the article.
Does anybody have a preliminary idea why the figures look this bad? Is it mostly config related, an algorithmic issue, or code structure/performance related?
Related older issue (in the sense that trying DataFusion with defaults yields questionable results): #5141
Looks like the blog is broken; I cannot open it now.
I took a quick look at the code, and have two observations:
Here is an example of configuring a context in Python:

```python
from datafusion import RuntimeConfig, SessionConfig, SessionContext

runtime = (
    RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(10000000)
)
config = (
    SessionConfig()
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(1)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)
ctx = SessionContext(config, runtime)
```
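A hedged sketch of using such a context; the table name and file path below are placeholders, not taken from the benchmark:

```python
# Hypothetical usage of the context configured above.
ctx.register_parquet("lineitem", "/path/to/tpch/lineitem.parquet")
ctx.sql("SELECT count(*) FROM lineitem").show()
```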
I created a quick PR to improve the docs around creating a context and will spend more time on this over the next few days. The UX could be improved.
Thanks for your help @andygrove! I tried versions 20 and 21 and got this error when using a SessionConfig instance:
Maybe it is a new feature?
Sorry @mingmwang, we have some trouble with WordPress sometimes... Here is another version: https://aetperf.github.io/2023/03/30/TPC-H-benchmark-of-Hyper,-DuckDB-and-Datafusion-on-Parquet-files.html
Yes, sorry .. the config I am using is:

```python
config = (
    SessionConfig({"datafusion.execution.parquet.pushdown_filters": "true"})
    .with_create_default_catalog_and_schema(True)
    .with_default_catalog_and_schema("foo", "bar")
    .with_target_partitions(8)
    .with_information_schema(True)
    .with_repartition_joins(False)
    .with_repartition_aggregations(False)
    .with_repartition_windows(False)
    .with_parquet_pruning(False)
)
```
Hi @djfrancesco, how many partitions did you set for each engine? Are they the same? And could you share the CPU usage?
FWIW I am running the scripts in https://github.com/aetperf/sql_olap_bench now on my own machine (they are very nicely written, @djfrancesco) and will report back any interesting findings. My first plan is to reproduce the DataFusion query performance results using datafusion-python and the datafusion-cli directly and see if they are comparable.
@yahoNanJing I have the default settings regarding partitions for each engine. Here is a graph of CPU usage in the case of scale factor 10:
So I have two interesting pieces of information:

- I also see DataFusion using only a single core.
- However, when I run DataFusion against "datafusion created" parquet files from https://github.com/apache/arrow-datafusion/tree/main/benchmarks, it is 3x faster, though still much slower than Hyper (like 100ms vs 928ms).
The explain plan clearly shows DataFusion trying to parallelize the scan:
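A sketch of how such a plan can be produced from Python, assuming a context `ctx` with a `lineitem` table registered as in the earlier snippet; the query is a placeholder:

```python
# Print the logical and physical plans for a query via SQL EXPLAIN.
ctx.sql(
    "EXPLAIN SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag"
).show()
```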
Here is the DuckDB-generated schema:
Here is the DataFusion-generated schema:
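One way to dump the two schemas for comparison, assuming pyarrow is installed; both file names are placeholders:

```python
# Compare the Parquet schemas written by the two engines.
import pyarrow.parquet as pq

print(pq.read_schema("lineitem_duckdb.parquet"))
print(pq.read_schema("lineitem_datafusion.parquet"))
```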
So it seems to me there are two follow-ups to explore:
I will write a smaller self-contained reproducer for the "slow query on a single file" and maybe someone clever can look at that.
I wonder if running parquet-layout against the parquet file might prove insightful. DataFusion is currently limited to row-group-level parallelism, and there certainly are parquet writers that write very large row groups, which would cause issues for this - apache/arrow#34280. Longer term, I would like to eventually get back to #2504, but that is not likely in the next couple of months.
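For readers without the Rust tooling, a rough pyarrow approximation of what parquet-layout reports, which is enough to spot oversized row groups; the path is a placeholder:

```python
# Report per-row-group row counts and byte sizes with pyarrow.
import pyarrow.parquet as pq

meta = pq.ParquetFile("lineitem.parquet").metadata
print(f"{meta.num_row_groups} row groups, {meta.num_rows} rows total")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
```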
The layout is on: #5995 (comment)
I do not know -- the parquet files for these tests are created within DuckDB somehow (it looks like maybe they have included the tpch extension).
Filed #5995 with a reproducer for the "one core scanning" issue.
Yes, I used the DuckDB tpch extension: https://github.com/duckdb/duckdb/tree/master/extension/tpch
#5997 should help with this: the file-range logic was partitioning row groups based on the location of their ColumnMetadata, which is normally written at the end of a ColumnChunk, rather than on the location of their actual data. This causes issues because DuckDB appears not to write the un-inlined metadata and instead just writes a file offset of 0, causing all the row groups to be placed in the first partition. Technically this is probably not spec-compliant of DuckDB, but it is more reliable for us to partition based on where the actual page data is anyway.
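The offsets in question can be inspected with pyarrow; a sketch, with the path and indices as placeholders:

```python
# Inspect where a column chunk's metadata says it starts versus where its
# pages actually live. Per the comment above, DuckDB-written files report
# file_offset == 0.
import pyarrow.parquet as pq

meta = pq.ParquetFile("lineitem_duckdb.parquet").metadata
col = meta.row_group(0).column(0)
print("file_offset:", col.file_offset)
print("dictionary_page_offset:", col.dictionary_page_offset)
print("data_page_offset:", col.data_page_offset)
```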
@Mytherin maybe this is interesting for you.
I think
It is also what is specified in the TPC-H document PDF:
Thanks, this is good news 🎉
The flexibility of the Parquet format causes different writers to use different file generation strategies: the data in a Parquet file can be spread over the row groups and the pages using any encoding and compression the writer or user wants, so the physical layout of the parquet file affects the way different query engines perform. I also saw this issue in this paper: https://dl.gi.de/bitstream/handle/20.500.12116/40316/B3-1.pdf?sequence=1&isAllowed=y
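As a small illustration of how much layout is left to the writer, pyarrow exposes the row-group size directly; the table contents and sizes here are arbitrary:

```python
# The writer decides the physical layout. Here pyarrow caps row groups at
# 100k rows; another writer might emit one huge row group, with very
# different scan parallelism as a result.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})
pq.write_table(table, "example.parquet", row_group_size=100_000)
```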
In the TPC-H documentation, these columns are specified as decimals; the DuckDB tpch extension defines the lineitem columns as:

```cpp
const char *LineitemInfo::Columns[] = {"l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
                                       "l_quantity", "l_extendedprice", "l_discount", "l_tax",
                                       "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
                                       "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment"};
const LogicalType LineitemInfo::Types[] = {
    LogicalType(LogicalTypeId::INTEGER), LogicalType(LogicalTypeId::INTEGER), LogicalType(LogicalTypeId::INTEGER),
    LogicalType(LogicalTypeId::INTEGER), LogicalType(LogicalTypeId::INTEGER), LogicalType::DECIMAL(15, 2),
    LogicalType::DECIMAL(15, 2), LogicalType::DECIMAL(15, 2), LogicalType(LogicalTypeId::VARCHAR),
    LogicalType(LogicalTypeId::VARCHAR), LogicalType(LogicalTypeId::DATE), LogicalType(LogicalTypeId::DATE),
    LogicalType(LogicalTypeId::DATE), LogicalType(LogicalTypeId::VARCHAR), LogicalType(LogicalTypeId::VARCHAR),
    LogicalType(LogicalTypeId::VARCHAR)};
```
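For readers more familiar with Arrow types, an illustrative pyarrow rendering of the same definition, derived from the C++ snippet above rather than from either engine's output:

```python
# The DuckDB lineitem definition above as a pyarrow schema
# (INTEGER -> int32, DECIMAL(15, 2) -> decimal128(15, 2),
#  VARCHAR -> string, DATE -> date32).
import pyarrow as pa

lineitem_schema = pa.schema([
    ("l_orderkey", pa.int32()),
    ("l_partkey", pa.int32()),
    ("l_suppkey", pa.int32()),
    ("l_linenumber", pa.int32()),
    ("l_quantity", pa.int32()),
    ("l_extendedprice", pa.decimal128(15, 2)),
    ("l_discount", pa.decimal128(15, 2)),
    ("l_tax", pa.decimal128(15, 2)),
    ("l_returnflag", pa.string()),
    ("l_linestatus", pa.string()),
    ("l_shipdate", pa.date32()),
    ("l_commitdate", pa.date32()),
    ("l_receiptdate", pa.date32()),
    ("l_shipinstruct", pa.string()),
    ("l_shipmode", pa.string()),
    ("l_comment", pa.string()),
])
print(lineitem_schema)
```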
FWIW this was a bug and should be fixed by apache/arrow#34280. DuckDB not making use of the hierarchical layout is perplexing; I wonder if this was an intentional design decision, as it certainly isn't in the spirit of the specification.
I don't understand this statement. This is my understanding of the difference:
Wouldn't the DuckDB implementation actually potentially be better (as it allows the metadata to be fetched with one contiguous IO)? Perhaps I am misunderstanding.
That was in response to the single page per row group, not the fact that it also has a dubious interpretation of the statistics specification. Inlining statistics makes a lot of sense; a single page per row group seems to sacrifice a lot.
Proposal for making the settings "fast by default": #6287
I think we addressed this particular concern a long time ago -- we can work on additional performance improvements as we find them. Closing this one as I don't think it is actionable now.
Describe the bug
There is a blog post that reports relatively poor performance of DataFusion compared to DuckDB and Hyper:
https://www.architecture-performance.fr/ap_blog/tpc-h-benchmark-of-hyper-duckdb-and-datafusion-on-parquet-files/
To Reproduce
I would like someone to try and reproduce the DataFusion performance reported in the blog and propose ways to improve the performance of DataFusion (perhaps by enabling some of the options that are off by default)
Expected behavior
No response
Additional context
@Dandandan suggests on Slack that with a parallelized scan of a parquet file this benchmark may go faster.