writing to partitioned table uses the wrong column as partition key #7892
@alamb I opened a PR (#7896) which in my testing works correctly for your example case, but only in pure Rust. In datafusion-cli it still fails. I believe this is because of an additional bug in datafusion-cli that incorrectly casts the column from dictionary encoded to non-dictionary encoded. Using #7896, the following example runs as expected:

```rust
use datafusion::{dataframe::DataFrameWriteOptions, prelude::*};
use datafusion_common::DataFusionError;
use object_store::local::LocalFileSystem;
use std::sync::Arc;
use url::Url;

const FILENAME: &str = "/home/dev/arrow-datafusion/input.parquet";

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let _ctx = SessionContext::new();
    let local = Arc::new(LocalFileSystem::new());
    let local_url = Url::parse("file://local").unwrap();
    _ctx.runtime_env().register_object_store(&local_url, local);

    let _read_options = ParquetReadOptions::default();
    let _df = _ctx.read_parquet(FILENAME, _read_options.clone()).await?;
    _df.clone().show_limit(10).await?;
    println!("{}", _df.clone().schema());

    _ctx.sql(
        "create external table
        test(partition string, trace_id varchar)
        stored as parquet
        partitioned by (partition)
        location './temptest2/'
        options (create_local_path 'true');",
    )
    .await?
    .collect()
    .await?;

    _df.clone()
        // This select is needed since the schema equivalence check is sensitive to column ordering
        .select(vec![col("trace_id"), col("partition")])?
        .write_table("test", DataFrameWriteOptions::new())
        .await?;

    _ctx.sql("select count(1) from test").await?.show().await?;

    Ok(())
}
```
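For reference, a minimal standalone sketch of what "casting the column from dictionary encoded to non-dictionary encoded" means at the Arrow level. This is illustrative only, not the actual datafusion-cli code path; it uses the arrow crate as re-exported by DataFusion:

```rust
use datafusion::arrow::array::{Array, DictionaryArray, StringArray};
use datafusion::arrow::compute::cast;
use datafusion::arrow::datatypes::{DataType, Int32Type};

fn main() {
    // A dictionary-encoded string column: distinct values are stored once,
    // rows hold integer keys into that dictionary.
    let dict: DictionaryArray<Int32Type> = vec!["a", "a", "b", "c"].into_iter().collect();
    assert_eq!(
        dict.data_type(),
        &DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))
    );

    // Casting to plain Utf8 keeps the same logical values but changes the DataType,
    // which is enough to trip a strict schema equivalence check.
    let plain = cast(&dict, &DataType::Utf8).unwrap();
    let plain = plain.as_any().downcast_ref::<StringArray>().unwrap();
    assert_eq!(plain.value(0), "a");
    assert_eq!(plain.data_type(), &DataType::Utf8);
}
```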
...while I was looking into #8493, trying to understand how the hive-partitioned writes are implemented, I came across the same issue: it selects the wrong column as the partition column. So I did a basic test. This is what I suspect so far: the [...]. However, when creating the table in [...]. A few lines later, when the actual [...]. The values by the [...]. I am not sure if I am correct, nor why the partition_cols need to be excluded and then added to the end later on. If this could be handled differently, the partition bug might be resolved.

```rust
use datafusion::prelude::*;
use datafusion_common::DataFusionError;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let _ctx = SessionContext::new();

    _ctx.sql(
        "create external table
        test(partition string, trace_id varchar)
        stored as parquet
        partitioned by (partition)
        location './temptest2/'
        options (create_local_path 'true', insert_mode 'append_new_files',);",
    )
    .await?
    .collect()
    .await?;

    _ctx.sql(
        "
        insert into test values
        ('b', '0btid1'), ('b', '0btid2'), ('c', '0ctid1'), ('c', '0ctid2')
        ",
    )
    .await?
    .collect()
    .await?;

    Ok(())
}
```

Debug output from the test; note that the column named "partition" ends up holding the trace_id values:

```
rb: RecordBatch { schema: Schema { fields: [Field { name: "trace_id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "partition", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StringArray
[
  "b",
  "b",
  "c",
  "c",
], StringArray
[
  "0btid1",
  "0btid2",
  "0ctid1",
  "0ctid2",
]], row_count: 4 }

col: partition, col_array: StringArray
[
  "0btid1",
  "0btid2",
  "0ctid1",
  "0ctid2",
]
```
Hey @marvinlanhenke, thanks for digging into this.

The reason there are two schemas for hive-style partition tables is that, by convention, the partition columns are excluded from the underlying files. So in your test table the underlying parquet files would only have one column, "trace_id". DataFusion has to know both the parquet schema and the table schema, and at execution time "fill in" the partition column based on the file path of the parquet files. In the Demux code, this comes into play in the [...].

As for the error in your example, DataFusion is moving the partition_by columns to the end of the table schema regardless of how you declare the table. We may want to update the Create External Table syntax to throw an error if you do not place the partition_by columns at the end, to avoid this confusion. See this note in our insert_to_external sqllogic tests: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/insert_to_external.slt#L178...

In fact, in HiveQL syntax it is an error to declare the partition() columns in the body of the table as well: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-PartitionedTables

I think we could avoid confusion/bugs around schema ordering and partitioning by following this same convention.
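For context on that convention, here is a minimal standalone sketch, not DataFusion's actual ListingTable/demux code, and with a helper function made up purely for illustration, of how a hive-style path carries the partition value that is missing from the file itself:

```rust
// Hypothetical helper, for illustration only: recover a partition value from a
// hive-style path segment such as "partition=b". DataFusion's real listing/demux
// logic is more involved; this only shows the path convention.
fn partition_value(path: &str, partition_col: &str) -> Option<String> {
    path.split('/').find_map(|segment| {
        let (key, value) = segment.split_once('=')?;
        (key == partition_col).then(|| value.to_string())
    })
}

fn main() {
    let path = "temptest2/partition=b/part-0.parquet";
    // The parquet file itself only stores "trace_id"; "partition" comes from the path.
    assert_eq!(partition_value(path, "partition"), Some("b".to_string()));
    assert_eq!(partition_value(path, "trace_id"), None);
}
```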
@devinjdangelo
I like this idea. Would have saved me some confusion ;)
I revisited this on the theory that #9276 fixed it as a side effect. I was wrong and it is still an issue.

```
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test/';
0 rows in set. Query took 0.001 seconds.

❯ insert into test select * from 'input.parquet';
#(runs for a very long time and uses wrong column for partitioning)

❯ insert into test select trace_id, partition from 'input.parquet';
+----------+
| count    |
+----------+
| 15557151 |
+----------+
1 row in set. Query took 1.501 seconds.
```

As shown above, it seems that the order of the columns in the schema affects whether the result is correct. I think we will need to look into the logic which aligns the schema of the table vs. the stream of data which should be written to the table.

Note as well that the listing table moves the "partition" column to the end of the schema. So the second query projects the 'input.parquet' file in the same way, which fixes the issue. It would be nice if the Insert planning was smart enough to align the schemas correctly based on the field names and data types automatically...
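A minimal sketch of the kind of name-based alignment wished for above; this is illustrative only, not DataFusion's insert planning, and the helper and types are made up for the example:

```rust
// Illustrative only: align a batch of named columns to a target schema by field
// name, instead of by position.
fn align_by_name<T: Clone>(
    target_fields: &[&str],
    columns: &[(&str, T)],
) -> Result<Vec<T>, String> {
    target_fields
        .iter()
        .map(|field| {
            columns
                .iter()
                .find(|(name, _)| name == field)
                .map(|(_, values)| values.clone())
                .ok_or_else(|| format!("no input column named '{field}'"))
        })
        .collect()
}

fn main() {
    // Table schema with the partition column moved to the end: (trace_id, partition).
    let target = ["trace_id", "partition"];
    // Incoming data declared in the opposite order: (partition, trace_id).
    let input = [
        ("partition", vec!["b", "c"]),
        ("trace_id", vec!["0btid1", "0ctid1"]),
    ];

    let aligned = align_by_name(&target, &input).unwrap();
    // After alignment, column 0 holds trace_id values and column 1 holds partition values.
    assert_eq!(aligned[0], vec!["0btid1", "0ctid1"]);
    assert_eq!(aligned[1], vec!["b", "c"]);
}
```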
So, we actually do have some logic to realign the schemas in [...]. The relevant bit of code is here: https://github.com/apache/arrow-datafusion/blob/581fd98270e69221689ddd4c37566ef620a67167/datafusion/sql/src/statement.rs#L1154-L1160
@alamb I've narrowed this down to logical planning for [...]. I believe we should add logic to [...].
I don't think SQL aligns on field name, but instead only on position. I think the two cases are [...]

Does that make sense? Maybe we can simply add a ProjectionExec that does the alignment correctly? Here is an example showing that position is used:

```
postgres=# create table foo(x int, y int);
CREATE TABLE
postgres=# insert into foo values (1, 100);
INSERT 0 1
```

Now, insert into bar via select * from foo, and position is used:

```
postgres=# create table bar (y int, x int);
CREATE TABLE
postgres=# insert into bar select * from foo;
INSERT 0 1
postgres=# select * from bar;
 y | x
---+-----
 1 | 100
(1 row)
```

Insert too few columns into bar, and the remaining columns are padded with null rather than aligned by column name:

```
postgres=# insert into bar select x from foo;
INSERT 0 1
postgres=# select * from bar;
 y | x
---+-----
 1 | 100
 1 |
(2 rows)
```
In that case, the existing insert logical planning and partitioned writes are behaving correctly. The reproducer in this issue is correct behavior which is confusing due to [...].

Note that the table schema and the parquet schema are not aligned by position due to ListingTable moving partition columns to the end. HiveQL DDL statements avoid this confusing behavior by having the column declaration moved entirely to the partition by clause:

```sql
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  ip STRING)
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
```

Note that dt and country are ONLY declared in the PARTITIONED BY clause, not in the body of the table.
I agree with your analysis -- thank you @devinjdangelo -- it is clear. I will close this ticket and let's continue the discussion on #9465
Describe the bug
writing to partitioned table uses the wrong column as partition key
To Reproduce
It looks like it used the wrong column as the partition key:
Here is the input:
input.parquet
Here is how I made it:
Expected behavior
I expect three output directories to be created, one for each of the distinct values of partition_id: a, b and c.
Additional context
Found as part of #7801 (review)
@devinjdangelo did some initial investigation: #7801 (comment)