Release Note 3.0.1 #39570

Open
gavinchou opened this issue Aug 19, 2024 · 3 comments
Comments


gavinchou commented Aug 19, 2024

Behavior Changes

Query Optimizer

  • Added the variable use_max_length_of_varchar_in_ctas to control the length behavior of VARCHAR type when executing CREATE TABLE AS SELECT (CTAS) operations. #37069
    • This variable is set to true by default.
    • When set to true, if the VARCHAR type column originates from a table, the derived length is used; otherwise, the maximum length is used.
    • When set to false, the VARCHAR type will always use the derived length.
  • All data types will now be displayed in lowercase to maintain compatibility with MySQL format. #38012
  • Multiple query statements in the same query request must now be separated by semicolons. #38670
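The CTAS length rule for use_max_length_of_varchar_in_ctas described above can be modeled in a few lines. This is only a sketch of the stated rule, not Doris's implementation: the VarcharColumn type and the function name are hypothetical, and the maximum length of 65533 is an assumption that the release note does not state.

```python
from dataclasses import dataclass

# Assumed maximum VARCHAR length; this value is not given in the release note.
MAX_VARCHAR_LENGTH = 65533


@dataclass
class VarcharColumn:
    """Hypothetical description of a VARCHAR column in a CTAS select list."""
    derived_length: int  # length inferred for the expression
    from_table: bool     # True if the column comes directly from a table


def ctas_varchar_length(col: VarcharColumn,
                        use_max_length_of_varchar_in_ctas: bool = True) -> int:
    """Return the VARCHAR length the new table's column would get."""
    if not use_max_length_of_varchar_in_ctas:
        # false: always use the derived length
        return col.derived_length
    # true: derived length for table columns, maximum length otherwise
    return col.derived_length if col.from_table else MAX_VARCHAR_LENGTH


if __name__ == "__main__":
    print(ctas_varchar_length(VarcharColumn(10, from_table=True)))
    print(ctas_varchar_length(VarcharColumn(10, from_table=False)))
    print(ctas_varchar_length(VarcharColumn(10, from_table=False), False))
```

Under these assumptions, only columns that do not originate from a table (for example, the result of a string function) are widened, and only while the variable is true.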

Query Execution

  • The default number of parallel tasks after shuffle operations in the cluster is set to 100, which will improve query stability and concurrent processing capability in large clusters. #38196

Storage

  • The default value of trash_file_expire_time_sec has been changed from 86400 seconds to 0 seconds, which means that if files are deleted by mistake and the FE trash is cleared, the data cannot be recovered.
  • The table attribute enable_mow_delete_on_delete_predicate (introduced in version 3.0.0) has been renamed to enable_mow_light_delete.
  • Explicit transactions are now prohibited from performing delete operations on tables with written data.
  • Heavy schema change operations are prohibited on tables with auto-increment fields.

New Features

Job Scheduling

  • Optimized the execution logic of internal scheduling jobs, decoupling the strong association between start time and immediate execution parameters. Now, tasks can be created with a specified start time or selected for immediate execution, without conflict, enhancing scheduling flexibility. #36805

Compute-Storage Decoupled

  • Supports dynamic modification of the upper limit for file cache usage. #37484
  • Recycler now supports object storage rate limiting and server-side rate limiting retry functionality. #37663 #37680

Lakehouse

  • Added the session variable serde_dialect to set the output format for complex types. #37039
  • SQL interception now supports external tables.
  • Insert overwrite now supports Iceberg tables. #37191

Asynchronous Materialized Views

  • Supports partition roll-up and build at the hourly level. #37678
  • Supports atomic replacement of asynchronous materialized view definition statements. #36749
  • Transparent rewriting now supports Insert statements. #38115
  • Transparent rewriting now supports the VARIANT type. #37929

Query Execution

  • The group_concat function now supports the DISTINCT and ORDER BY options. #38744

Semi-Structured Data Management

  • The ES Catalog now maps nested or object types in Elasticsearch to the JSON type in Doris. #37101
  • Added the MULTI_MATCH function, which supports matching keywords across multiple fields and can leverage inverted indexes to accelerate searches. #37722
  • Added the explode_json_object function, which can unfold objects in JSON data into multiple rows. #36887
  • Inverted indexes now support memtable advancement, requiring index construction only once during multi-replica writes, reducing CPU consumption and improving performance. #35891
  • Added MATCH_PHRASE support for positive slop, e.g., msg MATCH_PHRASE 'a b 2+' matches text containing the words a and b with a slop of no more than two and with a appearing before b; a regular slop without the trailing + does not guarantee this order. #36356
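The difference between regular slop and positive slop in MATCH_PHRASE can be illustrated with a small model. This is a sketch that assumes Lucene-style slop, where slop is the displacement of the second word from its position in the exact phrase; whether Doris computes slop exactly this way is an assumption, and phrase_match is a hypothetical helper, not a Doris function.

```python
def phrase_match(tokens, first, second, slop, ordered=False):
    """Check a two-word phrase against a token list.

    ordered=False models a regular slop such as 'a b 2'.
    ordered=True models a positive slop such as 'a b 2+'.
    """
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    for pa in firsts:
        for pb in seconds:
            if ordered and pa >= pb:
                # Positive slop requires `first` to appear before `second`.
                continue
            # In the exact phrase, `second` sits at position pa + 1.
            if abs(pb - (pa + 1)) <= slop:
                return True
    return False


if __name__ == "__main__":
    print(phrase_match("a x x b".split(), "a", "b", 2, ordered=True))
    print(phrase_match("b a".split(), "a", "b", 2, ordered=False))
    print(phrase_match("b a".split(), "a", "b", 2, ordered=True))
```

In this model the text "b a" satisfies a slop of two only when order is not enforced, which is the case the trailing + is meant to exclude.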

Other

  • Added the FE parameter skip_audit_user_list; operations by users listed in this configuration are not recorded in the audit log. #38310
    • For more information, refer to the documentation on Audit Plugin.

Improvements

Storage

  • Reduced the likelihood of write failures caused by disk balancing within a single BE. #38000
  • Decreased memory consumption by the memtable limiter. #37511
  • Moved old partitions to the FE trash during partition replacement operations. #36361
  • Optimized memory consumption during compaction. #37099
  • Added a session variable to control audit logs for JDBC PreparedStatement, which are not printed by default. #38419
  • Optimized the logic for selecting BEs for group commits. #35558
  • Improved the performance of column updates. #38487
  • Optimized the use of delete bitmap cache. #38761
  • Added a configuration to control query affinity during hot and cold tiering. #37492

Compute-Storage Decoupled

  • Implemented automatic retries when encountering object storage server rate limiting. #37199
  • Adapted the number of threads for memtable flush in the compute-storage decoupled mode. #38789
  • Added Azure as a compile option to support compilation in environments without Azure support.
  • Optimized the observability of object storage access rate limiting. #38294
  • Allowed the file cache TTL queue to perform LRU eviction, enhancing TTL queue usability. #37312
  • Optimized the number of balance writeeditlog IO operations in compute-storage decoupled mode. #37787
  • Improved table creation speed in compute-storage decoupled mode by sending tablet creation requests in batches. #36786
  • Mitigated read failures caused by potential inconsistencies in the local file cache by retrying with backoff. #38645
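Backoff retry, as mentioned for the local file cache reads above, generally follows the pattern below. This is a generic illustration only: the function name, the OSError trigger, and all delay parameters are assumptions and do not reflect the actual BE code or its settings.

```python
import random
import time


def read_with_backoff(read_fn, max_attempts=5, base_delay=0.05,
                      max_delay=1.0, sleep=time.sleep):
    """Call read_fn, retrying on OSError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return read_fn()
        except OSError:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Double the delay on each failure, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent readers.
            sleep(delay * random.uniform(0.5, 1.0))


if __name__ == "__main__":
    attempts = []

    def flaky_read():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts.append(1)
        if len(attempts) < 3:
            raise OSError("cache entry not ready")
        return b"data"

    print(read_with_backoff(flaky_read))
```

The sleep function is injectable so that the retry loop can be exercised without real delays.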

Lakehouse

  • Optimized memory statistics for Parquet/ORC format read and write operations. #37234
  • Trino Connector Catalog now supports predicate pushdown. #37874
  • Added a session variable enable_count_push_down_for_external_table to control whether to enable count(*) pushdown optimization for external tables. #37046
  • Optimized the read logic for Hudi snapshot reads, returning an empty set when the snapshot is empty, consistent with Spark behavior. #37702
  • Improved the read performance of partition columns for Hive tables. #37377

Asynchronous Materialized Views

  • Improved transparent rewrite plan speed by 20%. #37197
  • Eliminated roll-up during transparent rewrite if the group key satisfies data uniqueness for better nested matching. #38387
  • Transparent rewrite now performs better aggregation elimination to improve the matching success rate of nested materialized views. #36888

MySQL Compatibility

  • Now correctly populates the database name, table name, and original name in the MySQL protocol result columns. #38126
  • Supported the hint format /*+ func(value) */. #37720

Query Optimizer

  • Significantly improved the plan speed for complex queries. #38317
  • Now adaptively chooses whether to perform bucket shuffle based on the number of data buckets, to avoid performance degradation in extreme cases. #36784
  • Optimized the cost estimation logic for SEMI / ANTI JOIN. #37951 #37060
  • Supported pushing Limit down to the first stage of aggregation to improve performance. #34853
  • Partition pruning now supports filter conditions containing the date_trunc or date function. #38025 #38743
  • SQL cache now supports query scenarios that include user variables. #37915
  • Optimized error messages for invalid aggregation semantics. #38122

Query Execution

  • Adapted AggState compatibility from 2.1 to 3.x and fixed Coredump issues. #37104
  • Refactored the strategy selection for local shuffle without Join. #37282
  • Modified the scanner for internal table queries to be asynchronous to prevent stalling during such queries. #38403
  • Optimized the block merge process during Hash table construction for Join operators. #37471
  • Optimized the duration of lock holding for MultiCast. #37462
  • Optimized gRPC keepAliveTime and added link monitoring to reduce the probability of query failure due to RPC errors. #37304
  • Cleaned up all dirty pages in jemalloc when memory limits were exceeded. #37164
  • Optimized the processing performance of aes_encrypt/decrypt functions for constant types. #37194
  • Optimized the processing performance of the json_extract function for constant data. #36927
  • Optimized the processing performance of the ParseUrl function for constant data. #36882

Semi-Structured Data Management

  • Bitmap indexes now default to using inverted indexes, with enable_create_bitmap_index_as_inverted_index set to true by default. #36692
  • In the compute-storage decoupled mode, DESC can now view sub-columns of VARIANT type. #38143
  • Removed the step of checking file existence during inverted index queries to reduce access latency to remote storage. #36945
  • Complex types ARRAY / MAP / STRUCT now support replace_if_not_null for AGG tables. #38304
  • Escape characters for JSON data are now supported. #37176 #37251
  • Inverted index queries now behave consistently on MOW tables and DUP tables. #37428
  • Optimized the performance of inverted index acceleration for IN queries. #37395
  • Reduced unnecessary memory allocation during TOPN queries to improve performance. #37429
  • When creating an inverted index with tokenization, the support_phrase option is now automatically enabled to accelerate the match_phrase family of phrase queries. #37949

Other

  • Audit logs can now record SQL types. #37790
  • Added support for showing all FEs in information_schema.processlist. #38701
  • Cached Ranger's datamask and rowpolicy to speed up queries. #37723
  • Optimized metadata management in job manager to release locks immediately after modifying metadata, reducing lock holding time. #38162

Bug Fixes

Upgrade

  • Fix the issue where mtmv load fails during upgrade from version 2.1. #38799
  • Resolve the issue where null_type cannot be found during the upgrade from version 2.1. #39373
  • Address the compatibility issue with permission persistence during the upgrade from version 2.1 to 3.0. #39288

Load

  • Fix the issue where CSV parsing fails when a newline character is wrapped in enclosing characters. #38347
  • Resolve potential exception issues when FE forwards group commit. #38228 #38265
  • Group commit now supports the new optimizer. #37002
  • Fix the issue where group commit reports data errors when JDBC setNull is used. #38262
  • Optimize the retry logic for group commit when encountering delete bitmap lock errors. #37600
  • Resolve the issue where routine load cannot use CSV enclosing characters and escape characters. #38402
  • Fix the issue where routine load job names with mixed case cannot be displayed. #38523
  • Optimize the logic for actively recovering routine load during FE master-slave switching. #37876
  • Resolve the issue where routine load pauses when all data in Kafka is expired. #37288
  • Fix the issue where show routine load returns empty results. #38199
  • Resolve the memory leak issue during multi-table stream import in routine load. #38255
  • Fix the issue where stream load does not return the error URL. #38325
  • Resolve potential load channel leak issues. #38031 #37500
  • Fix the issue where no error may be reported when importing fewer segments than expected. #36753
  • Resolve the load stream leak issue. #38912
  • Optimize the impact of offline nodes on import operations. #38198
  • Fix the issue where transactions do not end when an INSERT INTO statement inserts empty data. #38991

Storage

Backup and Restoration

  • Fix the issue where tables cannot be written after backup and restoration. #37089
  • Resolve the issue where view database names are incorrect after backup and restoration. #37412

Compaction

  • Fix the issue where cumulative compaction handles deletes incorrectly during ordered data compaction. #38742
  • Resolve the issue of duplicate keys in aggregate tables caused by the sequential compaction optimization. #38224
  • Fix the issue where compaction operations cause coredump on large wide tables. #37960
  • Resolve the compaction starvation issue caused by inaccurate concurrency statistics of compaction tasks. #37318

MOW Unique Key

  • Resolve the issue of inconsistent data between replicas caused by cumulative compaction removing the delete sign. #37950
  • MOW delete now uses partial column updates with the new optimizer. #38751
  • Fix the potential duplicate key issue in MOW tables in compute-storage decoupled mode. #39018
  • Resolve the issue where MOW unique and duplicate tables cannot modify column order. #37067
  • Fix the potential data correctness issue caused by segcompaction. #37760
  • Resolve the potential memory leak issue during column updates. #37706

Other

  • Fix exceptions that could occur with low probability in TOPN queries. #39119 #39199
  • Resolve the issue where auto-increment IDs may duplicate during FE restart. #37306
  • Fix the potential queuing issue in the delete operation priority queue. #37169
  • Optimize the delete retry logic. #37363
  • Resolve the issue with bucket = 0 in table creation statements under the new optimizer. #38971
  • Fix the issue where FE reports success incorrectly when image generation fails. #37508
  • Resolve the issue where using the wrong nodename when taking FE nodes offline may cause inconsistent FE membership. #37987
  • Fix the issue where CCR partition addition may fail. #37295
  • Resolve the int32 overflow issue in inverted index files. #38891
  • Fix the issue where TRUNCATE TABLE failure may cause BE to fail to go offline. #37334
  • Resolve the issue where publish cannot continue due to null pointers. #37724 #37531
  • Fix the potential coredump issue when manually triggering disk migration. #37712

Compute-Storage Decoupled

  • Fixed the issue where show create table might display the file_cache_ttl_seconds attribute twice. #38052
  • Fixed the issue where segment Footer TTL was not set correctly after setting file cache TTL. #37485
  • Fixed the issue where file cache might cause coredump due to massive conversion of cache types. #38518
  • Fixed the potential file descriptor (fd) leak in file cache. #38051
  • Fixed the issue where schema change Job overwriting compaction Job prevented base tablet compaction from completing normally. #38210
  • Fixed the potential inaccuracy of base compaction score due to data race. #38006
  • Fixed the issue where error messages from imports might not be uploaded correctly to object storage. #38359
  • Fixed the inconsistency in return information between compute-storage decoupled mode and integrated storage-compute mode for 2PC imports. #38076
  • Fixed the issue where an incorrect file size setting during file cache warm-up led to coredump. #38939
  • Fixed the issue where partial column updates did not correctly dequeue delete operations. #37151
  • Fixed compatibility issues with permission persistence in compute-storage decoupled mode. #38136 #37708
  • Fixed the issue where observer did not retry correctly when encountering a -230 error. #37625
  • Fixed the issue where show load with conditions did not perform correct analysis. #37656
  • Fixed the issue where show streamload in compute-storage decoupled mode caused BE coredump. #37903
  • Fixed the issue where copy into did not correctly verify column names in strict mode. #37650
  • Fixed the issue where multi-stream imports into a single table lacked permissions. #38878
  • Fixed the potential overflow issue in getVersionUpdateTimeMs. #38074
  • Fixed the issue where FE azure blob list was not implemented correctly. #37986
  • Fixed the issue where inaccurate azure blob recycling time calculation prevented recycling. #37535
  • Fixed the issue where inverted index files were not deleted in compute-storage decoupled mode. #38306

Lakehouse

  • Fixed the issue with reading binary data from Oracle Catalog. #37078
  • Fixed the potential deadlock issue when acquiring external table metadata in multi-FE scenarios. #37756
  • Fixed the issue where JNI scanner failure caused BE nodes to crash. #37697
  • Fixed the issue with slow reading of date types from Trino Connector Catalog. #37266
  • Optimized kerberos authentication logic for Hive Catalog. #37301
  • Fixed the issue where region attributes might be parsed incorrectly when parsing MinIO properties. #37249
  • Fixed the issue where creating too many FileSystems by FE caused memory leaks. #36954
  • Fixed the issue where incorrect time zone information was read from Paimon. #37716
  • Fixed the potential thread leak issue caused by Hive write-back operations. #36990
  • Fixed the null pointer issue caused by enabling Hive metastore event synchronization. #38421
  • Fixed the issue where error messages were unclear or caused stalling when creating catalogs. #37551
  • Fixed the issue where reading Hive text format tables behaved differently from Hive. #37638
  • Fixed the logic error when switching between catalogs and databases. #37828

MySQL Compatibility

  • Fixed the issue where certain flags in the MySQL protocol were set incorrectly when SSL was enabled. #38086

Asynchronous Materialized Views

  • Fixed the issue where construction might fail when the base table had a very large number of partitions. #37589
  • Fixed the issue where nested materialized views incorrectly performed full table refreshes even when partition refreshes were possible. #38698
  • Fixed the issue where partition refresh could not handle the simultaneous existence of valid and invalid dependencies when analyzing partition dependencies. #38367
  • Fixed the issue where the final result containing NULL type might cause asynchronous materialized views to fail. #37019
  • Fixed the planning error that might occur during transparent rewriting when both synchronous and asynchronous materialized views with the same name were present. #37311

Synchronous Materialized Views

  • The rewritten synchronous materialized views now can correctly perform partition pruning. #38527
  • When rewriting synchronous materialized views, those with unready data are no longer selected. #38148

Query Optimizer

  • Fixed the deadlock issue that might occur when queries and delete operations are performed simultaneously. #38660
  • Fixed the issue where bucket pruning might incorrectly prune on decimal column buckets. #37889
  • Fixed the issue where planning might be incorrect when mark Join participates in Join reorder. #39152
  • Fixed the issue where the result is incorrect when the correlation condition of a correlated subquery is not a simple column. #37644
  • Fixed the issue where partition pruning cannot correctly handle OR expressions. #38897
  • Fixed the planning error that might occur when optimizing the execution order of JOIN and AGG. #37343
  • Fixed the issue where str_to_date performs incorrect constant folding calculations on DATEV1 types. #37360
  • Fixed the issue where the ACOS function's constant folding returns non-NaN values. #37932
  • Fixed the occasional planning error: "The children format needs to be [WhenClause+, DefaultValue?]". #38491
  • Fixed the issue where planning might be incorrect when the projection includes window functions and there is both the original column and its alias. #38166
  • Fixed the issue where planning might report an error when the aggregation parameter contains a lambda expression. #37109
  • Fixed the Insert error that might occur in extreme cases: "MultiCastDataSink cannot be cast to DataStreamSink". #38526
  • Fixed the issue where the new optimizer does not correctly handle char(0)/varchar(0) when creating a table. #38427
  • Fixed the incorrect behavior of char(255) toSql. #37340
  • Fixed the issue where the nullable attribute within the agg_state type might lead to planning errors. #37489
  • Fixed the issue where row count statistics are inaccurate during mark Join. #38270

Query Execution

  • Fixed issues where the Pipeline execution engine was stuck, causing queries to not end, in multiple scenarios. #38657, #38206, #38885, #38151, #37297
  • Fixed the coredump issue caused by null and non-null columns during set difference calculations. #38750
  • Fixed the error when using the DECIMAL type with pure decimals in delete statements. #37801
  • Fixed the issue where the width_bucket function returned incorrect results. #37892
  • Fixed the query error when a single row of data was very large and the result set was also large (exceeding 2GB). #37990
  • Fixed the coredump issue caused by incorrect release of rpc connections during single-replica imports. #38087
  • Fixed the coredump issue caused by processing NULL values with the foreach function. #37349
  • Fixed the issue where stddev returned incorrect results for DECIMALV2 types. #38731
  • Fixed the slow performance of bitmap union calculations. #37816
  • Fixed the issue where RowsProduced for aggregation operators was not set in the profile. #38271
  • Fixed the overflow issue when calculating the number of buckets for the hash table under hash join. #37193, #37493
  • Fixed the inaccurate recording of the jemalloc cache memory tracker. #37464
  • Added the enable_stacktrace configuration option, allowing users to control whether exception stacks are output in BE logs. #37713
  • Fixed the issue where Arrow Flight SQL did not work correctly when enable_parallel_result_sink was set to false. #37779
  • Fixed the incorrect use of colocate Join. #37361, #37729
  • Fixed the calculation overflow issue of the round function on DECIMAL128 types. #37733, #38106
  • Fixed the coredump issue when passing a const string to the sleep function. #37681
  • Increased the queue length for audit logs, solving the issue where audit logs could not be recorded normally under high concurrency scenarios with thousands of concurrent connections. #37786
  • Fixed the issue where creating a workload group caused too many threads, leading to BE coredump. #38096
  • Fixed the coredump issue caused by the MULTI_MATCH_ANY function. #37959
  • Fixed the transaction rollback issue caused by insert overwrite auto partition. #38103
  • Fixed the issue where the TimeUtils formatter did not use the correct time zone. #37465
  • Fixed the issue where results were incorrect under constant folding scenarios for week/yearweek. #37376
  • Fixed the issue where the convert_tz function returned incorrect results. #37358, #38764
  • Fixed the coredump issue when using the collect_set function with window functions. #38234
  • Fixed the coredump issue caused by percentile_approx during rolling upgrades. #39321
  • Fixed the coredump issue caused by the mod function when encountering abnormal input. #37999
  • Fixed the issue where the hash table was not fully built when the broadcast Join probe started running. #37643
  • Fixed the issue where executing the same expression in multithreaded environments might lead to incorrect results for Java UDFs. #38612
  • Fixed the overflow issue caused by incorrect return types of the conv function. #38001
  • Fixed the issue where the json_replace function returned incorrect types. #3701
  • Fixed the issue where the nullable attribute setting was unreasonable for the percentile aggregation function. #37330
  • Fixed the issue where the results of the histogram function were unstable. #38608
  • Fixed the issue where Task State was displayed incorrectly in the profile. #38082
  • Fixed the issue where some queries were incorrectly canceled when the system just started. #37662

Semi-Structured Data Management

  • Fix some issues with time series compaction. #39170 #39176
  • Fix the issue of incorrect index size statistics during compaction. #37232
  • Fix the potential incorrect matching of ultra-long strings without tokenization in inverted indexes. #37679 #38218
  • Fix the high memory usage issue of array_range and array_with_const functions when dealing with large data volumes. #38284 #37495
  • Fix the potential coredump issue when selecting columns of ARRAY / MAP / STRUCT types. #37936
  • Fix the import failure issue caused by simdjson parsing errors when specifying jsonpath in Stream Load. #38490
  • Fix the exception handling issue when there are duplicate keys in JSON data. #38146
  • Fix the potential query error after DROP INDEX. #37646
  • Fix the error return issue in row merging checks during index compaction. #38732
  • Inverted index v2 format now supports renaming columns. #38079
  • Fix the coredump issue when the MATCH function matches an empty string without an index. #37947
  • Fix the handling of NULL values in inverted indexes. #37921 #37842 #38741
  • Fix the incorrect row_store_page_size after FE restart. #38240

Other

  • Fix the timezone configuration issue. The default timezone is no longer fixed at UTC+8 and is now obtained from system configuration. #37294
  • Fix the class conflict issue when using ranger due to multiple JSR specification implementations. #37575
  • Fix the potential uninitialized field issue in some BE code. #37403
  • Fix the error in delete statements for random distributed tables. #37985
  • Fix the incorrect requirement for alter_priv permission on the base table when creating a synchronous materialized view. #38011
  • Fix the issue of not authenticating resources when used in TVF. #36928
gavinchou changed the title from "3.0.1 release note" to "[WIP] 3.0.1 release note" on Aug 19, 2024
@gavinchou

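Behavior Changes

Optimizer

  • Added the variable use_max_length_of_varchar_in_ctas to control the length behavior of the VARCHAR type when executing CREATE TABLE AS SELECT (CTAS) operations. This variable is set to true by default. When set to true, if a VARCHAR column originates from a table, the derived length is used; otherwise, the maximum length is used. When set to false, the VARCHAR type always uses the derived length. #37069
  • All data types are now displayed in lowercase to maintain compatibility with the MySQL format. #38012
  • Multiple query statements in the same query request must now be separated by semicolons. #38670

Query Execution

  • The default number of parallel tasks after shuffle operations in the cluster is now set to 100, which improves query stability and concurrent processing capability in large clusters. #38196

Storage

  • The default value of trash_file_expire_time_sec has been changed from 86400 seconds to 0 seconds, which means that if files are deleted by mistake and the FE trash is cleared, the data cannot be recovered.
  • The table attribute enable_mow_delete_on_delete_predicate (introduced in version 3.0.0) has been renamed to enable_mow_light_delete.
  • Explicit transactions are now prohibited from performing delete operations on tables with written data.
  • Heavy schema change operations are prohibited on tables with auto-increment fields.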

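New Features

Job Scheduling

  • Optimized the execution logic of internal scheduling jobs, removing the strong coupling between the start time and immediate execution parameters. Tasks can now be created with a specified start time or set to execute immediately, and the two no longer conflict, which makes scheduling more flexible. #36805

Compute-Storage Decoupled

  • Supports dynamically changing the upper limit of file cache usage. #37484
  • Recycler now supports object storage rate limiting and server-side rate limiting retries. #37663 #37680

Lakehouse

  • Added the session variable serde_dialect to set the output format for complex types. #37039
  • SQL interception now supports external tables. Documentation: SQL Interception
  • Insert Overwrite now supports Iceberg tables. #37191

Asynchronous Materialized Views

  • Supports partition roll-up and build at the hourly level. #37678
  • Supports atomic replacement of asynchronous materialized view definition statements. #36749
  • Transparent rewriting now supports insert statements. #38115
  • Transparent rewriting now supports the variant type. #37929

Query Execution

  • The group concat function now supports the distinct and order by options. #38744

Semi-Structured Data Management

  • The ES Catalog now maps nested or object types in Elasticsearch to the JSON type in Doris. #37101
  • Added the MULTI_MATCH function, which matches keywords across multiple fields and can use inverted indexes to accelerate searches. #37722
  • Added the explode_json_object function, which expands objects in JSON data into multiple rows. #36887
  • Inverted indexes now support memtable advancement, so the index is built only once during multi-replica writes, reducing CPU consumption and improving performance. #35891
  • Added MATCH_PHRASE support for positive slop, e.g., msg MATCH_PHRASE 'a b 2+' matches text containing the words a and b with a slop of no more than two and with a before b; a regular slop without the trailing plus sign + does not guarantee that a comes before b. #36356

Other

  • Added the FE parameter skip_audit_user_list; operations by users in this configuration item are not recorded in the audit log. #38310
  • Documentation: Audit Plugin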

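Improvements

Storage

  • Reduced the likelihood of write failures caused by balancing between disks within a single BE. #38000
  • Reduced the memory consumption of the memtable limiter. #37511
  • Old partitions are now moved to the FE trash during partition replacement operations. #36361
  • Optimized memory consumption during compaction. #37099
  • Added a session variable to control audit logs for JDBC prepared statements, which are not printed by default. #38419
  • Optimized the logic for selecting BEs for group commit. #35558
  • Improved the performance of column updates. #38487
  • Optimized the use of the delete bitmap cache. #38761
  • Added a configuration to control query affinity during hot and cold tiering. #37492

Compute-Storage Decoupled

  • Retries are now performed automatically when object storage server-side rate limiting is encountered. #37199
  • Adapted the number of memtable flush threads for compute-storage decoupled mode. #38789
  • Made Azure a compile option to support compilation in environments without Azure support.
  • Optimized the observability of object storage access rate limiting. #38294
  • Allowed the file cache TTL queue to perform LRU eviction, improving the usability of the TTL queue. #37312
  • Optimized the number of balance writeeditlog IO operations in compute-storage decoupled mode. #37787
  • Improved table creation speed in compute-storage decoupled mode by sending tablet creation requests in batches. #36786
  • Mitigated read failures caused by potential inconsistencies in the local file cache by retrying with backoff. #38645

Lakehouse

  • Optimized memory statistics for Parquet/ORC format read and write operations. #37234
  • Trino Connector Catalog now supports predicate pushdown. #37874
  • Added the session variable enable_count_push_down_for_external_table to control whether count(*) pushdown optimization is enabled for external tables. #37046
  • Optimized the read logic for Hudi snapshot reads, returning an empty set when the snapshot is empty, consistent with Spark behavior. #37702
  • Improved the read performance of partition columns for Hive tables. #37377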

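Asynchronous Materialized Views

  • Improved transparent rewrite planning speed by 20%. #37197
  • If the group key satisfies data uniqueness, roll-up is no longer performed during transparent rewriting, for better nested matching. #38387
  • Transparent rewriting now performs better aggregation elimination to improve the matching success rate of nested materialized views. #36888

MySQL Compatibility

  • The database name, table name, and original name of result columns in the MySQL protocol are now populated correctly. #38126
  • Added support for the hint format /*+ func(value) */. #37720

Query Optimizer

  • Significantly improved the planning speed for complex queries. #38317
  • Now adaptively chooses whether to perform bucket shuffle based on the number of data buckets, to avoid performance degradation in extreme cases. #36784
  • Optimized the cost estimation logic for semi/anti join. #37951 #37060
  • Supports pushing limit down to the first stage of aggregation to improve performance. #34853
  • Partition pruning now supports filter conditions containing the date_trunc or date function. #38025 #38743
  • SQL cache now supports query scenarios that include user variables. #37915
  • Optimized error messages for invalid aggregation semantics. #38122

Query Execution

  • Adapted AggState compatibility from 2.1 to 3.x and fixed coredump issues. #37104
  • Refactored the strategy selection for local shuffle without join. #37282
  • Changed the scanner for internal table queries to be asynchronous to prevent stalling when querying internal tables. #38403
  • Optimized the block merge process when the Join operator builds the hash table. #37471
  • Optimized the lock holding time of MultiCast. #37462
  • Optimized gRPC keepAliveTime and added a link monitoring mechanism, reducing the probability of query failure due to RPC errors. #37304
  • All dirty pages in jemalloc are now cleaned up when memory limits are exceeded. #37164
  • Optimized the processing performance of the aes_encrypt/decrypt functions for constant types. #37194
  • Optimized the processing performance of the json_extract function for constant data. #36927
  • Optimized the processing performance of the ParseUrl function for constant data. #36882

Semi-Structured Data Management

  • Bitmap indexes now use inverted indexes by default, with enable_create_bitmap_index_as_inverted_index set to true by default. #36692
  • In compute-storage decoupled mode, DESC can now show the sub-columns of the variant type. #38143
  • Removed the step of checking file existence during inverted index queries to reduce access latency to remote storage. #36945
  • The complex types ARRAY / MAP / STRUCT now support replace_if_not_null for AGG tables. #38304
  • Escape characters in JSON data are now supported. #37176 #37251
  • Inverted index queries now behave consistently on MOW tables and DUP tables. #37428
  • Optimized the performance of inverted index acceleration for IN queries. #37395
  • Reduced unnecessary memory allocation during TOPN queries to improve performance. #37429
  • When creating an inverted index with tokenization, the support_phrase option is now automatically enabled to accelerate the match_phrase family of phrase queries. #37949

Other

  • Audit logs now record SQL types. #37790
  • Added support for showing all FEs in information_schema.processlist. #38701
  • Cached Ranger's datamask and rowpolicy to speed up queries. #37723
  • Optimized metadata management in Job Manager to release locks immediately after modifying metadata, reducing lock holding time. #38162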

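Bug Fixes

Upgrade

  • Fixed the issue where mtmv load failed during upgrade from version 2.1. #38799
  • Fixed the issue where null_type could not be found when upgrading version 2.1. #39373
  • Fixed the compatibility issue with permission persistence during upgrade from version 2.1 to version 3.0. #39288

Load

  • Fixed the issue where CSV parsing failed when a newline character was wrapped in enclosing characters. #38347
  • Fixed potential exceptions when FE forwards group commit. #38228 #38265
  • Group commit now supports the new optimizer. #37002
  • Fixed the issue where group commit reported data errors when JDBC setNull was used. #38262
  • Optimized the retry logic for group commit when encountering delete bitmap lock errors. #37600
  • Fixed the issue where routine load could not use CSV enclosing characters and escape characters. #38402
  • Fixed the issue where routine load job names with mixed case could not be displayed. #38523
  • Optimized the logic for actively recovering routine load during FE master-slave switching. #37876
  • Fixed the issue where routine load paused when all data in Kafka had expired. #37288
  • Fixed the issue where show routine load returned empty results. #38199
  • Fixed the memory leak during multi-table stream import in routine load. #38255
  • Fixed the issue where stream load did not return the error URL. #38325
  • Fixed potential load channel leaks. #38031 #37500
  • Fixed the issue where no error might be reported when importing fewer segments than expected. #36753
  • Fixed the load stream leak. #38912
  • Optimized the impact of offline nodes on import operations. #38198
  • Fixed the issue where transactions did not end when insert into inserted empty data. #38991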

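Storage

Backup and Restoration

  • Fixed the issue where tables could not be written after backup and restoration. #37089
  • Fixed the issue where database names in views were incorrect after backup and restoration. #37412

Compaction

  • Fixed the issue where cumu compaction handled deletes incorrectly during ordered data compaction. #38742
  • Fixed the issue of duplicate keys in aggregate tables caused by the sequential compaction optimization. #38224
  • Fixed the issue where compaction operations caused coredump on large wide tables. #37960
  • Fixed the compaction starvation issue caused by inaccurate concurrency statistics of compaction tasks. #37318

MOW Unique Key

  • Resolved the issue of inconsistent data between replicas caused by cumulative compaction removing the delete sign. #37950
  • MOW delete now uses partial column updates under the new optimizer. #38751
  • Fixed the potential duplicate key issue in MOW tables in compute-storage decoupled mode. #39018
  • Fixed the issue where MOW unique and dup tables could not modify column order. #37067
  • Fixed the potential data correctness issue caused by segcompaction. #37760
  • Fixed the potential memory leak during column updates. #37706

Other

  • Fixed exceptions that could occur with low probability in TOPN queries. #39119 #39199
  • Fixed the issue where auto-increment IDs might be duplicated during FE restart. #37306
  • Fixed the potential queuing issue in the delete operation priority queue. #37169
  • Optimized the delete retry logic. #37363
  • Fixed the issue with bucket = 0 in table creation statements under the new optimizer. #38971
  • Fixed the issue where FE incorrectly reported success when image generation failed. #37508
  • Fixed the issue where using the wrong nodename when taking FE nodes offline might cause inconsistent FE membership. #37987
  • Fixed the issue where adding partitions in CCR might fail. #37295
  • Fixed the int32 overflow issue in inverted index files. #38891
  • Fixed the issue where a truncate table failure might prevent BE from going offline. #37334
  • Fixed the issue where publish could not continue due to null pointers. #37724 #37531
  • Fixed the potential coredump when manually triggering disk migration. #37712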

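Compute-Storage Decoupled

  • Fixed the issue where show create table might display the file_cache_ttl_seconds attribute twice. #38052
  • Fixed the issue where the Segment Footer TTL was not set correctly after setting the File Cache TTL. #37485
  • Fixed the issue where file cache might cause coredump due to massive conversion of cache types. #38518
  • Fixed the potential fd leak in file cache. #38051
  • Fixed the issue where a schema change job overwriting a compaction job prevented base tablet compaction from completing normally. #38210
  • Fixed the potential inaccuracy of the base compaction score due to a data race. #38006
  • Fixed the issue where error messages returned by imports might not be uploaded correctly to object storage. #38359
  • Fixed the inconsistency in return information between compute-storage decoupled mode and integrated storage-compute mode for 2PC imports. #38076
  • Fixed the issue where an incorrectly set file size during file cache warm-up led to coredump. #38939
  • Fixed the issue where partial column updates did not correctly dequeue delete operations. #37151
  • Fixed compatibility issues with permission persistence in compute-storage decoupled mode. #38136 #37708
  • Fixed the issue where observer did not retry correctly when encountering a -230 error. #37625
  • Fixed the issue where show load with conditions was not analyzed correctly. #37656
  • Fixed the issue where show streamload in compute-storage decoupled mode caused BE coredump. #37903
  • Fixed the issue where copy into did not correctly verify column names in strict mode. #37650
  • Fixed the issue where multi-stream imports into a single table lacked permissions. #38878
  • Fixed the potential overflow issue in getVersionUpdateTimeMs. #38074
  • Fixed the issue where FE azure blob list was not implemented correctly. #37986
  • Fixed the issue where inaccurate azure blob recycling time calculation prevented recycling from being triggered. #37535
  • Fixed the issue where inverted index files were not deleted in compute-storage decoupled mode. #38306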

Lakehouse

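  • Fixed an issue with reading binary data through the Oracle Catalog. #37078
  • Fixed a potential deadlock when fetching external table metadata with multiple FEs. #37756
  • Fixed the issue where a JNI Scanner open failure caused the BE node to crash. #37697
  • Fixed slow reads of the Date type through the Trino Connector Catalog. #37266
  • Optimized the Kerberos authentication logic of the Hive Catalog. #37301
  • Fixed the issue where the region property could be parsed incorrectly when parsing MinIO properties. #37249
  • Fixed a memory leak caused by FE creating too many FileSystem instances. #36954
  • Fixed incorrect time zone information when reading Paimon. #37716
  • Fixed a potential thread leak caused by Hive write-back operations. #36990
  • Fixed a null pointer issue caused by enabling Hive Metastore Event synchronization. #38421
  • Fixed unclear error messages or hangs when creating a Catalog. #37551
  • Fixed behavior inconsistent with Hive when reading Hive Text format tables. #37638
  • Fixed a logic error when switching Catalog and Database. #37828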

MySQL Compatibility

  • Fixed the issue where some flags in the MySQL protocol were set incorrectly after enabling SSL. #38086

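Asynchronous Materialized View

  • Fixed potential build failures when the base table has a very large number of partitions. #37589
  • Fixed the issue where building a nested materialized view performed a full table refresh even though a partition refresh was possible. #38698
  • Fixed the issue where partition refresh could not handle a mix of valid and invalid dependencies when analyzing partition dependencies. #38367
  • Fixed potential build failures of asynchronous materialized views when the final result contains a null type. #37019
  • Fixed potential planning errors in transparent rewriting when a synchronous and an asynchronous materialized view share the same name. #37311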

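Synchronous Materialized View

  • Partition pruning now works correctly on rewritten synchronous materialized views. #38527
  • Synchronous materialized view rewriting no longer selects synchronous materialized views whose data is not ready. #38148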

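Query Optimizer

  • Fixed a potential deadlock when queries run concurrently with operations such as delete. #38660
  • Fixed potentially incorrect bucket pruning on tables bucketed by decimal columns. #37889
  • Fixed potential planning errors when mark join participates in join reorder. #39152
  • Fixed incorrect results when the correlation condition of a correlated subquery is not a simple column. #37644
  • Fixed the issue where partition pruning did not handle or expressions correctly. #38897
  • Fixed potential planning errors in the optimization that swaps the execution order of join and agg. #37343
  • Fixed incorrect constant folding of str_to_date on the datev1 type. #37360
  • Fixed the issue where constant folding of the acos function returned a non-NaN value. #37932
  • Fixed the occasional planning error "The children format needs to be [WhenClause+, DefaultValue?]". #38491
  • Fixed potential planning errors when a projection contains a window function together with both a column and its alias. #38166
  • Fixed potential planning errors when aggregate arguments contain lambda expressions. #37109
  • Fixed the insert error "MultiCastDataSink cannot be cast to DataStreamSink" that could occur in extreme cases. #38526
  • Fixed the issue where the new optimizer did not correctly handle char(0)/varchar(0) passed in at table creation. #38427
  • Fixed incorrect toSql behavior for char(255). #37340
  • Fixed potential planning errors for the nullable property inside the agg_state type. #37489
  • Fixed inaccurate row count statistics for mark join. #38270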

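Query Execution

  • Fixed issues in multiple scenarios where the pipeline execution engine got stuck and queries never finished. #38657 #38206 #38885 #38151 #37297
  • Fixed a coredump caused by null and non-null columns in set difference computation. #38750
  • Fixed the error raised in delete statements when a decimal value has only a fractional part. #37801
  • Fixed incorrect results of the width_bucket function. #37892
  • Fixed the query error that occurred when a single row is very large and the result set is also very large (over 2GB). #37990
  • Fixed a coredump caused by rpc connections not being released correctly during single-replica load. #38087
  • Fixed a coredump caused by the foreach function handling null. #37349
  • Fixed incorrect results of stddev on the DecimalV2 type. #38731
  • Fixed slow bitmap union computation. #37816
  • Fixed the issue where RowsProduced was not set for aggregation operators in the profile. #38271
  • Fixed an overflow when calculating the number of hash table buckets in hash join. #37193 #37493
  • Fixed inaccurate records in the jemalloc cache memory tracker. #37464
  • Added the configuration item enable_stacktrace, which lets users control whether exception stacks are output in BE logs. #37713
  • Fixed the issue where arrow flight sql did not work when enable_parallel_result_sink was set to false. #37779
  • Fixed incorrect use of colocate join. #37361 #37729
  • Fixed a calculation overflow of the round function on the decimal128 type. #37733 #38106
  • Fixed a coredump when a const string is passed to the sleep function. #37681
  • Increased the audit log queue length to fix audit logs not being recorded under thousands of concurrent requests. #37786
  • Fixed a BE coredump caused by too many threads being created when creating workload groups. #38096
  • Fixed a coredump caused by the MULTI_MATCH_ANY function. #37959
  • Fixed transaction rollback caused by insert overwrite auto partition. #38103
  • Fixed the issue where the TimeUtils formatter did not use the correct time zone. #37465
  • Fixed incorrect results of week/yearweek under constant folding. #37376
  • Fixed incorrect results of the convert_tz function. #37358 #38764
  • Fixed a coredump when the collect_set function is used with window functions. #38234
  • Fixed a coredump caused by percentile_approx during rolling upgrades. #39321
  • Fixed a coredump caused by the mod function on abnormal input. #37999
  • Fixed the issue where the hash table build had not finished when the broadcast join probe started running. #37643
  • Fixed potentially incorrect Java UDF results when the same expression is executed in multiple threads. #38612
  • Fixed an overflow caused by an incorrect return type of the conv function. #38001
  • Fixed the incorrect return type of the json_replace function. #37014
  • Fixed the unreasonable nullable property setting of the percentile aggregate function. #37330
  • Fixed unstable results of the histogram function. #38608
  • Fixed incorrect display of task state in the profile. #38082
  • Fixed the issue where some queries were incorrectly cancelled right after system startup. #37662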

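Semi-Structured Data Management

  • Fixed several issues with time series compaction. #39170 #39176
  • Fixed incorrect index size statistics during compaction. #37232
  • Fixed potentially incorrect inverted index matching on very long untokenized strings. #37679 #38218
  • Fixed high memory usage of the array_range and array_with_const functions on large data volumes. #38284 #37495
  • Fixed a potential coredump when selecting columns of ARRAY, MAP, or STRUCT type. #37936
  • Fixed load failures caused by simdjson parsing errors when stream load specifies jsonpath. #38490
  • Fixed abnormal handling of JSON data with duplicate keys. #38146
  • Fixed query errors that could occur after DROP INDEX. #37646
  • Fixed an incorrect return in the row merge check during index compaction. #38732
  • The inverted index v2 format now supports renaming columns. #38079
  • Fixed a coredump when the MATCH function matches an empty string without an index. #37947
  • Fixed inverted index handling of NULL values. #37921 #37842 #38741
  • Fixed incorrect row_store_page_size after FE restarts. #38240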

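Others

  • Fixed the time zone configuration issue: the default time zone is no longer fixed to UTC+8 and is now read from the system configuration. #37294
  • Fixed class conflicts when using Ranger caused by multiple JSR specification implementations. #37575
  • Fixed potentially uninitialized fields in some BE code. #37403
  • Fixed errors in delete statements on random distributed tables. #37985
  • Fixed the issue where creating a synchronous materialized view incorrectly required the alter_priv privilege on the base table. #38011
  • Fixed the issue where a resource used in a TVF was not authorized. #36928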

@gavinchou
Contributor Author

Thanks to everyone who contributed to this release:

133tosakarin 924060929 AshinGau Baymine BePPPower BiteTheDDDDt ByteYue CalvinKirs Ceng23333 DarvenDuan FreeOnePlus Gabriel39 HappenLee JNSimba Jibing-Li KassieZ Lchangliang LiBinfeng-01 Mryange SWJTU-ZhangLei TangSiyang2001 Tech-Circle-48 Vallishp Yukang-Lian Yulei-Yang airborne12 amorynan bobhan1 cambyzju cjj2010 csun5285 dataroaring deardeng eldenmoon englefly feiniaofeiafei felixwluo freemandealer gavinchou ghkang98 hello-stephen hubgeter hust-hhb jacktengg kaijchen kaka11chen keanji-x liaoxin01 liutang123 luwei16 luzhijing lxr599 morningman morrySnow mrhhsg mymeiyi platoneko qidaye qzsee seawinde shuke987 sollhui starocean999 suxiaogang223 w41ter wangbo wangshuo128 whutpencil wsjz wuwenchi wyxxxcat xiaokang xiedeyantu xinyiZzz xy720 xzj7019 yagagagaga yiguolei yujun777 z404289981 zclllyybb zddr zfr9527 zhangbutao zhangstar333 zhannngchen zhiqiang-hhhh zjj zy-kkk zzzxl1993

@gavinchou gavinchou changed the title [WIP] 3.0.1 release note 3.0.1 release note Aug 23, 2024
@gavinchou gavinchou changed the title 3.0.1 release note Release Note 3.0.1 Aug 23, 2024
@zhangm365

Lakehouse:

Optimized the read logic for Hudi snapshot reads, returning an empty set when the snapshot is empty, consistent with Spark behavior. #37702

Should the wording about the Hudi query type be changed as follows?
Hudi snapshot reads ---> Hudi incremental reads
