
V1.3.4 #472

Merged: 37 commits into main, Dec 27, 2024
Conversation

@flarco (Collaborator) commented Dec 23, 2024

Task Storage Refactoring

  • Consolidated StoreInsert and StoreUpdate into a single StoreSet function
  • Improved state management and storage operations
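The insert/update consolidation can be pictured with a minimal sketch. The state type, the backing store, and the map-based persistence below are illustrative stand-ins, not the actual sling internals; only the `StoreSet` name comes from the notes above.

```go
package main

import "fmt"

// taskState is a hypothetical stand-in for the task execution state
// that sling persists between runs.
type taskState struct {
	StreamID string
	Value    string
}

// store is an illustrative in-memory backend; the real implementation
// persists elsewhere.
var store = map[string]taskState{}

// StoreSet inserts the state when the key is new and updates it when it
// already exists -- one code path replacing a separate
// StoreInsert / StoreUpdate pair.
func StoreSet(s taskState) {
	store[s.StreamID] = s // map assignment is insert-or-update
}

func main() {
	StoreSet(taskState{StreamID: "orders", Value: "2024-12-01"}) // insert
	StoreSet(taskState{StreamID: "orders", Value: "2024-12-02"}) // update
	fmt.Println(store["orders"].Value)
}
```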

Connection Handling

  • Added new connection context handling methods (AsDatabaseContext, AsFileContext)
  • Improved connection pooling and caching mechanisms
  • Enhanced TLS configuration for MySQL connections

Delete Missing Feature

  • Added new DeleteMissing functionality for incremental mode
  • Introduced slingDeletedAtColumn for tracking deleted records
  • Added soft/hard delete options
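Roughly, delete-missing works by removing (or stamping) target rows whose keys no longer appear in the freshly loaded data. The sketch below is an assumption about the generated SQL -- the table names, the `_sling_deleted_at` column spelling, and the helper itself are illustrative:

```go
package main

import "fmt"

// deleteMissingSQL sketches the statement that could run after an
// incremental load: target rows whose keys are absent from the newly
// loaded temp table are soft-deleted (stamped with a deleted-at
// column) or hard-deleted. Names are illustrative.
func deleteMissingSQL(target, temp, key, mode string) string {
	notIn := fmt.Sprintf("%s not in (select %s from %s)", key, key, temp)
	if mode == "soft" {
		return fmt.Sprintf(
			"update %s set _sling_deleted_at = now() where %s", target, notIn)
	}
	return fmt.Sprintf("delete from %s where %s", target, notIn)
}

func main() {
	fmt.Println(deleteMissingSQL("public.orders", "orders_tmp", "id", "soft"))
	fmt.Println(deleteMissingSQL("public.orders", "orders_tmp", "id", "hard"))
}
```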

DuckDB Improvements

  • Added HTTP-based import method alongside existing CSV and named pipes methods
  • Enhanced partitioned file writing for both Parquet and CSV formats
  • Improved temporary table handling
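The HTTP-based import path can be sketched as follows: stand up a local server that streams the rows as CSV, then point DuckDB's `read_csv` at it. The endpoint path, sample data, and generated SQL are assumptions for illustration:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// startCSVServer starts a local HTTP server on an ephemeral port that
// streams data as CSV, and returns the URL DuckDB would read from.
func startCSVServer() string {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/data.csv", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/csv")
		fmt.Fprint(w, "id,name\n1,alice\n2,bob\n") // stream rows here
	})
	go http.Serve(ln, mux)
	return fmt.Sprintf("http://%s/data.csv", ln.Addr())
}

func main() {
	url := startCSVServer()
	// The kind of SQL DuckDB would run against the local server:
	fmt.Printf("insert into tmp select * from read_csv('%s', header=true)\n", url)
}
```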

Incremental Processing

  • Enhanced state management for incremental updates
  • Added support for incremental state with update keys
  • Improved handling of incremental values via state storage

Dependencies

  • Updated various dependencies including:
    • github.com/flarco/g to v0.1.133
    • github.com/microsoft/go-mssqldb to v1.8.0
    • Added github.com/labstack/echo/v4 for HTTP handling

- Updated github.com/microsoft/go-mssqldb from v1.7.2 to v1.8.0.
- Added handling for casting string columns to fixed-type columns in Snowflake.
- Addresses an issue where string-to-fixed type casting was not properly handled, leading to potential errors.
- Improved error handling during replication to prevent cascading failures.
- Added `FailErr` field to `ReplicationConfig` to store the error encountered when a connection issue occurs.
- Modified `replicationRun` to set `FailErr` when a connection error is detected, stopping further execution.
- Added a unique connection ID to the MySQL connection to improve TLS configuration management.
- The `Init()` function now registers the TLS configuration using the unique connection ID.
- Added error handling for TLS configuration registration.
- Modified `GetURL` to use the unique connection ID as the key for custom TLS configurations.
- This prevents issues with multiple connections potentially using the same TLS configuration.
- Added functionality to expand environment variables within the envfile.
- Improved flexibility and ease of configuration.
- Enhanced security by allowing sensitive information to be stored in environment variables rather than directly in the envfile.
- Replaced `g.Rme` with `g.Rmd` in `LoadEnvFile` function to fix a potential bug in environment variable processing.
- Updated the `github.com/flarco/g` dependency to version v0.1.133.
- Fixed a typo in the `json` type mapping for `nvarchar(max)` in `types_general_to_native.tsv`. The previous entry incorrectly listed `nvarchar(65535)` twice; this commit corrects it to the consistent `nvarchar(max)`.
- Replaced manual map population with a loop for better readability and maintainability.
- Improved code clarity and reduced redundancy.
- Use `osext.Executable()` to reliably get the executable path, replacing previous method which relied on string matching.  This improves accuracy and robustness across different environments.
- Moved `osext.Executable()` call to `init()` function to ensure it's run once during initialization.
- Updated `checkUpdate` function to use the updated `env.Executable` variable for more accurate package detection.  This removes the need to directly call `osext.Executable()` within this function.
- Replaced connection name with connection hash in cache key to fix caching issues.
- This ensures that different connections with the same name but different configurations are cached separately.
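One way to derive such a key is to hash the connection's full property set; the sketch below is an assumed derivation (the real key computation may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// connCacheKey hashes the full connection properties, so two
// connections sharing a name but differing in configuration get
// distinct cache entries.
func connCacheKey(name string, props map[string]string) string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order before hashing
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s;", k, props[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return name + "/" + hex.EncodeToString(sum[:8])
}

func main() {
	a := connCacheKey("mydb", map[string]string{"host": "a", "port": "5432"})
	b := connCacheKey("mydb", map[string]string{"host": "b", "port": "5432"})
	fmt.Println(a != b) // same name, different config => different keys
}
```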
- increased precision of timestamp layout in clickhouse template to 9 decimal places for improved accuracy
- enhanced sqlserver timestamp layout to handle microseconds and timezone information more reliably, improving data consistency
- Updated datetime format strings in `dateLayouts` to use 9 digits for nanosecond precision instead of 6.
- Modified `CastToString`, `CastToStringSafe`, and `CastToStringSafeMask` functions to reflect the change in precision.  This ensures consistent and accurate handling of datetime values across different functions and contexts.
- Added new functions to extract partition levels from file paths and truncate timestamps based on the specified partition level.
- Added tests to cover all scenarios for partition level extraction and timestamp truncation.
- Improved validation of partition levels and added a test case to cover invalid partition levels.
- Created PartitionLevel type to represent the available partition levels, and added corresponding methods: IsValid and TruncateTime.
- Added tests to cover all scenarios for partition level truncation.

✨ feat(cmd/sling): improve sling CLI test and add new tests

- Added a check to see if the sling binary exists before running tests.
- Added new test cases for different scenarios, including CSV source with single quote and $symbol quote, direct insert full-refresh, and incremental with delete missing (soft and hard).
- Added a test case for writing to partitioned parquet files, both locally and on AWS S3.
- Added test cases to cover incremental writing to partitioned parquet files.
- Added new test cases to cover all scenarios for different partitioning options.

🐛 fix(core/dbio/database): improve duckdb log message

- Changed the log level for the "The file ... does not exist" message from debug to trace to avoid cluttering the logs

♻️ refactor(core/dbio/filesys): improve copy from local recursive function

- Changed concurrency handling for file copying with a context to allow for proper cancellation
- Added error handling when processing files in a recursive copy operation
- Added debug log when writing partitions

🐛 fix(core/dbio/iop): improve validation and handling of partition levels in DuckDB

- Changed partition fields from a string array to an array of PartitionLevel to improve validation and error handling.
- Added new enum `PartitionLevel` for improved type safety and readability of partition levels.
- Fixed bug where partition expressions for month, week, and day were not correctly formatted.
- Added validation to prevent invalid PartitionLevels being used.
- Added new `PartitionLevelsAscending` and `PartitionLevelsDescending` constants for consistency and clarity.

🐛 fix(core/dbio/scripts): update test script for better code coverage

- Updated test script to run all test cases of the `iop` module.

♻️ refactor(core/sling): improve incremental value handling

- Added `IncrementalGTE` flag to config to allow >= comparison for incremental mode.
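The operator switch amounts to this (the helper and its signature are illustrative, not sling's actual code): `>` only picks up rows strictly newer than the stored state, while `>=` re-reads rows sharing the last-seen update-key value, which matters when the key is coarse (e.g. a date).

```go
package main

import "fmt"

// incrementalWhere builds the incremental filter, using >= when the
// IncrementalGTE-style flag is set and > otherwise.
func incrementalWhere(updateKey, lastValue string, gte bool) string {
	op := ">"
	if gte {
		op = ">="
	}
	return fmt.Sprintf("where %s %s %s", updateKey, op, lastValue)
}

func main() {
	fmt.Println(incrementalWhere("updated_at", "'2024-12-23'", false))
	fmt.Println(incrementalWhere("updated_at", "'2024-12-23'", true))
}
```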

🐛 fix(core/sling): improve handling of incremental mode with update keys

- Improved handling of incremental mode with update keys to use SLING_STATE environment variable for better state management.
- Improved error handling when `SLING_STATE` is not set but using `update_key` field in incremental mode.

♻️ refactor(core/sling): remove unnecessary function

- Removed the `extractPartFields` function as its functionality was superseded by the newly introduced `iop.ExtractPartitionFields` function.

🐛 fix(core/sling): improve handling of incremental writes to files

- Improved handling of incremental writes to files by using `>=` instead of `>` when comparing update keys, and updating the query accordingly.
- updated `ReadFromDB` function to use `>=` for `incremental_gte` property and added tests to cover this functionality
- updated sling state handling to support update key with incremental mode and sling state
- improved logic for determining whether to use duckdb for writing data
- optimized condition check for incremental state with update key.
- Added a new import method using a local HTTP server to improve performance and handle large datasets.
- The HTTP server serves data in CSV format, allowing DuckDB to efficiently import the data using `read_csv`.
- Implemented error handling for server startup and data streaming.
- Improved logging to track import progress and handle potential issues.
- Added support for configuring the CSV import parameters.
- added a small delay after waiting for the local server to start to improve stability
- added check for runtime variables in ObjectHasStreamVars()
- set Single to true if no runtime variables are found in wildcard replication
- improves handling of wildcard replication scenarios without runtime variables
- Correctly set `Single` flag for wildcard targets without runtime variables, considering `FileMaxBytes` and `FileMaxRows` settings.
- Improves accuracy of replication configuration processing for wildcard targets.
- corrected the prefix assertion to include "file://" to accurately reflect the stream path.
- Updated the `table_incremental_from_postgres` test to use the `email` column instead of the `code` column for incremental updates.  This aligns with recent schema changes and ensures the test continues to function correctly.
- changed primary key for StarRocks to `id,email` to resolve incompatibility with decimal primary keys
- updated `suite.db.template.tsv` to reflect the change in primary key for the `table_incremental_from_postgres` test case.  This ensures that tests run correctly with StarRocks.
- StarRocks doesn't support decimal columns as primary keys; fixed the primary key to `id` only
- Increased timeout for database tests from 15m to 25m to prevent intermittent failures due to long-running operations.
- Addresses an issue where wildcard streams were not processed correctly, leading to errors.
- Improved wildcard stream handling by using a clone stream to apply defaults.
- Added default setting for single streams with zero file max variables.
- Ensured that the correct stream configuration is used for wildcard streams.
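The wildcard `Single`-flag decision described above can be sketched as a small predicate; the parameter names are illustrative stand-ins for the config fields mentioned in the notes:

```go
package main

import "fmt"

// shouldSetSingle: a wildcard stream whose target object has no runtime
// variables resolves to one fixed object, so it should be written as a
// single file -- unless file_max_bytes / file_max_rows settings require
// splitting the output into parts.
func shouldSetSingle(hasRuntimeVars bool, fileMaxBytes, fileMaxRows int64) bool {
	if hasRuntimeVars {
		return false // each stream resolves to its own object
	}
	return fileMaxBytes == 0 && fileMaxRows == 0
}

func main() {
	fmt.Println(shouldSetSingle(false, 0, 0))      // true: one fixed target
	fmt.Println(shouldSetSingle(false, 0, 500000)) // false: max rows splits output
	fmt.Println(shouldSetSingle(true, 0, 0))       // false: runtime vars per stream
}
```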
@flarco flarco merged commit 9f2fdc5 into main Dec 27, 2024
8 checks passed
@flarco flarco deleted the v1.3.4 branch December 27, 2024 16:10