Feat: Multiple perf improvements and viztracer performance profiling #298
Conversation
Walkthrough

This update introduces performance optimizations, enhanced schema management, and new configuration settings across multiple files. Key changes include caching for schema normalization and new trace-visualization configuration with VizTracer. New scripts for performance testing and profiling with the Snowflake and BigQuery caches have been added. Parallel file uploads in Snowflake processing and increased batch sizes in file processing further improve performance.
Sequence Diagrams

Schema Management with Caching

```mermaid
sequenceDiagram
    participant User
    participant SQLProcessor
    User->>SQLProcessor: _ensure_schema_exists()
    SQLProcessor->>SQLProcessor: Check _known_schemas_list
    alt Schema exists
        SQLProcessor->>User: Return
    else Schema does not exist
        SQLProcessor->>SQLProcessor: Create schema
        SQLProcessor->>SQLProcessor: Add schema to _known_schemas_list
        SQLProcessor->>User: Return
    end
```
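The caching pattern the diagram describes can be illustrated with a small sketch (class, method, and SQL below are hypothetical, not the actual PyAirbyte implementation):

```python
class SchemaCachingSketch:
    """Illustrative sketch of caching known schema names to avoid repeated checks."""

    def __init__(self) -> None:
        # Schemas already confirmed to exist (or created) during this session.
        self._known_schemas_list: list[str] = []

    def _ensure_schema_exists(self, schema_name: str) -> None:
        # Fast path: skip the database round trip if the schema is already known.
        if schema_name in self._known_schemas_list:
            return
        self._execute_sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
        self._known_schemas_list.append(schema_name)

    def _execute_sql(self, sql: str) -> None:
        # Stand-in for the real database call.
        print(f"Executing: {sql}")
```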
Parallel File Upload in Snowflake Processing

```mermaid
sequenceDiagram
    participant SQLProcessor
    participant ThreadPoolExecutor
    participant Snowflake
    SQLProcessor->>ThreadPoolExecutor: Initialize with MAX_UPLOAD_THREADS (8)
    loop For each file to upload
        SQLProcessor->>ThreadPoolExecutor: Submit file upload task
    end
    ThreadPoolExecutor->>Snowflake: Upload files in parallel
    Snowflake->>SQLProcessor: Acknowledge uploads
```
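A minimal sketch of this upload pattern, assuming a per-file upload helper (the function names and file handling below are illustrative, not the actual snowflake.py code):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_UPLOAD_THREADS = 8  # mirrors the constant described in this PR


def _upload_one_file(file_path: Path) -> str:
    # Stand-in for the real per-file upload (e.g. a Snowflake PUT to a stage).
    return f"uploaded {file_path.name}"


def upload_files_in_parallel(files: list[Path]) -> list[str]:
    # Submit one upload task per file and collect the results.
    with ThreadPoolExecutor(max_workers=MAX_UPLOAD_THREADS) as executor:
        return list(executor.map(_upload_one_file, files))
```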
@coderabbitai review

Actions performed: Review triggered.
Actionable comments posted: 4
Outside diff range, codebase verification and nitpick comments (2)
examples/run_perf_test_reads.py (2)
`16-18`: **Consider adding a script header or usage documentation.**

The VizTracer commands are included directly in the script comments. It would be beneficial to encapsulate these in a script header or provide a usage section in the documentation to guide users on how to execute the script properly.
`99-119`: **Enhance command-line argument validation.**

The script uses `argparse` to handle command-line arguments, which is good. However, consider adding more detailed help messages or validations to ensure that the user inputs are within expected ranges or formats. This will improve the user experience and prevent runtime errors due to invalid inputs.

```diff
+ parser.add_argument("-e", type=int, required=True, choices=range(2, 7), help="The scale, as a power of 10. Choose between 2 and 6.")
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
- `poetry.lock` is excluded by `!**/*.lock`
Files selected for processing (10)
- .viztracerrc (1 hunks)
- airbyte/_future_cdk/sql_processor.py (3 hunks)
- airbyte/_processors/file/base.py (1 hunks)
- airbyte/_processors/sql/snowflake.py (3 hunks)
- airbyte/_util/name_normalizers.py (2 hunks)
- airbyte/progress.py (4 hunks)
- examples/run_bigquery_faker.py (1 hunks)
- examples/run_perf_test_cache_snowflake.py (1 hunks)
- examples/run_perf_test_reads.py (1 hunks)
- pyproject.toml (1 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Additional comments not posted (13)
.viztracerrc (1)
`4-8`: **Review of VizTracer configuration settings.**

The configuration settings for VizTracer are as follows:

- `max_stack_depth = 20`: Limits the stack depth to 20, which is reasonable for most applications.
- `unique_output_file = True`: Ensures that each output file is unique.
- `output_file = viztracer_report.json`: Specifies the output file name.
- `tracer_entries = 5_000_000`: Allocates memory for 5 million trace entries.

These settings seem appropriate for performance profiling and are consistent with best practices.
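Based on the settings listed above, the `.viztracerrc` file presumably looks roughly like the following (a reconstruction from this review, not a verbatim copy of the file):

```ini
[default]
max_stack_depth = 20
unique_output_file = True
output_file = viztracer_report.json
tracer_entries = 5_000_000
```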
examples/run_perf_test_cache_snowflake.py (3)
`7-7`: **Performance profiling command.**

The command to run the performance profiling is specified in the docstring. This is a good practice as it provides clear instructions on how to execute the script for its intended purpose.
`18-38`: **Sensitive information handling and Snowflake cache setup.**

The script retrieves sensitive information from Google Cloud's Secret Manager and uses it to configure a Snowflake cache. It's crucial to ensure that `credentials_json` and other sensitive data are handled securely and not exposed in logs or error messages.

Verification successful: the script does not expose sensitive information in logs or error messages. The `print(f"Stream {name}: {len(list(result[name]))} records")` call is safe, as `name` is a static string and `result` is derived from a cache read operation.

Scripts executed for the analysis:

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.
# Test: Search for logging statements that might expose sensitive information.
rg --type python -A 1 $'print|log' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 110

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.
# Test: Search for logging statements that might expose sensitive information.
rg 'print|log' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 123

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.
# Search for the definitions and usages of `name` and `result` in the specified file.
rg 'name|result' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 308
`40-49`: **Data retrieval and processing.**

The script configures an Airbyte source, selects streams, and reads data using the configured Snowflake cache. It's important to verify that the `read` method handles exceptions properly, especially since it interacts with external systems.

examples/run_bigquery_faker.py (1)
`52-52`: **Improved handling of missing `dataset_id`.**

The change to use `.get` with a default value for `dataset_id` is a good practice. It prevents the script from failing if `dataset_id` is not provided, making the script more robust.
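For illustration, the `.get`-with-default pattern looks like this (the dictionary contents and default value here are hypothetical):

```python
# Hypothetical config dict; "dataset_id" may or may not be present.
config = {"project_name": "my-project"}

# Falls back to a default instead of raising KeyError when the key is missing.
dataset_id = config.get("dataset_id", "pyairbyte_default_dataset")
```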
airbyte/_util/name_normalizers.py (1)

`52-52`: **Caching implementation in `normalize` method.**

The addition of `@functools.cache` to the `normalize` method is an excellent use of Python's built-in caching mechanism. This should significantly improve performance by avoiding redundant calculations, especially in scenarios where normalization is frequently called with the same inputs.
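A rough sketch of this caching approach (the normalization rule shown is simplified and illustrative, not the exact PyAirbyte logic):

```python
import functools
import re


class NormalizerSketch:
    """Sketch of a name normalizer whose results are memoized."""

    @staticmethod
    @functools.cache
    def normalize(name: str) -> str:
        # Repeated calls with the same input return the cached result
        # instead of re-running the regex substitution.
        return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()
```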
airbyte/_processors/sql/snowflake.py (2)

`6-6`: **Validate the import of ThreadPoolExecutor.**

Ensure that `ThreadPoolExecutor` is used appropriately within the context of the file and that its import does not introduce unnecessary dependencies or complexity.
`30-30`: **Confirm the usage of `MAX_UPLOAD_THREADS`.**

The addition of the `MAX_UPLOAD_THREADS` constant is appropriate for controlling the number of concurrent threads. Verify that this constant is used consistently across the file and does not lead to resource exhaustion.

Verification successful: `MAX_UPLOAD_THREADS` is defined and used in the `ThreadPoolExecutor` with `max_workers=MAX_UPLOAD_THREADS`, which is appropriate and should help prevent resource exhaustion.

Scripts executed for the analysis:

```bash
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.
# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg --type python 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
```

Length of output: 105

```bash
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.
# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
```

Length of output: 160
airbyte/_processors/file/base.py (1)
`26-26`: **Confirm the impact of increased `DEFAULT_BATCH_SIZE`.**

Increasing the `DEFAULT_BATCH_SIZE` could significantly affect memory usage and performance. It's important to ensure that this change has been thoroughly tested, especially in scenarios with large datasets.

airbyte/progress.py (1)
Line range hint `88-402`: **Review changes to time formatting and type handling.**

The changes to support float values for time calculations are crucial for precision. Ensure that these changes are consistent throughout the file and that they do not introduce any regressions in time tracking or display functions.
airbyte/_future_cdk/sql_processor.py (3)
`165-165`: **Attribute `_known_schemas_list` added: good for performance, consider adding maintenance comments.**

The addition of `_known_schemas_list` is a positive change for performance, as it reduces the number of database schema checks. However, consider adding comments on how and when this list should be updated or invalidated to ensure it remains accurate.
`311-313`: **Optimized schema existence check: good for performance, consider handling cache inconsistencies.**

The method now first checks the cached list of known schemas, which is a good optimization. However, consider adding error handling or a mechanism to handle situations where the cache might be stale or inaccurate.
`381-397`: **Added `force_refresh` parameter to `_get_schemas_list`: enhances flexibility in cache management.**

The addition of the `force_refresh` parameter allows for better control over schema list caching, which can be crucial in environments where schema changes are frequent. Consider adding unit tests to ensure that this parameter works as expected, especially in scenarios where the cache needs to be bypassed.
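A hedged sketch of what a `force_refresh`-style cached getter can look like (method and attribute names are illustrative, not the exact sql_processor.py signature):

```python
class SchemaListCacheSketch:
    """Illustrates a cached list getter with an optional force_refresh bypass."""

    def __init__(self) -> None:
        self._known_schemas_list: list[str] = []

    def _get_schemas_list(self, *, force_refresh: bool = False) -> list[str]:
        # Return the cached list unless the caller explicitly asks for a refresh.
        if self._known_schemas_list and not force_refresh:
            return self._known_schemas_list
        self._known_schemas_list = self._fetch_schemas_from_database()
        return self._known_schemas_list

    def _fetch_schemas_from_database(self) -> list[str]:
        # Stand-in for the real inspector/database query.
        return ["airbyte_raw", "public"]
```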
Actionable comments posted: 0
/fix-pr
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- .viztracerrc (1 hunks)
- airbyte/_processors/sql/snowflake.py (3 hunks)
Files skipped from review due to trivial changes (1)
- .viztracerrc
Additional context used
Ruff
airbyte/_processors/sql/snowflake.py
136-136: Use `logging.exception` instead of `logging.error`. Replace with `exception`. (TRY400)

136-136: Undefined name `logging`. (F821)
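A small sketch of the fix these two findings point at: import `logging` (or use a module-level logger) and call `exception` inside the `except` block so the traceback is captured. The surrounding upload code here is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def upload_with_logging(file_path: str) -> None:
    try:
        raise RuntimeError(f"simulated upload failure for {file_path}")
    except RuntimeError:
        # logger.exception (equivalent to logging.exception; addresses TRY400)
        # records the traceback automatically, and the logging import fixes F821.
        logger.exception("File upload failed for %s", file_path)
```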
Additional comments not posted (2)
airbyte/_processors/sql/snowflake.py (2)
`6-6`: **Import of ThreadPoolExecutor is appropriate for the context.**

The import of `ThreadPoolExecutor` aligns with the PR's objective to improve performance through parallel processing. This is a good use of Python's concurrent programming capabilities to enhance data upload speeds.
`30-31`: **Constant `MAX_UPLOAD_THREADS` is well-defined.**

Setting a constant for the maximum number of upload threads is a good practice. It helps maintain the code's configurability and makes it easier to adjust the performance tuning.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- .gitignore (1 hunks)
- airbyte/_processors/sql/snowflake.py (4 hunks)
- airbyte/caches/bigquery.py (2 hunks)
- examples/run_perf_test_reads.py (1 hunks)
Files skipped from review due to trivial changes (1)
- .gitignore
Files skipped from review as they are similar to previous changes (2)
- airbyte/_processors/sql/snowflake.py
- examples/run_perf_test_reads.py
Additional context used
Learnings (1)
airbyte/caches/bigquery.py (1)
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#281
File: airbyte/caches/bigquery.py:40-43
Timestamp: 2024-07-09T04:12:04.672Z
Learning: The `BigQueryCache.get_arrow_dataset` method should have a docstring that correctly states the reason for the `NotImplementedError` as BigQuery not supporting `to_arrow`, instead of incorrectly mentioning `pd.read_sql_table`.
Additional comments not posted (2)
airbyte/caches/bigquery.py (2)
`20-20`: **Approved: Import statement for `NoReturn`.**

The import of `NoReturn` from the `typing` module is correctly placed and necessary for the changes in the `get_arrow_dataset` method.
`41-41`: **Approved: Change in return type and updated exception handling.**

Changing the return type from `None` to `NoReturn` clarifies that `get_arrow_dataset` is not intended to return any value and will always raise an exception. This is a good practice for functions that are designed to not return under normal operation.

Additionally, the docstring and the exception message have been updated to reflect the correct reason why `get_arrow_dataset` is not implemented, which aligns with the previous learning. The provided URL in the docstring is a good practice as it offers more context on the issue.
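As a sketch of the annotation pattern being approved here (the class body, docstring, and error message below are paraphrased, not the exact bigquery.py text):

```python
from typing import NoReturn


class BigQueryCacheSketch:
    def get_arrow_dataset(self, stream_name: str) -> NoReturn:
        """Always raises, because BigQuery does not support `to_arrow`."""
        raise NotImplementedError(
            "BigQuery does not support to_arrow, so no arrow dataset "
            f"is available for stream '{stream_name}'."
        )
```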
Summary by CodeRabbit

- New Features
- Enhancements
- Bug Fixes
- Chores
  - `.gitignore` updated to exclude VizTracer log files and packaged documentation zips.

General Perf Improvements (Read-to-DuckDB)
Before (15K records per second):
After (23K records per second):
Snowflake Perf Improvements (50K records)
Before (load takes 32 seconds):
After (load takes 25 seconds):
Snowflake Perf Improvements (500K records)
Before (load takes 2 min, 43 seconds):
After (load takes 25 seconds):
Snowflake 2M Records test:
Total time is 3 min, 31 sec: 1 min, 53 sec to extract the data and 1 min, 48 sec to load it to Snowflake.
Comparable Airbyte Cloud job here (internal link):
https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/c7986ffb-124c-4c65-be06-a74970f2e6fb/status
Snowflake 50M Records test:
Comparable Airbyte Cloud job here (internal link):
https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/b356cca2-f81d-4f40-bf8b-42d50de0027e/job-history