Feat: Multiple perf improvements and viztracer performance profiling #298

Merged: aaronsteers merged 18 commits into main from feat/perf-improvements-and-profiling on Jul 17, 2024

Conversation

@aaronsteers (Contributor) commented Jul 17, 2024

Summary by CodeRabbit

  • New Features

    • Introduced default configuration settings for VizTracer to enhance program execution tracing and visualization.
    • Added scripts for profiling performance of Snowflake cache and read throughput.
  • Enhancements

    • Improved schema handling with force refresh option for enhanced data integrity.
    • Updated batch size limit for file writing to improve performance.
    • Enabled parallel file uploads in Snowflake using thread pools.
    • Applied caching to the normalize method for better performance.
    • Enhanced time calculation precision in progress tracking.
  • Bug Fixes

    • Fixed a bug in dataset name handling for BigQuery integration tests.
  • Chores

    • Updated .gitignore to exclude VizTracer log files and packaged documentation zips.

General Perf Improvements (Read-to-DuckDB)

Before (15K records per second): [screenshot]

After (23K records per second): [screenshot]

Snowflake Perf Improvements (50K records)

Before (load takes 32 seconds): [screenshot]

After (load takes 25 seconds): [screenshot]

Snowflake Perf Improvements (500K records)

Before (load takes 2 min, 43 seconds): [screenshot]

After (load takes 25 seconds): [screenshot]

Snowflake 2M Records test:

Total time is 3 min 31 s: 1 min 53 s to extract the data and 1 min 48 s to load it into Snowflake.

[screenshot]

Comparable Airbyte Cloud job here (internal link):

https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/c7986ffb-124c-4c65-be06-a74970f2e6fb/status

Snowflake 50M Records test:

[screenshot]

Comparable Airbyte Cloud job here (internal link):

https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/b356cca2-f81d-4f40-bf8b-42d50de0027e/job-history


coderabbitai bot commented Jul 17, 2024

Walkthrough

This update introduces performance optimizations, enhanced schema management, and new configuration settings across multiple files. Key changes include caching for schema checks and name normalization, plus trace-visualization configuration for VizTracer. New scripts for performance testing and profiling against the Snowflake and BigQuery caches have also been added. Parallel file uploads in Snowflake processing and a larger file-writing batch size further improve performance.

Changes

| Files | Summary of Changes |
| --- | --- |
| `.viztracerrc`, `.gitignore` | Added VizTracer configuration settings and updated `.gitignore` to exclude VizTracer log files and packaged docs. |
| `airbyte/_future_cdk/sql_processor.py` | Introduced schema caching and a force-refresh parameter for schema list retrieval. |
| `airbyte/_processors/file/base.py` | Updated the default batch size from 10,000 to 100,000. |
| `airbyte/_processors/sql/snowflake.py` | Added parallel file upload using `ThreadPoolExecutor` with a maximum of 8 threads. |
| `airbyte/_util/name_normalizers.py` | Added caching to the `normalize` method in `LowerCaseNormalizer` using `functools.cache`. |
| `airbyte/progress.py` | Changed time formatting and type handling for improved precision in time calculations and display updates. |
| `examples/run_bigquery_faker.py` | Modified to use a default dataset name if not specified in the configuration. |
| `examples/run_perf_test_cache_snowflake.py` | New script for profiling performance with the Snowflake cache. |
| `examples/run_perf_test_reads.py` | New script for profiling read throughput, with command-line arguments and VizTracer profiling. |
| `airbyte/caches/bigquery.py` | Updated return type of `get_arrow_dataset` to `NoReturn`, raising `NotImplementedError` explicitly. |

Sequence Diagrams

Schema Management with Caching

```mermaid
sequenceDiagram
    participant User
    participant SQLProcessor

    User->>SQLProcessor: _ensure_schema_exists()
    SQLProcessor->>SQLProcessor: Check _known_schemas_list
    alt Schema exists
        SQLProcessor->>User: Return
    else Schema does not exist
        SQLProcessor->>SQLProcessor: Create schema
        SQLProcessor->>SQLProcessor: Add schema to _known_schemas_list
        SQLProcessor->>User: Return
    end
```

Parallel File Upload in Snowflake Processing

```mermaid
sequenceDiagram
    participant SQLProcessor
    participant ThreadPoolExecutor
    participant Snowflake

    SQLProcessor->>ThreadPoolExecutor: Initialize with MAX_UPLOAD_THREADS (8)
    loop For each file to upload
        SQLProcessor->>ThreadPoolExecutor: Submit file upload task
    end
    ThreadPoolExecutor->>Snowflake: Upload files in parallel
    Snowflake->>SQLProcessor: Acknowledge uploads
```
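As context for the diagram above, here is a minimal, hypothetical sketch of the parallel-upload pattern; the real logic lives in airbyte/_processors/sql/snowflake.py, and the upload_file helper (and its stand-in body) is an assumption rather than the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_UPLOAD_THREADS = 8  # constant introduced in this PR


def upload_file(file_path: Path) -> None:
    """Stand-in for the per-file upload (e.g. a PUT to a Snowflake stage)."""
    print(f"Uploading {file_path} ...")


def upload_files_in_parallel(files: list[Path]) -> None:
    """Submit one upload task per file and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=MAX_UPLOAD_THREADS) as executor:
        futures = [executor.submit(upload_file, f) for f in files]
        for future in futures:
            future.result()  # re-raises any exception from the worker thread
```

Waiting on every future before leaving the executor block ensures a failed upload surfaces as an exception rather than being silently dropped.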


@aaronsteers (Contributor, Author) commented:

@coderabbitai review


coderabbitai bot commented Jul 17, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment

Actionable comments posted: 4

Outside diff range, codebase verification and nitpick comments (2)
examples/run_perf_test_reads.py (2)

16-18: Consider adding a script header or usage documentation.

The VizTracer commands are included directly in the script comments. It would be beneficial to encapsulate these in a script header or provide a usage section in the documentation to guide users on how to execute the script properly.


99-119: Enhance command-line argument validation.

The script uses argparse to handle command-line arguments, which is good. However, consider adding more detailed help messages or validations to ensure that the user inputs are within expected ranges or formats. This will improve the user experience and prevent runtime errors due to invalid inputs.

```diff
+ parser.add_argument("-e", type=int, required=True, choices=range(2, 7), help="The scale, as a power of 10. Choose between 2 and 6.")
```
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 326683c and c282122.

Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
Files selected for processing (10)
  • .viztracerrc (1 hunks)
  • airbyte/_future_cdk/sql_processor.py (3 hunks)
  • airbyte/_processors/file/base.py (1 hunks)
  • airbyte/_processors/sql/snowflake.py (3 hunks)
  • airbyte/_util/name_normalizers.py (2 hunks)
  • airbyte/progress.py (4 hunks)
  • examples/run_bigquery_faker.py (1 hunks)
  • examples/run_perf_test_cache_snowflake.py (1 hunks)
  • examples/run_perf_test_reads.py (1 hunks)
  • pyproject.toml (1 hunks)
Files skipped from review due to trivial changes (1)
  • pyproject.toml
Additional comments not posted (13)
.viztracerrc (1)

4-8: Review of VizTracer configuration settings.

The configuration settings for VizTracer are as follows:

  • max_stack_depth = 20: Limits the stack depth to 20, which is reasonable for most applications.
  • unique_output_file = True: Ensures that each output file is unique.
  • output_file = viztracer_report.json: Specifies the output file name.
  • tracer_entries = 5_000_000: Allocates memory for 5 million entries.

These settings seem appropriate for performance profiling and are consistent with best practices.
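As a hedged illustration (not the project's actual profiling harness), the same settings map onto VizTracer's Python API roughly as follows; run_workload is a hypothetical stand-in for the profiled read or load step.

```python
from viztracer import VizTracer


def run_workload() -> None:
    """Hypothetical stand-in for the code being profiled."""
    sum(i * i for i in range(1_000_000))


with VizTracer(
    output_file="viztracer_report.json",  # mirrors output_file in .viztracerrc
    max_stack_depth=20,                   # mirrors max_stack_depth
    tracer_entries=5_000_000,             # mirrors tracer_entries
):
    run_workload()
```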

examples/run_perf_test_cache_snowflake.py (3)

7-7: Performance profiling command.

The command to run the performance profiling is specified in the docstring. This is a good practice as it provides clear instructions on how to execute the script for its intended purpose.


18-38: Sensitive information handling and Snowflake cache setup.

The script retrieves sensitive information from Google Cloud's Secret Manager and uses it to configure a Snowflake cache. It's crucial to ensure that credentials_json and other sensitive data are handled securely and not exposed in logs or error messages.

Verification successful

Sensitive information handling and Snowflake cache setup verified.

The script does not expose sensitive information in logs or error messages. The print statement found only outputs the stream name and the number of records, which are not sensitive.

  • print(f"Stream {name}: {len(list(result[name]))} records") is safe as name is a static string and result is derived from a cache read operation.
Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Test: Search for logging statements that might expose sensitive information.
rg --type python -A 1 $'print|log' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 110


Script:

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Test: Search for logging statements that might expose sensitive information.
rg 'print|log' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 123


Script:

```bash
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Search for the definitions and usages of `name` and `result` in the specified file.
rg 'name|result' ./examples/run_perf_test_cache_snowflake.py
```

Length of output: 308


40-49: Data retrieval and processing.

The script configures an Airbyte source, selects streams, and reads data using the configured Snowflake cache. It's important to verify that the read method handles exceptions properly, especially since it interacts with external systems.

examples/run_bigquery_faker.py (1)

52-52: Improved handling of missing dataset_id.

The change to use .get with a default value for dataset_id is a good practice. It prevents the script from failing if dataset_id is not provided, making the script more robust.
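For illustration, a tiny sketch of this fallback pattern; the key name matches the review, but the config contents and default value here are assumptions, not the script's actual values.

```python
config = {"project_name": "my-gcp-project"}  # note: no "dataset_id" key provided
dataset_name = config.get("dataset_id", "pyairbyte_integtest")  # falls back to the default
print(dataset_name)  # -> pyairbyte_integtest
```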

airbyte/_util/name_normalizers.py (1)

52-52: Caching implementation in normalize method.

The addition of @functools.cache to the normalize method is an excellent use of Python's built-in caching mechanism. This should significantly improve performance by avoiding redundant calculations, especially in scenarios where normalization is frequently called with the same inputs.
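A minimal sketch of that caching pattern, assuming a staticmethod-style normalize and an illustrative normalization rule (the real rule lives in airbyte/_util/name_normalizers.py):

```python
import functools


class LowerCaseNormalizer:
    @staticmethod
    @functools.cache
    def normalize(name: str) -> str:
        # Illustrative rule only: lowercase and replace non-alphanumerics with "_".
        return "".join(c if c.isalnum() else "_" for c in name).lower()


# Repeated calls with the same column name hit the cache instead of recomputing.
assert LowerCaseNormalizer.normalize("User ID") == "user_id"
```

Because the cache key is just the input string, memory growth stays bounded by the number of distinct stream and column names seen.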

airbyte/_processors/sql/snowflake.py (2)

6-6: Validate the import of ThreadPoolExecutor.

Ensure that the ThreadPoolExecutor is used appropriately within the context of the file and that its import does not introduce unnecessary dependencies or complexity.


30-30: Confirm the usage of MAX_UPLOAD_THREADS.

The addition of the MAX_UPLOAD_THREADS constant is appropriate for controlling the number of concurrent threads. Verify that this constant is used consistently across the file and does not lead to resource exhaustion.

Verification successful

Confirmed the usage of MAX_UPLOAD_THREADS.

The constant MAX_UPLOAD_THREADS is used in the ThreadPoolExecutor to control the number of concurrent threads, which is appropriate and should help prevent resource exhaustion.

  • MAX_UPLOAD_THREADS is defined and used in ThreadPoolExecutor with max_workers=MAX_UPLOAD_THREADS.
Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.

# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg --type python 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
```

Length of output: 105


Script:

```bash
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.

# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
```

Length of output: 160

airbyte/_processors/file/base.py (1)

26-26: Confirm the impact of increased DEFAULT_BATCH_SIZE.

Increasing the DEFAULT_BATCH_SIZE could significantly affect memory usage and performance. It's important to ensure that this change has been thoroughly tested, especially in scenarios with large datasets.

airbyte/progress.py (1)

Line range hint 88-402:
Review changes to time formatting and type handling.

The changes to support float values for time calculations are crucial for precision. Ensure that these changes are consistent throughout the file and that they do not introduce any regressions in time tracking or display functions.

airbyte/_future_cdk/sql_processor.py (3)

165-165: Attribute _known_schemas_list added: Good for performance, consider adding maintenance comments.

The addition of _known_schemas_list is a positive change for performance as it reduces the number of database schema checks. However, consider adding comments on how and when this list should be updated or invalidated to ensure it remains accurate.


311-313: Optimized schema existence check: Good for performance, consider handling cache inconsistencies.

The method now first checks the cached list of known schemas which is a good optimization. However, consider adding error handling or a mechanism to handle situations where the cache might be stale or inaccurate.


381-397: Added force_refresh parameter to _get_schemas_list: Enhances flexibility in cache management.

The addition of force_refresh parameter allows for better control over schema list caching, which can be crucial in environments where schema changes are frequent. Consider adding unit tests to ensure that this parameter works as expected, especially in scenarios where the cache needs to be bypassed.
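A hedged sketch of the caching pattern these comments describe; the method and attribute names follow the review, but the bodies and SQL are illustrative assumptions rather than the processor's actual code.

```python
class SchemaCachingSqlProcessor:
    """Hypothetical sketch; not the actual SQL processor implementation."""

    def __init__(self) -> None:
        self._known_schemas_list: list[str] = []

    def _get_schemas_list(self, *, force_refresh: bool = False) -> list[str]:
        """Return cached schema names, re-querying the database only when needed."""
        if force_refresh or not self._known_schemas_list:
            self._known_schemas_list = self._query_schemas_from_db()
        return self._known_schemas_list

    def _ensure_schema_exists(self, schema_name: str) -> None:
        """Create the schema only if it is not already in the cached list."""
        if schema_name in self._get_schemas_list():
            return  # cache hit: no extra round trip to the database
        self._execute_sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
        self._known_schemas_list.append(schema_name)

    def _query_schemas_from_db(self) -> list[str]:
        return []  # placeholder for an information-schema query

    def _execute_sql(self, sql: str) -> None:
        print(f"Would execute: {sql}")  # placeholder for real SQL execution
```

Passing force_refresh=True bypasses the cache, which covers the stale-cache concern raised above.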

Review threads (resolved): .viztracerrc (outdated), examples/run_perf_test_reads.py (x2), airbyte/_processors/sql/snowflake.py (outdated)
coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c282122 and c85f8de.

Files selected for processing (1)
  • pyproject.toml (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml

@aaronsteers (Contributor, Author) commented Jul 17, 2024

/fix-pr

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made, those changes will be automatically committed and pushed back to the PR. (This job requires that the PR author has "Allow edits from maintainers" enabled.)

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c85f8de and b62b212.

Files selected for processing (2)
  • .viztracerrc (1 hunks)
  • airbyte/_processors/sql/snowflake.py (3 hunks)
Files skipped from review due to trivial changes (1)
  • .viztracerrc
Additional context used
Ruff
airbyte/_processors/sql/snowflake.py

136-136: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


136-136: Undefined name logging

(F821)

Additional comments not posted (2)
airbyte/_processors/sql/snowflake.py (2)

6-6: Import of ThreadPoolExecutor is appropriate for the context.

The import of ThreadPoolExecutor aligns with the PR's objective to improve performance through parallel processing. This is a good use of Python's concurrent programming capabilities to enhance data upload speeds.


30-31: Constant MAX_UPLOAD_THREADS is well-defined.

Setting a constant for the maximum number of upload threads is a good practice. It helps maintain the code's configurability and makes it easier to adjust the performance tuning.

Resolved review thread: airbyte/_processors/sql/snowflake.py (outdated)
@aaronsteers changed the title from "Feat: Multiple perf improvements, add performance profiling with viztracer" to "Feat: Multiple perf improvements and viztracer performance profiling" on Jul 17, 2024
coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b62b212 and b9b8e82.

Files selected for processing (4)
  • .gitignore (1 hunks)
  • airbyte/_processors/sql/snowflake.py (4 hunks)
  • airbyte/caches/bigquery.py (2 hunks)
  • examples/run_perf_test_reads.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .gitignore
Files skipped from review as they are similar to previous changes (2)
  • airbyte/_processors/sql/snowflake.py
  • examples/run_perf_test_reads.py
Additional context used
Learnings (1)
airbyte/caches/bigquery.py (1)
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#281
File: airbyte/caches/bigquery.py:40-43
Timestamp: 2024-07-09T04:12:04.672Z
Learning: The `BigQueryCache.get_arrow_dataset` method should have a docstring that correctly states the reason for the `NotImplementedError` as BigQuery not supporting `to_arrow`, instead of incorrectly mentioning `pd.read_sql_table`.
Additional comments not posted (2)
airbyte/caches/bigquery.py (2)

20-20: Approved: Import statement for NoReturn.

The import of NoReturn from the typing module is correctly placed and necessary for the changes in the get_arrow_dataset method.


41-41: Approved: Change in return type and updated exception handling.

Changing the return type from None to NoReturn clarifies that get_arrow_dataset is not intended to return any value and will always raise an exception. This is a good practice for functions that are designed to not return under normal operation.

Additionally, the docstring and the exception message have been updated to reflect the correct reason why get_arrow_dataset is not implemented, which aligns with the previous learning. The provided URL in the docstring is a good practice as it offers more context on the issue.
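A minimal sketch of the NoReturn pattern approved here; the class body and message wording are assumptions, not the actual BigQueryCache code.

```python
from typing import NoReturn


class BigQueryCacheSketch:
    def get_arrow_dataset(self, stream_name: str) -> NoReturn:
        """Always raises: BigQuery does not support `to_arrow` for this path."""
        raise NotImplementedError(
            "BigQuery does not support `to_arrow`, so an Arrow dataset cannot be "
            f"returned for stream {stream_name!r}."
        )
```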

@aaronsteers enabled auto-merge (squash) on July 17, 2024 16:51
@aaronsteers merged commit 1acd6df into main on Jul 17, 2024 (16 checks passed)
@aaronsteers deleted the feat/perf-improvements-and-profiling branch on July 17, 2024 16:55