-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Faster per-record processing #301
Conversation
WalkthroughWalkthroughThe recent changes focus on refactoring the code to utilize a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant RecordProcessor
participant StreamRecordHandler
participant StreamRecord
Client->>RecordProcessor: Call process_record_message()
RecordProcessor->>StreamRecordHandler: Initialize handler with schema
RecordProcessor->>StreamRecord: Create StreamRecord with handler
StreamRecord->>StreamRecordHandler: Validate and process record
StreamRecordHandler-->>StreamRecord: Processed data
StreamRecord-->>RecordProcessor: Processed StreamRecord
RecordProcessor-->>Client: Return processed record
sequenceDiagram
participant TestSuite
participant TestCase
participant StreamRecordHandler
participant StreamRecord
TestSuite->>TestCase: Run test cases
TestCase->>StreamRecordHandler: Initialize handler with test schema
TestCase->>StreamRecord: Create StreamRecord with handler
StreamRecord->>StreamRecordHandler: Validate and process test record
StreamRecordHandler-->>StreamRecord: Processed test data
StreamRecord-->>TestCase: Return processed test record
TestCase-->>TestSuite: Test pass/fail result
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- airbyte/_future_cdk/record_processor.py (3 hunks)
- airbyte/_future_cdk/sql_processor.py (3 hunks)
- airbyte/_processors/file/base.py (3 hunks)
- airbyte/records.py (3 hunks)
- airbyte/sources/base.py (3 hunks)
Additional comments not posted (20)
airbyte/_processors/file/base.py (3)
16-16
: Import statement change is appropriate.The addition of
StreamRecordHandler
is necessary for the new functionality.
144-144
: Function signature change is appropriate.The function now accepts
stream_record_handler
instead ofstream_schema
, aligning with the new object-oriented approach.
169-169
: Ensure correct usage ofstream_record_handler
.The
stream_record_handler
is correctly used in theStreamRecord.from_record_message
call.airbyte/_future_cdk/record_processor.py (5)
26-26
: Import statement change is appropriate.The addition of
StreamRecordHandler
is necessary for the new functionality.
160-160
: Function signature change is appropriate.The function now accepts
stream_record_handler
instead ofstream_schema
, aligning with the new object-oriented approach.
184-184
: Variable replacement is appropriate.The variable
stream_schemas
has been replaced withstream_record_handlers
, aligning with the new object-oriented approach.
192-197
: Ensure correct instantiation ofStreamRecordHandler
.The
StreamRecordHandler
is correctly instantiated with the required parameters.
202-202
: Ensure correct usage ofstream_record_handler
.The
stream_record_handler
is correctly used in theprocess_record_message
call.airbyte/records.py (6)
77-77
: Import statement change is appropriate.The import statement now includes only
NameNormalizerBase
, which is necessary for the new functionality.
92-97
: New classStreamRecordHandler
is appropriate.The class introduction is necessary for the new object-oriented approach.
99-145
: Ensure correct implementation ofStreamRecordHandler
methods.The methods in
StreamRecordHandler
are correctly implemented to handle key normalization and processing.
194-221
: Ensure correct implementation ofStreamRecord
methods.The
__init__
method and other methods inStreamRecord
are correctly updated to handle field processing.
228-243
: Ensure correct usage ofstream_record_handler
infrom_record_message
.The
stream_record_handler
is correctly used in thefrom_record_message
method.
247-279
: Ensure correct implementation of dictionary operations inStreamRecord
.The methods
__getitem__
,__setitem__
,__delitem__
, and__contains__
are correctly updated to handle key processing and dictionary operations.airbyte/sources/base.py (3)
7-7
: Import statement change is appropriate.The addition of
StreamRecordHandler
is necessary for the new functionality.
456-461
: Ensure correct instantiation ofStreamRecordHandler
.The
StreamRecordHandler
is correctly instantiated with the required parameters.
466-466
: Ensure correct usage ofstream_record_handler
.The
stream_record_handler
is correctly used in theStreamRecord.from_record_message
call.airbyte/_future_cdk/sql_processor.py (3)
65-65
: Import statement forStreamRecordHandler
added.The import statement for
StreamRecordHandler
is correctly added.
231-231
: Function signature updated to usestream_record_handler
.The function signature is updated to replace
stream_schema
withstream_record_handler
.
242-242
: Function call updated to usestream_record_handler
.The function call to
process_record_message
is updated to usestream_record_handler
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- airbyte/_future_cdk/record_processor.py (3 hunks)
- airbyte/records.py (2 hunks)
- airbyte/sources/base.py (3 hunks)
- tests/unit_tests/test_text_normalization.py (4 hunks)
Files skipped from review as they are similar to previous changes (3)
- airbyte/_future_cdk/record_processor.py
- airbyte/records.py
- airbyte/sources/base.py
Additional comments not posted (7)
tests/unit_tests/test_text_normalization.py (7)
5-5
: Import statement addition looks good.The addition of
StreamRecordHandler
is consistent with the refactoring described.
8-17
: Fixture addition looks good.The
stream_json_schema
fixture is well-defined and provides a reusable JSON schema for the tests.
20-36
: Function modification looks good.The changes to
test_record_columns_list
are consistent with the refactoring, and the function correctly initializes and usesStreamRecordHandler
.
53-60
: Function modification looks good.The changes to
test_case_insensitive_dict
are consistent with the refactoring, and the function correctly initializes and usesStreamRecordHandler
with the specified parameters.
119-128
: Function modification looks good.The changes to
test_case_insensitive_dict_w
are consistent with the refactoring, and the function correctly initializes and usesStreamRecordHandler
with the updated schema.
154-163
: Function modification looks good.The changes to
test_case_insensitive_w_pretty_keys
are consistent with the refactoring, and the function correctly initializes and usesStreamRecordHandler
withnormalize_keys=False
.
216-220
: Function adjustment looks good.The changes to
test_lower_case_normalizer
improve readability without altering the logic.
This brings per-record parsing time down from approx.
20us
to roughly10us
. We accomplish this by decoupling stream-level logic from record-level processing, meaning we build the logic handler for each stream at the start of the stream, and then reuse the handler's logic on those individual records.Note:
10-13us
for each write to gzip file.write
is entirely one step performed by thegzip
module, the only options to speed up are: (a) write more records at once instead of one at a time, (b) use another library if there is a faster option, (c) do both a+b but also just migrate to something else entirely like parquet.Summary by CodeRabbit
New Features
StreamRecordHandler
to enhance stream record processing capabilities.Refactor
StreamRecordHandler
instead of dictionaries for stream schemas, improving schema handling and processing efficiency.Tests
StreamRecordHandler
with updated stream schema handling parameters.