[Concurrent Low-Code] ConcurrentDeclarativeSource class that low-code connectors can inherit from to uptake Concurrent CDK #46662
base: master
Changes from 12 commits
```diff
@@ -4,10 +4,11 @@
 import logging
 from abc import abstractmethod
-from typing import Any, Mapping, Tuple
+from typing import Any, List, Mapping, Tuple

 from airbyte_cdk.sources.abstract_source import AbstractSource
 from airbyte_cdk.sources.declarative.checks.connection_checker import ConnectionChecker
+from airbyte_cdk.sources.streams import Stream


 class DeclarativeSource(AbstractSource):
@@ -32,3 +33,6 @@ def check_connection(self, logger: logging.Logger, config: Mapping[str, Any]) ->
         The error object will be cast to string to display the problem to the user.
         """
         return self.connection_checker.check_connection(self, logger, config)
+
+    def all_streams(self, config: Mapping[str, Any]) -> List[Stream]:
+        return self.streams(config=config)
```
**Review comment on `all_streams`:** For non-concurrent low-code sources, these are equivalent, but we override this implementation in …
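The pattern discussed above can be sketched in miniature. This is not the actual Airbyte CDK code; the class and stream names below are illustrative stand-ins. The base class makes `all_streams()` a simple alias for `streams()`, while a concurrent subclass overrides `all_streams()` to also expose streams that the synchronous code path must not see.

```python
# Minimal sketch (hypothetical classes, not the Airbyte CDK) of the
# streams()/all_streams() split described in the review comment.
from typing import Any, List, Mapping


class Stream:
    def __init__(self, name: str, concurrent: bool = False):
        self.name = name
        self.concurrent = concurrent


class DeclarativeSourceSketch:
    def streams(self, config: Mapping[str, Any]) -> List[Stream]:
        # Only the streams that the synchronous read path can handle.
        return [Stream("users")]

    def all_streams(self, config: Mapping[str, Any]) -> List[Stream]:
        # For non-concurrent sources the two methods are equivalent.
        return self.streams(config=config)


class ConcurrentSourceSketch(DeclarativeSourceSketch):
    def all_streams(self, config: Mapping[str, Any]) -> List[Stream]:
        # The concurrent subclass adds streams the sync path would reject.
        return self.streams(config=config) + [Stream("events", concurrent=True)]
```

Under this sketch, callers on the legacy synchronous path keep calling `streams()` unchanged, while concurrent-aware callers switch to `all_streams()`.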
```diff
@@ -5,6 +5,7 @@
 import copy
 import json
 import logging
+from datetime import datetime
 from functools import lru_cache
 from typing import Any, Iterable, List, Mapping, MutableMapping, Optional, Tuple, Union

@@ -24,6 +25,7 @@
 from airbyte_cdk.sources.streams.concurrent.partitions.partition import Partition
 from airbyte_cdk.sources.streams.concurrent.partitions.partition_generator import PartitionGenerator
 from airbyte_cdk.sources.streams.concurrent.partitions.record import Record
+from airbyte_cdk.sources.streams.concurrent.state_converters.datetime_stream_state_converter import DateTimeStreamStateConverter
 from airbyte_cdk.sources.streams.core import StreamData
 from airbyte_cdk.sources.types import StreamSlice
 from airbyte_cdk.sources.utils.schema_helpers import InternalConfig
```
```diff
@@ -203,6 +205,11 @@ class SliceEncoder(json.JSONEncoder):
     def default(self, obj: Any) -> Any:
         if hasattr(obj, "__json_serializable__"):
             return obj.__json_serializable__()
+
+        # This needs to be revisited as we can't lose precision
+        if isinstance(obj, datetime):
+            return list(obj.timetuple())[0:6]
+
         # Let the base class default method raise the TypeError
         return super().default(obj)
```

**Review comment:** Should we set 6 as a variable to avoid using magic numbers?

**Author:** Yes, thank you for calling this out. I had put this in as a placeholder while I was working through getting this tested the first time around, and it needs to be reinvestigated/fixed.

**Review comment:** I don't follow this: is there somewhere we would be serializing a datetime object?

**Author:** I'll double check, but I think we needed this serializer deeper in our code, potentially in how we emit state back out. I'll reconfirm this as I work through @lazebnyi's comment above.

**Review comment:** I'm also curious about this. From my understanding, this would mean that CursorPartitionGenerator would create slices with datetime objects in them, but this is not what I see.

**Author:** @lazebnyi @pnilan @maxi297, to close the loop on this one: I think what originally happened was that I wrote this in when I was first testing, because we were getting the datetime object from the ConcurrentCursor and it would fail trying to serialize it. However, later, after cleaning up the code and fixing edge cases, I addressed the serialization by converting the datetime into the correct output string format, with the correct precision, in the `generate()` function here: https://github.com/airbytehq/airbyte/pull/46662/files#diff-93127bface0b323fe43b21cdb8fb14493dd465995b085a4f81647f3697930bddR396-R399. Since the value is now already a string, we don't need to convert it again. I'll get rid of this code, as it's not actually used anymore and we're applying the correct precision based on the cursor definition.
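The precision concern flagged above is easy to demonstrate in isolation: `datetime.timetuple()` carries only whole seconds, so slicing its first six fields silently drops microseconds. The encoder below is a standalone sketch mirroring the placeholder in the diff, not the CDK's actual `SliceEncoder`.

```python
# Standalone demonstration of why "we can't lose precision":
# timetuple() keeps only (year, month, day, hour, minute, second),
# so the microseconds component never survives encoding.
import json
from datetime import datetime


class PlaceholderSliceEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            # Same placeholder logic as in the diff above.
            return list(obj.timetuple())[0:6]
        return super().default(obj)


slice_end = datetime(2024, 1, 2, 3, 4, 5, 123456)
encoded = json.dumps({"end": slice_end}, cls=PlaceholderSliceEncoder)
# encoded is '{"end": [2024, 1, 2, 3, 4, 5]}': the 123456 microseconds are gone.
```

This is consistent with the author's eventual fix: format the datetime into a cursor-appropriate string before it ever reaches the JSON encoder.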
```diff
@@ -341,12 +348,17 @@ class CursorPartitionGenerator(PartitionGenerator):
     across partitions. Each partition represents a subset of the stream's data and is determined by the cursor's state.
     """

+    _START_BOUNDARY = 0
+    _END_BOUNDARY = 1
+
     def __init__(
         self,
         stream: Stream,
         message_repository: MessageRepository,
         cursor: Cursor,
+        connector_state_converter: DateTimeStreamStateConverter,
         cursor_field: Optional[List[str]],
+        slice_boundary_fields: Optional[Tuple[str, str]],
     ):
         """
         Initialize the CursorPartitionGenerator with a stream, sync mode, and cursor.

@@ -362,6 +374,8 @@ def __init__(
         self._cursor = cursor
         self._cursor_field = cursor_field
         self._state = self._cursor.state
+        self._slice_boundary_fields = slice_boundary_fields
+        self._connector_state_converter = connector_state_converter

     def generate(self) -> Iterable[Partition]:
         """
```
```diff
@@ -372,8 +386,19 @@ def generate(self) -> Iterable[Partition]:

         :return: An iterable of StreamPartition objects.
         """
-        for slice_start, slice_end in self._cursor.generate_slices():
-            stream_slice = StreamSlice(partition={}, cursor_slice={"start": slice_start, "end": slice_end})
+
+        start_boundary = self._slice_boundary_fields[self._START_BOUNDARY] if self._slice_boundary_fields else "start"
+        end_boundary = self._slice_boundary_fields[self._END_BOUNDARY] if self._slice_boundary_fields else "end"
+
+        wam = list(self._cursor.generate_slices())
+        for slice_start, slice_end in wam:
+            stream_slice = StreamSlice(
+                partition={},
+                cursor_slice={
+                    start_boundary: self._connector_state_converter.output_format(slice_start),
+                    end_boundary: self._connector_state_converter.output_format(slice_end),
+                },
+            )

             yield StreamPartition(
                 self._stream,
@@ -386,7 +411,7 @@ def generate(self) -> Iterable[Partition]:
             )


-@deprecated("This class is experimental. Use at your own risk.", category=ExperimentalClassWarning)
+@deprecated("Availability strategy has been soft deprecated. Do not use. Class is subject to removal", category=ExperimentalClassWarning)
 class AvailabilityStrategyFacade(AvailabilityStrategy):
     def __init__(self, abstract_availability_strategy: AbstractAvailabilityStrategy):
         self._abstract_availability_strategy = abstract_availability_strategy
```

**Review comment on the `start_boundary` line:** For current implementation of datetime cursor …

**Review comment on the `wam` list:** nit: It seems like it would be more memory efficient to directly iterate over the generated slices; is there a specific reason for saving to a list?

**Author:** Correct. I had originally added this for debugging, to see the entire set of slices more easily, but you are right that this should stay iterable.
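The reviewer's nit above can be sketched with a toy generator. `generate_slices` here is a hypothetical stand-in for the cursor's slice generator, not the CDK's `ConcurrentCursor.generate_slices()`: wrapping it in `list(...)` forces every slice into memory at once, while iterating it directly keeps only one slice alive at a time.

```python
# Toy illustration of list-materialization vs. lazy iteration over slices.
# generate_slices is a stand-in for a cursor's slice generator.
from typing import Iterable, Tuple


def generate_slices(n: int) -> Iterable[Tuple[int, int]]:
    # Yields (start, end) boundary pairs lazily, one at a time.
    for i in range(n):
        yield (i, i + 1)


# Debug-friendly but memory-hungry: every slice is held at once.
materialized = list(generate_slices(3))

# Preferred: consume the generator directly; nothing is buffered.
streamed = []
for slice_start, slice_end in generate_slices(3):
    streamed.append((slice_start, slice_end))

# Both approaches see the same slices; only the memory profile differs.
```

For a small number of slices the difference is negligible, which is presumably why the list survived debugging unnoticed; for a long date range split into many slices, the lazy form is the safer default.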
```diff
@@ -67,6 +67,7 @@ def check_availability(self, logger: logging.Logger) -> StreamAvailability:
     """


 @deprecated("This class is experimental. Use at your own risk.", category=ExperimentalClassWarning)
 class AlwaysAvailableAvailabilityStrategy(AbstractAvailabilityStrategy):
     """
     An availability strategy that always indicates a stream is available.
```

**Review comment on the `@deprecated` decorator:** @brianjlai probably better to call out that it should not be used at all, if we're ripping out availability strategies over the mid-to-long term?

**Author:** Ah yeah, I think I carried that over from a merge of Serhii's work, which deprecated availability strategy in concurrent. I'll update this to say "do not use".
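Since the diff only shows the docstring of `AlwaysAvailableAvailabilityStrategy`, here is a hedged sketch of what an always-available strategy typically looks like. The base types below (`StreamAvailabilitySketch`, `AbstractAvailabilityStrategySketch`) are simplified stand-ins, not the real CDK interfaces.

```python
# Hedged sketch of an "always available" availability strategy.
# The real AbstractAvailabilityStrategy / StreamAvailability interfaces are
# not shown in this diff, so these base types are simplified stand-ins.
from abc import ABC, abstractmethod


class StreamAvailabilitySketch:
    def __init__(self, available: bool, message: str = ""):
        self.available = available
        self.message = message


class AbstractAvailabilityStrategySketch(ABC):
    @abstractmethod
    def check_availability(self, logger) -> StreamAvailabilitySketch:
        ...


class AlwaysAvailableSketch(AbstractAvailabilityStrategySketch):
    """Always reports the stream as available, skipping any real probe."""

    def check_availability(self, logger) -> StreamAvailabilitySketch:
        # No network call, no credentials check: unconditionally available.
        return StreamAvailabilitySketch(available=True)
```

This matches the direction of the thread above: with availability strategies being phased out, a no-op strategy that always returns "available" is the neutral default.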
|
**Review comment on `all_streams`:** Teach me your ways: what is the difference between `streams` and `all_streams`? (Ignore if it's answered in this PR; still reading through it.)

**Author:** Good question, I mention it in a comment: https://github.com/airbytehq/airbyte/pull/46662/files#r1792816384. But the reason why we can't just rewrite the `streams()` method is that within the existing Python CDK `core.py`, when processing synchronous streams, we invoke the `streams()` method, and in that context we don't want to return the concurrent streams that aren't compatible in that area of code.

**Review comment:** As we discussed earlier, it's preferable to use the `streams` method and condition the behavior accordingly. This approach adds some complexity, but it provides a tradeoff by allowing simpler modifications later. With this setup, when the core is able to handle concurrent streams, we'll get a stream generation interface for free.

**Author:** Yep, this is addressed in my latest commit using the optional param `include_concurrent_streams`.
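The resolution above can be sketched as a single `streams()` entry point with an opt-in flag. The parameter name `include_concurrent_streams` comes from the author's comment; everything else here (class name, stream names, return type) is illustrative, not the actual CDK signature.

```python
# Sketch of the agreed approach: one streams() method whose behavior is
# conditioned on an optional include_concurrent_streams flag, so the legacy
# synchronous path never sees streams it cannot process.
from typing import Any, List, Mapping


class SourceSketch:
    def streams(
        self,
        config: Mapping[str, Any],
        include_concurrent_streams: bool = False,
    ) -> List[str]:
        synchronous = ["users"]
        concurrent = ["events"]
        # Legacy callers use streams(config) and get only synchronous streams;
        # concurrent-aware callers opt in explicitly.
        return synchronous + (concurrent if include_concurrent_streams else [])
```

Defaulting the flag to `False` keeps the existing `core.py` call sites working unchanged, while leaving a single method to simplify once the core can handle concurrent streams natively.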