SDK support for Database-type Streams #74

Closed
MeltyBot opened this issue Mar 31, 2021 · 1 comment

MeltyBot commented Mar 31, 2021

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/74

Originally created by @aaronsteers on 2021-03-31 18:16:33


This feature would add formal support for Database type streams.

Background

  • We removed database-type streams from the initial 0.1.0 release because they were a lower priority than API-type streams.
  • Unlike SaaS taps, for DB-type taps, we can assume:
    1. Catalogs are less stable. We should expect cached catalogs will require explicit refreshing.
    2. Catalog detection should be decoupled from Stream definition. We expect the catalog to be defined by querying the information_schema or similar - and it should be much more performant to query this at the DB-level or schema-level, versus individually for each stream/table.
    3. Fewer Stream classes are needed. One Stream class per source or one per extraction type is probably sufficient.
    4. Catalogs are more authoritative. If a catalog declares a table and column, the Tap should assume it exists as defined in the catalog.
      1. Types need to be overridable. More so than with other stream types, database-type sources have a history of type incompatibilities. When custom type declarations are provided in the input catalog, we should take care to apply them as given.
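
To illustrate point 4, here is a minimal sketch of how declarations from an input catalog could take precedence over discovered types. The function name `apply_catalog_overrides` and the simplified schema shape (a dict with a `properties` mapping) are assumptions made for this example, not part of the SDK.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_catalog_overrides(
    discovered_schema: Dict[str, Any], input_schema: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge two schemas so that declarations from the input catalog win.

    Hypothetical helper: columns declared in the input catalog replace the
    discovered definition, and columns that discovery did not report are
    still kept, since the catalog is treated as authoritative.
    """
    merged = deepcopy(discovered_schema)
    declared = deepcopy(input_schema.get("properties", {}))
    merged.setdefault("properties", {}).update(declared)
    return merged


# Illustrative data only.
discovered = {"properties": {"id": {"type": "integer"}, "amount": {"type": "number"}}}
declared = {"properties": {"amount": {"type": "string"}, "note": {"type": "string"}}}
merged = apply_catalog_overrides(discovered, declared)
```

Here `amount` ends up with the custom `string` type from the input catalog, `id` keeps its discovered type, and `note` is accepted even though discovery did not report it.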

Implementation Proposal

Decoupling Streams from Catalog Discovery

Instead of stream objects reporting back their own schema, a SQLCatalogDiscovery class (or similar) will create a full catalog. Then, instances of the appropriate Stream class can be instantiated from the discovered catalog. This also means that if an input catalog is provided, we can skip the discovery process entirely.
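
A rough sketch of that control flow, assuming a hypothetical `SQLStream` class and a `build_streams()` helper (neither exists in the SDK today). Discovery is passed in as a callable so that it runs at most once, and not at all when an input catalog is supplied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

CatalogEntry = Dict[str, Any]


@dataclass
class SQLStream:
    """Hypothetical generic stream, built from a single catalog entry."""

    name: str
    schema: Dict[str, Any]


def build_streams(
    discover: Callable[[], List[CatalogEntry]],
    input_catalog: Optional[List[CatalogEntry]] = None,
) -> List[SQLStream]:
    """Instantiate streams from a catalog, discovering only when needed."""
    # An input catalog is authoritative, so discovery is skipped entirely.
    catalog = input_catalog if input_catalog is not None else discover()
    return [SQLStream(entry["tap_stream_id"], entry["schema"]) for entry in catalog]


# Stand-in for a real SQLCatalogDiscovery; it records how often it is called.
discovery_calls: List[str] = []


def fake_discover() -> List[CatalogEntry]:
    discovery_calls.append("called")
    return [{"tap_stream_id": "public-users", "schema": {"properties": {}}}]


discovered_streams = build_streams(fake_discover)
cached_streams = build_streams(
    fake_discover,
    input_catalog=[{"tap_stream_id": "public-orders", "schema": {"properties": {}}}],
)
```

The second call never touches `fake_discover`, which is the behavior we want when a cached catalog is passed in.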

Docs links:

Entry-Level capabilities with SQLAlchemy

The SQLAlchemy library (powered by DB-API 2.0) can provide out-of-the-box catalog discovery as well as entry-level select capabilities. This means that, given a valid SQLAlchemy driver and connection URI, we can provide generic discovery and get_records() capabilities.

  • NOTE: We should probably develop a generic tap-sqlalchemy or tap-dbapi which relies purely on generic SQLAlchemy constructs. This would never be as performant as a custom-built tap, but it would be good in general for interop purposes.
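
As a sketch of what the generic path might look like, the snippet below uses SQLAlchemy's runtime inspection and table reflection (assuming SQLAlchemy 1.4 or later). The function names `discover_catalog` and `get_records`, the flat catalog shape, and the in-memory SQLite database are assumptions for illustration only.

```python
from typing import Any, Dict, Iterator

from sqlalchemy import MetaData, Table, create_engine, inspect, select, text
from sqlalchemy.engine import Engine


def discover_catalog(engine: Engine) -> Dict[str, Dict[str, str]]:
    """Map each table name to its columns and their reported SQL types."""
    inspector = inspect(engine)
    return {
        table_name: {
            column["name"]: str(column["type"])
            for column in inspector.get_columns(table_name)
        }
        for table_name in inspector.get_table_names()
    }


def get_records(engine: Engine, table_name: str) -> Iterator[Dict[str, Any]]:
    """Yield every row of a table as a dict, using a plain SELECT."""
    table = Table(table_name, MetaData(), autoload_with=engine)
    with engine.connect() as conn:
        for row in conn.execute(select(table)):
            yield dict(row._mapping)


# In-memory SQLite database, used here only to make the sketch runnable.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text("INSERT INTO users (id, name) VALUES (1, 'ada')"))

catalog = discover_catalog(engine)
records = list(get_records(engine, "users"))
```

Nothing here is specific to SQLite apart from the connection URI, which is the point: any database with a SQLAlchemy dialect that supports inspection should work the same way.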

Developers Provide Performance improvements

There is probably little case for overriding discovery unless SQLAlchemy does not support Inspection for the database type in question. Since catalog discovery is generally not a performance bottleneck, and since it is likely to be cached anyway, a generic implementation should be "good enough" for 95% of DB types.

Provide developers with bulk-based performance boosts

The SQLAlchemy library will be good at generic selects but will not be able to take advantage of batching capabilities. I suggest we tap into the discussions around #9 (batch message type) to allow developers to define their batch capabilities.
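
One possible shape for this, sketched under the assumption that batching is exposed as an overridable method: a hypothetical `GenericSQLStream` provides a default `get_batches()` that simply chunks `get_records()`, and a developer could override it with a native bulk export. None of these names come from the SDK or from the #9 discussion.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


class GenericSQLStream:
    """Hypothetical base class with row-by-row extraction as the default."""

    def __init__(self, rows: Iterable[Record]) -> None:
        # Stand-in for a real query result.
        self._rows = list(rows)

    def get_records(self) -> Iterator[Record]:
        yield from self._rows

    def get_batches(self, batch_size: int) -> Iterator[List[Record]]:
        """Default batching: group records from get_records() into lists.

        A database-specific subclass could override this method to use a
        native bulk unload instead of selecting rows one at a time.
        """
        batch: List[Record] = []
        for record in self.get_records():
            batch.append(record)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            # Emit the final, possibly smaller, batch.
            yield batch


stream = GenericSQLStream({"id": i} for i in range(5))
batches = list(stream.get_batches(batch_size=2))
```

With five records and a batch size of two, this produces two full batches followed by one batch holding the remaining record.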
