🎉 Source S3 - memory & performance optimisations + advanced CSV options #6615
Conversation
/test connector=connectors/source-s3
I guess the main cause of the performance issues is the loading of file objects for filtering/sorting (by last_modified):
```python
def last_modified(self) -> datetime:
    """
    Using decorator set up boto3 session & s3 resource.
    Note: slight nuance for grabbing this when we have no credentials.

    :return: last_modified property of the blob/file
    """
    bucket = self._provider.get("bucket")
    try:
        obj = self._boto_s3_resource.Object(bucket, self.url)
        return obj.last_modified
```
So this code ends up fetching all file objects, including ones that are not relevant for incremental mode.
I propose moving this filtering/sorting logic into the bucket listing function:
```python
def _list_bucket(self, accept_key=lambda k: True) -> Iterator[str]:
    ...
    try:
        content = response["Contents"]
        # =======
        # content["LastModified"] is already available here,
        # so filtering/sorting can happen during listing
        # =======
        raise Exception(content)  # placeholder showing the data available at this point
    except KeyError:
        pass
    ...
```
This function should return only the relevant, sorted filepath values.
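As a rough sketch of the suggestion (hypothetical function name and a standalone boto3 client rather than the connector's provider-based session), the `LastModified` field that S3 already returns in the listing response can drive the filtering and sorting, avoiding a per-object request for every file:

```python
from datetime import datetime
from typing import Iterator, List, Optional, Tuple

import boto3


def list_bucket_keys_sorted(
    bucket: str, prefix: str = "", modified_since: Optional[datetime] = None
) -> Iterator[str]:
    """Yield object keys sorted by LastModified, skipping keys at or before
    `modified_since`, using only the listing response (no per-object calls)."""
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    entries: List[Tuple[datetime, str]] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for item in page.get("Contents", []):
            last_modified = item["LastModified"]  # already included in the listing
            if modified_since is None or last_modified > modified_since:
                entries.append((last_modified, item["Key"]))
    for _, key in sorted(entries):  # one in-memory sort over (datetime, key) pairs
        yield key
```

This way the connector never has to touch an object (e.g. via `s3.Object(...).last_modified`, which issues a per-object request) just to decide whether it is relevant for an incremental sync. The reply below explains why this wasn't adopted wholesale in this PR.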
I like the thinking, but there are a few reasons not to make that change:
I've already greatly reduced memory usage (4-5x) by storing just the filepaths rather than the objects, which seemed to be the problem there.
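As a rough illustration of that memory point (a sketch, not the PR's actual diff), the idea is to keep only the filepath strings in memory and build the heavier per-file objects lazily, one at a time:

```python
from typing import Iterator, List

import boto3

s3 = boto3.resource("s3")


def eager_objects(bucket: str, keys: List[str]) -> list:
    # Before (in spirit): one object wrapper per file, all held in memory at once.
    return [s3.Object(bucket, key) for key in keys]


def lazy_objects(bucket: str, keys: List[str]) -> Iterator:
    # After (in spirit): keep only the filepath strings; materialise each
    # per-file object on demand, so at most one heavy object is alive at a time.
    for key in keys:
        yield s3.Object(bucket, key)
```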
/test connector=connectors/source-s3
…m pyarrow ReadOptions
…3-csv-advanced-options' into george/fix-s3-oom
# Conflicts:
#   airbyte-integrations/connectors/source-s3/setup.py
#   airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py
#   docs/integrations/sources/s3.md
/test connector=connectors/source-s3
I'm afraid this Python code is too fancy for me to provide any useful review comments.
# Conflicts:
#   airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py
/publish connector=connectors/source-s3
/publish connector=connectors/source-s3
/publish connector=connectors/source-s3
Looks like the EC2 we are using for deployment is missing
/publish connector=connectors/source-s3
/publish connector=connectors/source-s3
…ns (airbytehq#6615)
* memory & performance optimisations
* address comments
* version bump
* added advanced_options for reading csv without header, and more custom pyarrow ReadOptions
* updated to use the latest airbyte-cdk
* updated docs
* bump source-s3 to 0.1.6
* remove unneeded lines
* Use the all dep ami for python builds.
* ec2-instance-id should be ec2-image-id
* ec2-instance-id should be ec2-image-id

Co-authored-by: Jingkun Zhuang <[email protected]>
Co-authored-by: Davin Chia <[email protected]>
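To make the "advanced_options … pyarrow ReadOptions" item above concrete, here is a hedged sketch (not the connector's exact code; the function name is hypothetical) of forwarding user-supplied JSON options into PyArrow's CSV reader, e.g. for a file without a header row:

```python
import json

import pyarrow as pa
from pyarrow import csv


def read_csv_table(path: str, advanced_options_json: str = "{}") -> pa.Table:
    """Merge user-supplied JSON options into pyarrow.csv.ReadOptions.

    For a headerless CSV, advanced_options_json could be
    '{"autogenerate_column_names": true}' (columns become f0, f1, ...)
    or '{"column_names": ["id", "name", "created_at"]}'.
    """
    advanced = json.loads(advanced_options_json or "{}")
    read_options = csv.ReadOptions(**advanced)
    return csv.read_csv(path, read_options=read_options)
```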
What
closes #6606
I'm fairly confident I've been able to make some small but impactful improvements here.
How
Changed `_get_master_schema()` so we can pass in the state on incremental runs and therefore only use new files to create our master schema (previously it used all of them). Since the schema will be saved in state from a previous run, the only time we run schema inference will be during `read`, where we can pass in the state. Additionally…
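A minimal sketch of that incremental-schema idea (hypothetical function, state keys, and file representation; the real stream.py is more involved): start from the schema saved in state and only inspect files modified after the state cursor, so a full inference pass only happens on the first run.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Mapping, Optional


def merge_master_schema(
    files: Iterable[Mapping[str, Any]],
    state: Optional[Mapping[str, Any]] = None,
) -> Dict[str, str]:
    """Merge per-file schemas into a master schema, skipping files already
    covered by the schema stored in state on incremental runs."""
    cutoff: Optional[datetime] = None
    master_schema: Dict[str, str] = {}
    if state:
        master_schema.update(state.get("schema", {}))
        if state.get("last_modified"):
            cutoff = datetime.fromisoformat(state["last_modified"])

    for f in files:  # each f: {"last_modified": datetime, "schema": {column: type}}
        if cutoff is not None and f["last_modified"] <= cutoff:
            continue  # already covered by the schema saved in state
        for column, dtype in f["schema"].items():
            master_schema.setdefault(column, dtype)
    return master_schema
```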