
Add dagster-azure package #2483

Merged: sryza merged 24 commits from sd2k:dagster-azure into dagster-io:master on Jun 3, 2020

Conversation

@sd2k (Contributor) commented May 19, 2020

This PR adds the dagster-azure package which provides various storage component implementations using Azure Data Lake Storage. New components are:

  • ADLS2FileCache and adls2_file_cache
  • ADLS2FileManager
  • ADLS2IntermediateStore
  • ADLS2ObjectStore
  • the adls2_resource providing direct access to Azure Data Lake Storage
  • the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference being configuration: Azure's SDK requires credentials to be passed explicitly, so the credential is expected in configuration.
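For illustration, the resource configuration might look roughly like this (a hedged sketch; the exact config schema field names are assumptions rather than quotes from the package):

```python
# Hypothetical run configuration for a pipeline using the adls2 resource and
# system storage. Field names ("storage_account", "credential", "adls2_file_system")
# are assumptions; check the adls2_resource config schema for the real shape.
config = {
    "resources": {
        "adls2": {
            "config": {
                "storage_account": "my-storage-account",
                # Unlike boto3, the Azure SDK does not discover credentials
                # implicitly, so a shared key (or SAS token) is passed explicitly.
                "credential": {"key": {"env": "AZURE_STORAGE_ACCOUNT_KEY"}},
            }
        }
    },
    "storage": {"adls2": {"config": {"adls2_file_system": "my-filesystem"}}},
}
```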

Tests currently require an access key to complete any tests marked 'nettest'; I guess these will need to be pointed at a new Azure storage account under dagster's control at some point.

One other issue I just remembered is a dependency version conflict with the snowflake-connector-python package, which is being tracked in snowflakedb/snowflake-connector-python#284. Not quite sure how to resolve that...

@sd2k force-pushed the dagster-azure branch from 9e39884 to 51ccc9b on May 20, 2020 16:13
@sd2k marked this pull request as ready for review on May 20, 2020 16:18
@sd2k force-pushed the dagster-azure branch from 51ccc9b to 03e9468 on May 20, 2020 16:27
@sd2k changed the title from "WIP - add dagster-azure package" to "Add dagster-azure package" on May 20, 2020

```python
file = self.file_system_client.create_file(key)
with file.acquire_lease(self.lease_duration) as lease:
    with BytesIO() as bytes_io:
```
Review comment from a Contributor:

Since this is used in s3/object_store, adls2/object_store, and maybe blob/object_store as well, I think it's worth factoring out into a shared helper.
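For illustration, such a shared helper might look roughly like this (a sketch; the helper name and the serialization-strategy interface usage are assumptions, not the factoring that was actually chosen):

```python
import pickle
from io import BytesIO


def serialize_to_bytes_io(obj, serialization_strategy=None):
    """Hypothetical shared helper for the s3/adls2/blob object stores: serialize
    `obj` into an in-memory buffer that each store can hand to its upload API."""
    bytes_io = BytesIO()
    if serialization_strategy is not None:
        # Assumes the strategy exposes serialize(value, file_obj), writing to a
        # file-like object.
        serialization_strategy.serialize(obj, bytes_io)
    else:
        pickle.dump(obj, bytes_io, protocol=pickle.HIGHEST_PROTOCOL)
    bytes_io.seek(0)
    return bytes_io
```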

@sd2k (Contributor, Author) replied May 20, 2020 via email (reply quoted in the following comment)

@sryza (Contributor) commented May 20, 2020

> I think I'll remove the comment for now and create a separate PR later for these, unless you'd rather have them all at once?

Sounds great. In general, it's easier for us to review and merge work when it's broken up into smaller changes. blob_fake_resource.py should probably be removed too, right?

@sd2k (Contributor, Author) commented May 21, 2020

> > I think I'll remove the comment for now and create a separate PR later for these, unless you'd rather have them all at once?
>
> Sounds great. In general, it's easier for us to review and merge work when it's broken up into smaller changes. blob_fake_resource.py should probably be removed too, right?

blob_fake_resource.py is required by some other tests unfortunately. ObjectStore requires subclasses to implement cp_object, but the Azure Data Lake APIs don't provide any methods to copy objects. The underlying storage can also be accessed using the Blob Storage APIs though, which do include a method to copy objects, so I needed to use those to implement cp_object. That's what the blob_fake_resource module is for (it provides a mock Blob Storage API).

That has raised another point though. If a user were to use the create_adls2_fake_resource to provide the adls2 resource to a pipeline (like the airline demo does for S3), the storage system's ADLS2ObjectStore would correctly use the fake ADLS2 resource, but would incorrectly attempt to create a real blob storage client rather than using a fake one (see dagster_azure.adls.ADLS2ObjectStore.__init__). That's because ADLS2ObjectStore.cp_object is the only thing which actually needs to use the blob storage APIs, so it just creates its own client reusing the credentials/config of the ADLS2 client. I think this should work:

```python
self.adls2_client = client
self.file_system_client = self.adls2_client.get_file_system_client(file_system)
# We also need a blob client to handle copying as ADLS doesn't have a copy API yet
if isinstance(client, ADLS2FakeClient):
    self.blob_client = create_blob_fake_resource(client.account_name)
else:
    self.blob_client = create_blob_client(
        client.account_name,
        # client.credential is non-null if a secret key was used to authenticate,
        # otherwise the SAS token will be in the query string of the URL
        client.credential.account_key
        if client.credential is not None
        else urlparse(client.url).query,
    )
self.blob_container_client = self.blob_client.get_container_client(file_system)
```

which I don't love, but it's not too bad. It feels more palatable if it's extracted into a function like _blob_client_for_adls2_client.
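Extracted that way, the helper might read something like this (a sketch built from the snippet above; ADLS2FakeClient, create_blob_fake_resource, and create_blob_client are the names used earlier in this thread, imported from dagster_azure):

```python
from urllib.parse import urlparse


def _blob_client_for_adls2_client(client):
    """Build a blob client (real or fake) reusing the credentials/config of an
    existing ADLS2 client, since only cp_object needs the Blob Storage APIs."""
    if isinstance(client, ADLS2FakeClient):
        return create_blob_fake_resource(client.account_name)
    return create_blob_client(
        client.account_name,
        # credential is non-null if a secret key was used to authenticate;
        # otherwise the SAS token is in the query string of the URL.
        client.credential.account_key
        if client.credential is not None
        else urlparse(client.url).query,
    )
```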

@sd2k force-pushed the dagster-azure branch 2 times, most recently from 5edc68a to dea3dbb, on May 21, 2020 07:50
@sryza (Contributor) commented May 21, 2020

> blob_fake_resource.py is required by some other tests unfortunately.

Got it, makes sense.

> If a user were to use the create_adls2_fake_resource to provide the adls2 resource to a pipeline (like the airline demo does for S3), the storage system's ADLS2ObjectStore would correctly use the fake ADLS2 resource, but would incorrectly attempt to create a real blob storage client rather than using a fake one (see dagster_azure.adls.ADLS2ObjectStore.__init__).

Would we be able to address this by having adls2_resource return an object that wraps the ADLS2Client and has both get_adls2_client and get_blob_client methods? Maybe called an ADLS2Resource or something? (And there could be a FakeADLS2Resource for tests and the airline demo).
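A rough sketch of that shape (class and method names taken from the suggestion above; the client-construction helpers are assumed, not confirmed from the final code):

```python
class ADLS2Resource:
    """Wraps an ADLS2 client plus a blob client built from the same credentials,
    so consumers never need to construct the blob client themselves."""

    def __init__(self, storage_account, credential):
        # create_adls2_client / create_blob_client are assumed helpers in
        # dagster_azure that build SDK clients from an account name + credential.
        self._adls2_client = create_adls2_client(storage_account, credential)
        self._blob_client = create_blob_client(storage_account, credential)

    def get_adls2_client(self):
        return self._adls2_client

    def get_blob_client(self):
        return self._blob_client


class FakeADLS2Resource(ADLS2Resource):
    """Counterpart for tests and the airline demo, backed by in-memory fakes."""

    def __init__(self, storage_account):
        self._adls2_client = create_adls2_fake_resource(storage_account)
        self._blob_client = create_blob_fake_resource(storage_account)
```

ADLS2ObjectStore could then accept the resource and call get_blob_client() in cp_object instead of constructing its own blob client.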

@sryza (Contributor) commented May 21, 2020

This is looking close! I kicked off a test run, and it looks like there are some lint errors and test failures. Let me know if you have any trouble viewing those.

@sd2k (Contributor, Author) commented May 21, 2020

I've resolved most comments now so I think it's close to ready! Tests seem to be passing on CI but I can't tell where the lint errors are: running make lint from my repository root gives thousands of lines of errors!

A couple of outstanding things:

  • the dependency conflict with the snowflake-connector-python package, which requires azure-storage-blob<12.0
  • this comment
  • Squash?

@sryza (Contributor) commented May 21, 2020

Regarding the lint issues, are you able to access this link?

https://buildkite.com/dagster/dagster/builds/12880#c98133bb-0a6a-4eeb-b1bc-753551c951bc

Here's what I'm seeing:


```
************* Module dagster_azure.adls2.file_cache
python_modules/libraries/dagster-azure/dagster_azure/adls2/file_cache.py:1:0: E0611: No name 'core' in module 'azure' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/adls2/file_cache.py:1:0: E0401: Unable to import 'azure.core.exceptions' (import-error)
************* Module dagster_azure.adls2.fake_adls2_resource
python_modules/libraries/dagster-azure/dagster_azure/adls2/fake_adls2_resource.py:6:0: E0611: No name 'core' in module 'azure' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/adls2/fake_adls2_resource.py:6:0: E0401: Unable to import 'azure.core.exceptions' (import-error)
************* Module dagster_azure.adls2.object_store
python_modules/libraries/dagster-azure/dagster_azure/adls2/object_store.py:6:0: E0611: No name 'core' in module 'azure' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/adls2/object_store.py:6:0: E0401: Unable to import 'azure.core.exceptions' (import-error)
************* Module dagster_azure.adls2.utils
python_modules/libraries/dagster-azure/dagster_azure/adls2/utils.py:1:0: E0611: No name 'filedatalake' in module 'azure.storage' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/adls2/utils.py:1:0: E0401: Unable to import 'azure.storage.filedatalake' (import-error)
************* Module dagster_azure.blob.fake_blob_client
python_modules/libraries/dagster-azure/dagster_azure/blob/fake_blob_client.py:6:0: E0611: No name 'core' in module 'azure' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/blob/fake_blob_client.py:6:0: E0401: Unable to import 'azure.core.exceptions' (import-error)
************* Module dagster_azure.blob.compute_log_manager
python_modules/libraries/dagster-azure/dagster_azure/blob/compute_log_manager.py:5:0: E0611: No name 'generate_blob_sas' in module 'azure.storage.blob' (no-name-in-module)
************* Module dagster_azure.blob.utils
python_modules/libraries/dagster-azure/dagster_azure/blob/utils.py:1:0: E0611: No name 'BlobServiceClient' in module 'azure.storage.blob' (no-name-in-module)
************* Module dagster_azure.blob.object_store
python_modules/libraries/dagster-azure/dagster_azure/blob/object_store.py:6:0: E0611: No name 'core' in module 'azure' (no-name-in-module)
python_modules/libraries/dagster-azure/dagster_azure/blob/object_store.py:6:0: E0401: Unable to import 'azure.core.exceptions' (import-error)
************* Module dagster_azure_tests.adls2_tests.test_intermediate_store
python_modules/libraries/dagster-azure/dagster_azure_tests/adls2_tests/test_intermediate_store.py:102:4: W0612: Unused variable 'err' (unused-variable)
```

For the "isort" issues, you should be able to run make isort from the project root.

@sryza (Contributor) commented May 21, 2020

Oof, this snowflake dependency issue is tough. Where do you end up observing the error? In dagster-examples? dagster-azure and dagster-snowflake don't depend on each other, right?

@sd2k (Contributor, Author) commented May 22, 2020

> For the "isort" issues, you should be able to run make isort from the project root.

Cheers, done and committed.

> Regarding the lint issues, are you able to access this link?

Yep, I think they may be because I hadn't included dagster-azure in the install_dev_python_modules make target, which I've pushed a fix for now.

> Oof, this snowflake dependency issue is tough. Where do you end up observing the error? In dagster-examples? dagster-azure and dagster-snowflake don't depend on each other, right?

Yeah, it was just in dagster-examples, although it might also happen when installing the dev python modules now, since both packages will be included there...

@sryza (Contributor) commented May 23, 2020

I just kicked off another test build. Aside from the snowflake issue, this LGTM! When you have some time, ping me on Slack and we can brainstorm on what to do about snowflake.

@sd2k (Contributor, Author) commented May 28, 2020

@sryza and I discussed the dependency conflict between snowflake-connector-python and dagster-azure on Slack and found:

  • the issue occurs at import-time and lint-time when both dagster-azure and dagster-snowflake are installed alongside each other
  • this generally only occurs when people run make install_dev_python_modules e.g. for development, to run examples, or to run lints in CI
  • we can stop this happening in CI by running lints against each package separately (as for tests), rather than installing all packages together. I believe this change has now been made, which unblocks failing CI (for this reason, at least)
  • otherwise there's no real solution to the conflicting dependencies between the two packages, so we need to add a warning that dagster-azure can't be used alongside dagster-snowflake for now, and remove dagster-azure from the install_dev_python_modules Make target (with a comment so it isn't added back in prematurely)

The lack of a solution isn't great, but until snowflakedb/snowflake-connector-python#284 is resolved there's not much we can do. I'll wrap the offending imports in both packages in a try/except block to add a bit more context and avoid confusion for users.
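For reference, the wrapped import might look roughly like this (a sketch, not the exact code that landed; the warning text is illustrative):

```python
import warnings

try:
    from azure.storage.blob import BlobServiceClient, generate_blob_sas
except ImportError:
    # Add context before re-raising: the most likely cause is the pin conflict
    # with snowflake-connector-python described above.
    warnings.warn(
        "Could not import azure.storage.blob. dagster-azure requires "
        "azure-storage-blob>=12.0, which conflicts with the version pinned by "
        "snowflake-connector-python (see snowflakedb/snowflake-connector-python#284); "
        "install dagster-azure and dagster-snowflake in separate environments."
    )
    raise
```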


The failing tests highlight one remaining problem: some tests require AZURE_-prefixed environment variables to be set in order to access a storage account and run live tests, similar to the S3/GCP ones. This really means:

  • a new Azure Data Lake storage account under your control, with a pre-created filesystem
  • updating the two conftest.py files to match the new storage account/filesystem
  • adding the secret key to CI under AZURE_STORAGE_ACCOUNT_KEY (a fixture along these lines is sketched below)
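For illustration, the kind of fixture involved might look like this (a hypothetical sketch; the real conftest.py fixtures and the storage account name will differ):

```python
import os

import pytest


@pytest.fixture(scope="session")
def storage_account():
    # Hypothetical account name; the real conftest points at an account you control.
    return "dagsterdatalake"


@pytest.fixture(scope="session")
def credential():
    # Tests marked 'nettest' only run when the key is available, e.g. in CI.
    key = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")
    if not key:
        pytest.skip("AZURE_STORAGE_ACCOUNT_KEY not set; skipping live Azure tests")
    return key
```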

Let me know your thoughts!

@sryza (Contributor) commented May 29, 2020

> The failing tests highlight one remaining problem: some tests require AZURE_-prefixed environment variables to be set in order to access a storage account and run live tests, similar to the S3/GCP ones. This really means: […]

Gotcha - I'm looking into creating those on the dagster side.

@sryza (Contributor) commented Jun 1, 2020

Hey @sd2k - @natekupp just added an Azure key to our build system, under the environment variable you mentioned. Are you able to rebase on the latest master to see if that addresses the test issues you were running into?

sd2k added 5 commits June 2, 2020 09:21
This adds the following components based on Azure Data Lake Storage
Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference
being configuration: Azure's SDK requires credentials to be passed
explicitly, so the credential is expected in configuration.

Tests currently require an access key to complete any tests marked
'nettest'.
sd2k added 7 commits June 2, 2020 09:21
Remove accidentally committed Azure Blob object/intermediate store implementations

These work but have no immediate use case and no tests, so seem like an
unnecessary maintenance burden. This commit can be reverted if they're needed!
This centralizes the various azure-storage/azure-core imports and wraps them,
plus the snowflake-connector import, in a try/except block, adding a custom
warning with a suggested solution if the import fails.
@sd2k force-pushed the dagster-azure branch from 3a9a6ea to ddb9c3d on June 2, 2020 08:21
@sd2k (Contributor, Author) commented Jun 2, 2020

@sryza Sure thing, I've rebased and fixed one more isort complaint. I'm guessing tests will still fail though because the Azure key added by @natekupp won't have access to the storage account/container configured in conftest.py - I can edit that to point to a storage account/container in your control once they've been created!

@sryza (Contributor) commented Jun 2, 2020

Good point - I made a PR against this PR to fix that: sd2k#1

sd2k and others added 8 commits June 2, 2020 21:24
Set buildkite container for dagster-azure tests
Summary:
This adds the following components based on Azure Data Lake Storage
Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference
being configuration: Azure's SDK requires credentials to be passed
explicitly, so the credential is expected in configuration.

Tests currently require an access key to complete any tests marked
'nettest'.

Rename Fake Azure classes and modules to more English-friendly names

Add ADLS2Resource to wrap ADLS2/Blob clients

Fix import order in dagster-azure

Add dagster-azure to install_dev_python_modules make target

Include azure-storage-blob in dagster-azure requirements

Remove unused variable in tests

Don't install dagster-azure as part of install_dev_python_modules make target

Remove accidentally committed Azure Blob object/intermediate store implementations

These work but have no immediate use case and no tests, so seem like an
unnecessary maintenance burden. This commit can be reverted if they're needed!

Wrap potentially incompatible imports to add a custom warning

This centralizes the various azure-storage/azure-core imports and wraps them,
plus the snowflake-connector import, in a try/except block, adding a custom
warning with a suggested solution if the import fails.

Add README to dagster-azure and note about incompatibility to dagster-snowflake's README

Isort

Set buildkite container for dagster-azure tests

Merge pull request #1 from dagster-io/dagster-azure

Set buildkite container for dagster-azure tests

Env variables in buildkite for Azure

Test Plan: bk

Differential Revision: https://dagster.phacility.com/D3238
@sd2k (Contributor, Author) commented Jun 3, 2020

@sryza I've fixed a few issues which popped up during the rebase + snowflake fix. I'm hopeful that tests will pass now!

@sryza (Contributor) commented Jun 3, 2020

I think the changes in this PR (specifically in .buildkite/pipelines.py) will also be needed to get things working: https://github.com/sd2k/dagster/pull/2/files

@sd2k (Contributor, Author) commented Jun 3, 2020

One test timed out for some mysterious reason and I'm unsure how to retry it, but otherwise I think this is good to go.

@sryza (Contributor) commented Jun 3, 2020

OK, this LGTM! Thanks for bearing with us on this Buildkite odyssey.

@sryza merged commit 59e5e67 into dagster-io:master on Jun 3, 2020
@sd2k deleted the dagster-azure branch on June 4, 2020 08:28
yuhan added a commit that referenced this pull request Feb 2, 2021
Summary:
#3573

#2483 (comment)

Test Plan:
in fresh virtualenv
```
~/dev/dagster arcpatch-D6207
$ make dev_install

~/dev/dagster arcpatch-D6207
$ test-D6207 $ pip freeze | grep azure
azure-common==1.1.26
azure-core==1.10.0
azure-storage-blob==12.3.2
azure-storage-file-datalake==12.0.2
-e [email protected]:dagster-io/dagster.git@bd35c74ff476078799a55650d70fa5c28b43d373#egg=dagster_azure&subdirectory=python_modules/libraries/dagster-azure

~/dev/dagster/docs arcpatch-D6207
test-D6207 $ make buildnext && cd next && yarn && yarn dev
```
{F536160}

Reviewers: sashank, nate, sandyryza

Reviewed By: nate, sandyryza

Differential Revision: https://dagster.phacility.com/D6207
mrdavidlaing pushed a commit to mrdavidlaing/dagster that referenced this pull request Feb 20, 2021
Summary:
dagster-io#3573

dagster-io#2483 (comment)

Test Plan:
in fresh virtualenv
```
~/dev/dagster arcpatch-D6207
$ make dev_install

~/dev/dagster arcpatch-D6207
$ test-D6207 $ pip freeze | grep azure
azure-common==1.1.26
azure-core==1.10.0
azure-storage-blob==12.3.2
azure-storage-file-datalake==12.0.2
-e [email protected]:dagster-io/dagster.git@bd35c74ff476078799a55650d70fa5c28b43d373#egg=dagster_azure&subdirectory=python_modules/libraries/dagster-azure

~/dev/dagster/docs arcpatch-D6207
test-D6207 $ make buildnext && cd next && yarn && yarn dev
```
{F536160}

Reviewers: sashank, nate, sandyryza

Reviewed By: nate, sandyryza

Differential Revision: https://dagster.phacility.com/D6207