Add dagster-azure package #2483
Conversation
```python
file = self.file_system_client.create_file(key)
with file.acquire_lease(self.lease_duration) as lease:
    with BytesIO() as bytes_io:
```
Since this is used in s3/object_store, adls2/object_store, and maybe blob/object_store as well, I think it's worth factoring out into a shared helper.
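A minimal sketch of what such a shared helper might look like. The function name and signature are hypothetical, not the actual dagster API; the idea is just to pull the common serialize-into-BytesIO step out of each object store:

```python
from io import BytesIO


def serialize_to_stream(obj, serialization_strategy):
    """Serialize ``obj`` into an in-memory stream and rewind it.

    Hypothetical shared helper: the s3, adls2, and potentially blob
    object stores each repeat this serialize-then-upload dance, so the
    common part could live in one place. ``serialization_strategy`` is
    assumed to expose a ``serialize(obj, stream)`` method.
    """
    bytes_io = BytesIO()
    serialization_strategy.serialize(obj, bytes_io)
    bytes_io.seek(0)  # rewind so the caller can upload from the start
    return bytes_io
```

Each backend would then only implement the upload of the returned stream (for ADLS2, inside the acquired lease).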
Ah, good point. I committed it originally but haven't had time to add tests for the blob storage alternatives, so removed it later. I think I'll remove the comment for now and create a separate PR later for these, unless you'd rather have them all at once?

Aside: the implementations are very similar, so there's probably a lot of room for abstraction here, but I don't feel super qualified to do that!
On Wed, 20 May 2020, Sandy Ryza commented on this pull request:
In python_modules/libraries/dagster-azure/dagster_azure/adls2/intermediate_store.py:
```diff
+from dagster import check
+from dagster.core.storage.intermediate_store import IntermediateStore
+from dagster.core.storage.type_storage import TypeStoragePluginRegistry
+
+from .object_store import ADLS2ObjectStore
+
+
+class ADLS2IntermediateStore(IntermediateStore):
+    '''Intermediate store using Azure Data Lake Storage Gen2.
+
+    This intermediate store uses ADLS2 APIs to communicate with the storage,
+    which are better optimised for various tasks than regular Blob storage.
+
+    If your storage account does not have the ADLS Gen2 hierarchical namespace enabled
+    this will not work: use the
+    :py:class:`~dagster_azure.blob.intermediate_store.AzureBlobIntermediateStore`
```
I don't see that in this PR - am I missing something?
Sounds great. In general, it's easier for us to review and merge work when it's broken up into smaller changes. blob_fake_resource.py should probably be removed too, right?
That has raised another point though. If a user were to use the …

… which I don't love, but it's not too bad. It feels more palatable if it's extracted into a function like …
Force-pushed from 5edc68a to dea3dbb
Would we be able to address this by having …
This is looking close! I kicked off a test run, and it looks like there are some lint errors and test failures. Let me know if you have any trouble viewing those.
I've resolved most comments now, so I think it's close to ready! Tests seem to be passing on CI, but I can't tell where the lint errors are when running … locally.

A couple of outstanding things: …
Regarding the lint issues, are you able to access this link? https://buildkite.com/dagster/dagster/builds/12880#c98133bb-0a6a-4eeb-b1bc-753551c951bc Here's what I'm seeing: …

For the "isort" issues, you should be able to run …
Oof, this snowflake dependency issue is tough. Where do you end up observing the error? In dagster-examples? dagster-azure and dagster-snowflake don't depend on each other, right?
Cheers, done and committed.

Yep, I think they may be because I hadn't included dagster-azure in the …

Yeah, it was just in dagster-examples, although it might also happen when installing the dev python modules now, since both packages will be included there...
I just kicked off another test build. Aside from the snowflake issue, this LGTM! When you have some time, ping me on Slack and we can brainstorm on what to do about snowflake.
@sryza and I discussed the dependency conflict between snowflake-connector-python and dagster-azure on Slack and found: …

The lack of a solution isn't great, but until snowflakedb/snowflake-connector-python#284 is resolved there's not much that we can do. I'll wrap the offending imports in both packages in a try/except block to add a bit more context and avoid too much confusion for users. The failing tests highlight one remaining problem: some tests require some …

Let me know your thoughts!
Gotcha - I'm looking into creating those on the dagster side.
This adds the following components based on Azure Data Lake Storage Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource, providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference being configuration: Azure's SDK requires credentials to be passed explicitly, so the credential is expected in configuration. Tests currently require an access key to complete any tests marked 'nettest'.
Remove accidentally committed Azure Blob object/intermediate store implementations

These work but have no immediate use case and no tests, so seem like an unnecessary maintenance burden. This commit can be reverted if they're needed!
This centralizes the various azure-storage/azure-core imports and wraps them, plus the snowflake-connector import, in a try/except block, adding a custom warning with a suggested solution if the import fails.
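A sketch of that import-wrapping pattern, with a generic helper standing in for the real module; the helper name, the warning text, and the example module path are illustrative, not the actual dagster-azure code:

```python
import importlib
import warnings


def import_with_context(module_name, extra_context):
    """Import ``module_name``; on failure, warn with a suggested fix.

    Mirrors the idea described above: centralize fragile imports
    (azure-storage, azure-core, snowflake-connector) in one place and
    attach actionable context instead of surfacing a bare ImportError.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        warnings.warn(
            "Could not import {}. {}".format(module_name, extra_context)
        )
        raise


# Hypothetical usage guarding an azure-storage import:
# blob = import_with_context(
#     "azure.storage.blob",
#     "dagster-azure requires azure-storage-blob>=12; note that "
#     "snowflake-connector-python currently pins an incompatible version.",
# )
```

The re-raise keeps the original traceback, so the warning adds context without masking the failure.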
Add README to dagster-azure and note about incompatibility to dagster-snowflake's README
@sryza Sure thing, I've rebased and fixed one more isort complaint. I'm guessing tests will still fail though, because the Azure key added by @natekupp won't have access to the storage account/container configured in conftest.py - I can edit that to point to a storage account/container in your control once they've been created!
Good point - I made a PR against this PR to fix that: sd2k#1
Set buildkite container for dagster-azure tests
Summary: This adds the following components based on Azure Data Lake Storage Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource, providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference being configuration: Azure's SDK requires credentials to be passed explicitly, so the credential is expected in configuration. Tests currently require an access key to complete any tests marked 'nettest'.

Squashed commits:

- Rename fake Azure classes and modules to more English-friendly names
- Add ADLS2Resource to wrap ADLS2/Blob clients
- Fix import order in dagster-azure
- Add dagster-azure to install_dev_python_modules make target
- Include azure-storage-blob in dagster-azure requirements
- Remove unused variable in tests
- Don't install dagster-azure as part of install_dev_python_modules make target
- Remove accidentally committed Azure Blob object/intermediate store implementations: these work but have no immediate use case and no tests, so seem like an unnecessary maintenance burden. This commit can be reverted if they're needed!
- Wrap potentially incompatible imports to add a custom warning: this centralizes the various azure-storage/azure-core imports and wraps them, plus the snowflake-connector import, in a try/except block, adding a custom warning with a suggested solution if the import fails.
- Add README to dagster-azure and note about incompatibility to dagster-snowflake's README
- Isort
- Set buildkite container for dagster-azure tests
- Merge pull request #1 from dagster-io/dagster-azure: Set buildkite container for dagster-azure tests
- Env variables in buildkite for Azure

Test Plan: bk

Differential Revision: https://dagster.phacility.com/D3238
@sryza I've fixed a few issues which popped up during the rebase + snowflake fix. I'm hopeful that tests will pass now!

I think the changes in this PR (specifically in .buildkite/pipelines.py) will also be needed to get things working: https://github.com/sd2k/dagster/pull/2/files

One test timed out for some mysterious reason and I'm unsure how to retry it, but otherwise I think this is good to go.

Ok, this LGTM! Thanks for bearing with us on this buildkite odyssey.
Summary: #3573 #2483 (comment)

Test Plan: in fresh virtualenv

```
~/dev/dagster arcpatch-D6207 $ make dev_install
~/dev/dagster arcpatch-D6207 test-D6207 $ pip freeze | grep azure
azure-common==1.1.26
azure-core==1.10.0
azure-storage-blob==12.3.2
azure-storage-file-datalake==12.0.2
-e [email protected]:dagster-io/dagster.git@bd35c74ff476078799a55650d70fa5c28b43d373#egg=dagster_azure&subdirectory=python_modules/libraries/dagster-azure
~/dev/dagster/docs arcpatch-D6207 test-D6207 $ make buildnext && cd next && yarn && yarn dev
```

{F536160}

Reviewers: sashank, nate, sandyryza
Reviewed By: nate, sandyryza
Differential Revision: https://dagster.phacility.com/D6207
This PR adds the dagster-azure package, which provides various storage component implementations using Azure Data Lake Storage. New components are:

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource, providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage
This is pretty similar to the S3 implementation, the main difference being configuration: Azure's SDK requires credentials to be passed explicitly, so the credential is expected in configuration.
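As a rough illustration of what "the credential is expected in configuration" could look like in a run config, here is a hedged sketch; the resource name and field names (`storage_account`, `credential`, `key`) are assumptions about the schema, not taken verbatim from this PR:

```yaml
# Illustrative only: key names are assumed, not confirmed by this PR.
resources:
  adls2:
    config:
      storage_account: my_storage_account   # hypothetical account name
      credential:
        key: "<azure-storage-access-key>"   # passed explicitly, per Azure's SDK
```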
Tests currently require an access key to complete any tests marked 'nettest'. I guess this will need to be passed over to a new Azure storage account under dagster's control at some point.
One other issue I just remembered is a dependency version conflict with the snowflake-connector-python package, which is being tracked here. Not quite sure how to resolve that...