Separate Pandas Dataframe Solid into Two Sources #1

Closed
schrockn opened this issue May 30, 2018 · 1 comment
Comments

@schrockn
Member

Right now, because of the initial architecture, the pandas dataframe_solid has a single source called "CSVORPARQUET" that takes a format argument. This should be separated into two distinct sources: one CSV, one Parquet.

@schrockn
Member Author

schrockn commented Jun 7, 2018

So I think this will be an excellent issue for you to work through some of the abstractions.

Currently this function creates a dependency between two dataframe-flavored solids:

def create_dagster_pd_dependency_input(solid):
    check.inst_param(solid, 'solid', Solid)

    def dependency_input_fn(context, arg_dict):
        check.inst_param(context, 'context', DagsterExecutionContext)
        path = check.str_elem(arg_dict, 'path')
        frmt = check.str_elem(arg_dict, 'format')

        # Dispatch on the 'format' argument, the coupling this issue proposes removing.
        df = _read_df(path, frmt)

        context.metric('rows', df.shape[0])

        return df

    return create_single_source_input(
        name=solid.name,
        source_fn=dependency_input_fn,
        argument_def_dict={
            'path': types.PATH,
            'format': types.STRING,
        },
        depends_on=solid,
        source_type='CSVORPARQUET',
    )

This should instead construct two separate sources for the input: one CSV, one Parquet. (Adding JSON would also make sense, etc.)
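A minimal sketch of the proposed split, using hypothetical stand-ins rather than the actual dagster API of the time (`create_source`, `_read_csv`, and `_read_parquet` are illustrative): each format gets its own source, so the `format` argument disappears.

```python
from typing import Any, Callable, Dict

def create_source(source_type: str, source_fn: Callable[[str], Any],
                  argument_def_dict: Dict[str, type]) -> Dict[str, Any]:
    # Illustrative stand-in for create_single_source_input: bundle a reader
    # with its argument schema under an explicit source type.
    return {
        "source_type": source_type,
        "source_fn": source_fn,
        "argument_def_dict": argument_def_dict,
    }

def _read_csv(path: str) -> str:
    return f"csv:{path}"  # stand-in for pd.read_csv(path)

def _read_parquet(path: str) -> str:
    return f"parquet:{path}"  # stand-in for pd.read_parquet(path)

# One source per format: no 'format' argument, no CSVORPARQUET dispatch.
csv_source = create_source("CSV", _read_csv, {"path": str})
parquet_source = create_source("PARQUET", _read_parquet, {"path": str})
```

An input would then accept a list of such sources, and a JSON source could be added the same way.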

sryza pushed a commit that referenced this issue Jun 2, 2020
Set buildkite container for dagster-azure tests
sryza pushed a commit that referenced this issue Jun 3, 2020
Summary:
This adds the following components based on Azure Data Lake Storage
Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference
being configuration: Azure's SDK requires credentials to be passed
explicitly, so the credential is expected in configuration.

Tests currently require an access key to complete any tests marked
'nettest'.

Rename Fake Azure classes and modules to more English-friendly names

Add ADLS2Resource to wrap ADLS2/Blob clients

Fix import order in dagster-azure

Add dagster-azure to install_dev_python_modules make target

Include azure-storage-blob in dagster-azure requirements

Remove unused variable in tests

Don't install dagster-azure as part of install_dev_python_modules make target

Remove accidentally committed Azure Blob object/intermediate store implementations

These work but have no immediate use case and no tests, so seem like an
unnecessary maintenance burden. This commit can be reverted if they're needed!

Wrap potentially incompatible imports to add a custom warning

This centralizes the various azure-storage/azure-core imports and wraps them,
plus the snowflake-connector import, in a try/except block, adding a custom
warning with a suggested solution if the import fails.
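The wrapping described above can be sketched generically; `optional_import` is an illustrative helper, not the actual dagster-azure code:

```python
import importlib
import warnings

def optional_import(module_name: str, hint: str):
    """Import module_name if possible; otherwise warn with a suggested fix."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Surface a custom warning with an actionable solution instead of
        # failing with a bare ImportError at call time.
        warnings.warn(f"Could not import {module_name}. {hint}")
        return None

# e.g. optional_import("azure.storage.blob",
#                      "Try `pip install azure-storage-blob`.")
```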

Add README to dagster-azure and note about incompatibility to dagster-snowflake's README

Isort

Set buildkite container for dagster-azure tests

Merge pull request #1 from dagster-io/dagster-azure

Set buildkite container for dagster-azure tests

Env variables in buildkite for Azure

Test Plan: bk

Differential Revision: https://dagster.phacility.com/D3238
sryza added a commit that referenced this issue Jun 3, 2020
* Add dagster-azure package with various storage components

This adds the following components based on Azure Data Lake Storage
Gen2 (and Azure Blob Storage where appropriate):

- ADLS2FileCache and adls2_file_cache
- ADLS2FileManager
- ADLS2IntermediateStore
- ADLS2ObjectStore
- the adls2_resource providing direct access to Azure Data Lake Storage
- the adls2_system_storage system storage

This is pretty similar to the S3 implementation, the main difference
being configuration: Azure's SDK requires credentials to be passed
explicitly, so the credential is expected in configuration.

Tests currently require an access key to complete any tests marked
'nettest'.

* Rename Fake Azure classes and modules to more English-friendly names

* Add ADLS2Resource to wrap ADLS2/Blob clients

* Fix import order in dagster-azure

* Add dagster-azure to install_dev_python_modules make target

* Include azure-storage-blob in dagster-azure requirements

* Remove unused variable in tests

* Don't install dagster-azure as part of install_dev_python_modules make target

* Remove accidentally committed Azure Blob object/intermediate store implementations

These work but have no immediate use case and no tests, so seem like an
unnecessary maintenance burden. This commit can be reverted if they're needed!

* Wrap potentially incompatible imports to add a custom warning

This centralizes the various azure-storage/azure-core imports and wraps them,
plus the snowflake-connector import, in a try/except block, adding a custom
warning with a suggested solution if the import fails.

* Add README to dagster-azure and note about incompatibility to dagster-snowflake's README

* Isort

* Set buildkite container for dagster-azure tests

* Env variables in buildkite for Azure

* more buildkite

* tox

* Run black on dagster_snowflake

* Add pylint to dagster-azure tox.ini

* Explicitly specify __all__ for dagster_azure utils modules

* Fix black issues

Co-authored-by: Sandy Ryza <[email protected]>
Co-authored-by: Sandy Ryza <[email protected]>
moonshot-s referenced this issue in wandb/dagster Dec 12, 2022
Add base wandb package resources
benpankow pushed a commit that referenced this issue May 5, 2023
yuhan pushed a commit that referenced this issue Aug 4, 2023
rexledesma added a commit that referenced this issue Nov 14, 2023
## Summary & Motivation
Following up on #17980, one
of the issues with memory usage in `dagster-dbt` is that if the
`manifest` is specified as a path, we potentially hold multiple copies
of the dictionary manifest in memory when creating our assets. This is
because we read from the path to create the dictionary manifest, and
then we hold a reference to the manifest dictionary in our asset
metadata. When the same path is read multiple times, multiple copies
ensue.

We can optimize this by implementing a cache when reading from the path,
so that the same dictionary manifest is returned even when a manifest
path is specified.
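A minimal sketch of the caching approach (not the actual dagster-dbt code): key the cache on the manifest path so repeated reads return the same dictionary object.

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def read_manifest(manifest_path: Path) -> dict:
    # Repeated calls with the same path return the same cached dict,
    # so only one copy of the manifest is held in memory.
    return json.loads(manifest_path.read_text())
```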

To further optimize this to hold no reference to the manifest
dictionary, we could instead hold a reference to the manifest path in
our asset metadata. However,

1. This only works if a manifest path is passed. If a manifest is
already specified as a dictionary, then we still have to hold it in
memory. This is the worst case.
2. (1) can be solved by preventing a manifest from being specified as a
dictionary, but that would be a breaking change.

## How I Tested These Changes
Use `tracemalloc`.

1. Create a `jaffle_dagster` project using `dagster-dbt project
scaffold` on `jaffle_shop`.
2. Update the code to read the manifest from a path in three separate
`@dbt_assets` definitions.
3. Run `PYTHONTRACEMALLOC=1 DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster
dev`
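The measurement itself can be reproduced with a few lines of `tracemalloc`; the allocation here is a stand-in for loading the `@dbt_assets` definitions:

```python
import tracemalloc

def top_allocations(n: int = 10):
    # Snapshot current allocations and return the top-n source lines by size.
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics("lineno")[:n]

tracemalloc.start()
_data = [bytes(1000) for _ in range(100)]  # stand-in workload
stats = top_allocations()
tracemalloc.stop()
for stat in stats:
    print(stat)
```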

**Before Change**
```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.4 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
- #3: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:17: 3262.9 KiB
-    manifest = cast(Mapping[str, Any], orjson.loads(manifest.read_bytes()))
#4: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 891.8 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53203 other: 38030.8 KiB
Total allocated size: 88177.9 KiB
```

**After Change**
```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.9 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1057.4 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.6 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53195 other: 38013.7 KiB
Total allocated size: 85957.5 KiB
```

**Baseline (one `@dbt_assets` definition)**

```diff
Top 10 lines
#1: <frozen importlib._bootstrap_external>:672: 37114.2 KiB
#2: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/text_unidecode/__init__.py:6: 3675.4 KiB
    _replaces = pkgutil.get_data(__name__, 'data.bin').decode('utf8').split('\x00')
#3: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/abc.py:106: 1313.7 KiB
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
+ #4: /Users/rexledesma/dagster-labs/dagster/python_modules/libraries/dagster-dbt/dagster_dbt/dbt_manifest.py:13: 1056.9 KiB
+    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
#5: /Users/rexledesma/.pyenv/versions/3.10.13/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py:481: 987.5 KiB
    result = type(typename, (tuple,), class_namespace)
#6: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:114: 978.7 KiB
    self.globals = globals().copy()
#7: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/mashumaro/core/meta/code/builder.py:263: 893.7 KiB
    exec(code, self.globals, self.__dict__)
#8: /Users/rexledesma/.pyenv/versions/3.10.13/envs/dagster/lib/python3.10/site-packages/google/protobuf/internal/builder.py:85: 697.9 KiB
    message_class = _reflection.GeneratedProtocolMessageType(
#9: <frozen importlib._bootstrap_external>:128: 655.2 KiB
#10: <frozen importlib._bootstrap_external>:1616: 569.6 KiB
53175 other: 37976.1 KiB
Total allocated size: 85918.9 KiB
```
johannkm added a commit that referenced this issue Nov 27, 2023
Bug report:
https://dagster.slack.com/archives/C04CW71AGBW/p1700237771181059?thread_ts=1697638899.590569&cid=C04CW71AGBW

Synopsis:

- At definition time, we evaluate @dbt_assets using
`DBT_INDIRECT_SELECTION=eager` [side note, I'm kind of surprised this
isn't configurable].
- At runtime, if you're using checks, we set
`DBT_INDIRECT_SELECTION=empty` so that we have the option to exclude
checks if they're not in the selection
- If we are in a subset, this works because we explicitly pass the
selected models and tests
- However, if we determine we're not in a subset, we pass the default
selection string, which may not select the tests now that we're not
using eager.

Solutions:

1. [this diff] If asset checks are enabled, always pass an explicit
selection of models and tests
2. The reason we try to use the default selection string when we can is
to make things readable from the dbt side. We could try to preserve that
by only setting `DBT_INDIRECT_SELECTION=empty` when we're doing a subset
and passing the explicit selection, otherwise use eager and the default
selection.
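Solution (1) can be sketched as follows; `build_dbt_selection` is a hypothetical helper, not the dagster-dbt API:

```python
from typing import List

def build_dbt_selection(default_select: str, selected_models: List[str],
                        selected_tests: List[str],
                        asset_checks_enabled: bool) -> List[str]:
    if asset_checks_enabled:
        # With DBT_INDIRECT_SELECTION=empty, tests only run if named
        # explicitly, so always pass the models and tests.
        return selected_models + selected_tests
    # Without checks, eager indirect selection picks up attached tests,
    # so the readable default selection string suffices.
    return [default_select]
```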

A concern for (2) is relationship tests. Take the case of two models:

- `customers`, with one non-null test
- `orders`, with one relationship test against `customers`

dbt will resolve the selection `customers` to the `customers` model plus both
the non-null test and the relationship test. If you're not using asset
checks this is fine; we'll emit observations for both tests. If you are
using checks, currently `@dbt_assets(...select="customers")` will only
define the non-null test. This makes sense: we wouldn't even display the
relationship test if we ingested it, because we didn't ingest the model
it's on.

With solution (1), we won't execute the relationship test (which matches the
current behavior). With solution (2), we would execute it whenever we switch
over to eager, which is unexpected because we didn't model it.

Side note: we should document that when you enable checks, relationship
tests on non-selected models will no longer run.

---------

Co-authored-by: Rex Ledesma <[email protected]>
cmpadden pushed a commit that referenced this issue Apr 11, 2024
Remove dagster SDAs and fix asset materialization name
cmpadden added a commit that referenced this issue Jul 23, 2024
cmpadden added a commit that referenced this issue Jul 26, 2024
cmpadden added a commit that referenced this issue Aug 8, 2024
cmpadden added a commit that referenced this issue Aug 23, 2024
cmpadden added a commit that referenced this issue Aug 23, 2024
cmpadden added a commit that referenced this issue Sep 4, 2024
## Summary & Motivation

- Fixes linting errors thrown by Vale (36 errors remain, but many are
TODOs or false-positives)

## How I Tested These Changes

- Running `vale` locally

## Changelog [New | Bug | Docs]

NOCHANGELOG
izzye84 pushed a commit that referenced this issue Sep 4, 2024