Allow --select and --exclude multiple times #7169

Merged
merged 7 commits into from
Mar 28, 2023

Conversation

dbeatty10
Contributor

@dbeatty10 dbeatty10 commented Mar 14, 2023

resolves #7158
resolves #6800

Problem

Currently, if a user does the following, only the last selector (users_rollup) takes effect and earlier ones (users) are silently ignored:

dbt run --select users --select users_rollup

Selection syntax is confusing enough as-is, and the fact that it accepts multiples but only uses the last one makes it even more confusing.

Options

#7158 describes two options for how to move forward:

  1. Combine multiple instances of --select as if it were a union. Similar for --exclude.
  2. Raise an error (with a helpful message) if more than 1 --select is supplied. Same thing for --exclude.

This PR fleshes out option 1.

Note: most of it would also be applicable if we choose to implement option 2 instead.

Implicitly, there's also an option 0 that keeps the status quo. If we choose option 0, I'd still advocate that we cherry-pick the two test cases in 2824f4a so we can actively affirm that this is the expected and desired behavior.
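
To make the difference between option 0 and option 1 concrete, here is a minimal sketch using Python's standard-library argparse as a stand-in. It is not dbt's click-based implementation; `build_parser` and the `demo` program name are hypothetical, and the "extend" action assumes Python 3.8 or later.

```python
import argparse


def build_parser(action):
    # The argparse action is the only thing that differs between the two behaviors.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--select", nargs="+", action=action, default=[])
    return parser


argv = ["--select", "users", "--select", "users_rollup"]

# Option 0 (status quo): each occurrence overwrites the previous one.
last_wins = build_parser("store").parse_args(argv).select

# Option 1 (union): each occurrence extends the accumulated list.
union = build_parser("extend").parse_args(argv).select

# With option 1, a single space-delimited flag yields the same result.
single = build_parser("extend").parse_args(["--select", "users", "users_rollup"]).select

print(last_wins)  # ['users_rollup']
print(union)      # ['users', 'users_rollup']
print(single)     # ['users', 'users_rollup']
```

Under option 1, repeating the flag and passing space-delimited values once become interchangeable, which is the equivalence this PR aims for.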

Breaking change

This is described as a breaking change in the changelog because the behavior changes for certain edge cases (see above).

🎩

Our custom MultiOption class itself doesn't currently have any unit tests. Its main form of testing is the functional tests that depend upon it.

We expect that any parameter with MultiOption should also have (type=tuple or type=ChoiceTuple) and multiple=True.

Here are the exceptions raised when either of those expectations is violated:

Not type=tuple and not type=ChoiceTuple:

  File "/Users/dbeatty/projects/dbt-core/core/dbt/cli/options.py", line 28, in __init__
    assert issubclass(option_type, tuple), msg
AssertionError: MultiOption named `output_keys` must be tuple or ChoiceTuple (rather than <class 'list'>)

Not multiple=True:

  File "/Users/dbeatty/projects/dbt-core/core/dbt/cli/options.py", line 22, in __init__
    assert multiple, msg
AssertionError: MultiOption named `output_keys` must have multiple=True (rather than None)
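
The two checks can be reproduced outside of click with a small standalone sketch. The function `validate_multi_option`, the helper `error_for`, and the empty `ChoiceTuple` class below are hypothetical stand-ins; only the message formats are taken from the tracebacks above.

```python
import inspect


class ChoiceTuple:
    """Hypothetical stand-in for dbt's ChoiceTuple; only its identity matters here."""


def validate_multi_option(name, multiple=None, option_type=None):
    # Expectation 1: the option must opt in to being supplied more than once.
    msg = f"MultiOption named `{name}` must have multiple=True (rather than {multiple})"
    assert multiple, msg

    # Expectation 2: the type is a tuple subclass or a ChoiceTuple instance.
    msg = f"MultiOption named `{name}` must be tuple or ChoiceTuple (rather than {option_type})"
    if inspect.isclass(option_type):
        assert issubclass(option_type, tuple), msg
    else:
        assert isinstance(option_type, ChoiceTuple), msg


def error_for(**kwargs):
    # Return the assertion message, or None when validation passes.
    try:
        validate_multi_option("output_keys", **kwargs)
    except AssertionError as exc:
        return str(exc)
    return None


print(error_for(multiple=True, option_type=list))
print(error_for(option_type=tuple))
```

Note that assert statements are skipped when Python runs with -O, so a check that must always run would need an explicit raise instead.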

Checklist

@cla-bot cla-bot bot added the cla:yes label Mar 15, 2023
@dbeatty10
Contributor Author

dbeatty10 commented Mar 15, 2023

@jtcohen6 we need to make a functional decision between one of these three options:

  • option 0 - status quo: accept multiple --select but only use the last one
  • option 1 - combine multiple instances of --select as if they were provided together, space delimited
  • option 2 - raise an error if multiple --select are supplied

We have the same options for --exclude, and I'm assuming we'll align that choice with the one made for --select.

I'm most attracted to option 1, as I think it would be the least surprising to the end user.
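
For comparison, option 2 could look roughly like the sketch below. It is plain Python rather than click, and `reject_repeated_flag` is a hypothetical helper that does not exist in dbt-core.

```python
def reject_repeated_flag(argv, flag):
    """Raise if `flag` appears more than once in argv (hypothetical helper)."""
    count = sum(1 for token in argv if token == flag)
    if count > 1:
        raise ValueError(
            f"{flag} was supplied {count} times; "
            "pass it once with space-delimited values instead"
        )
    return argv


# A single flag with space-delimited values passes through unchanged.
ok = reject_repeated_flag(["run", "--select", "users", "users_rollup"], "--select")

# A repeated flag is rejected with a message that points to the fix.
try:
    reject_repeated_flag(["run", "--select", "users", "--select", "users_rollup"], "--select")
    repeated_error = None
except ValueError as exc:
    repeated_error = str(exc)

print(repeated_error)
```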

@dbeatty10
Contributor Author

Besides --select and --exclude, there are two other parameters that use MultiOption:

  • --output-keys
  • --resource-type

So we'd need to make a similar decision among the three options for each of them as well. They could either be covered in this PR or deferred to one or more separate issues.

@jtcohen6
Contributor

@dbeatty10 Thanks for laying out the options so clearly!

I agree that option 0 (status quo) is the most confusing & least desirable.

My initial instinct was for option 2 (raise an error), but I think you've convinced me that option 1 is preferable. It's actually the default expected behavior for many CLIs (including click) when they accept multiple options.

@@ -178,7 +183,7 @@
"Space-delimited listing of node properties to include as custom keys for JSON output "
"(e.g. `--output json --output-keys name resource_type description`)"
),
- type=list,
+ type=tuple,
Contributor Author

Context

When I left this as type=list, a couple tests failed:

But when I updated it to type=tuple, they passed again:

Although it is not enforced, there is a comment here that says:

MultiOption options must be specified with type=tuple or type=ChoiceTuple

Suggestion

To avoid unforeseen issues in the future, maybe we should enforce that the configured type is one of the following whenever cls=MultiOption?

  • subclass of tuple
  • instance of ChoiceTuple

One option for enforcing this within the __init__ method of MultiOption:

option_type = kwargs.pop("type", None)
if inspect.isclass(option_type):
    assert issubclass(option_type, tuple), "type must be tuple or ChoiceTuple not {}".format(option_type)
else:
    assert isinstance(option_type, ChoiceTuple), "type must be tuple or ChoiceTuple not {}".format(option_type)

Contributor Author

I implemented this suggestion in 33c6cc4

@dbeatty10
Contributor Author

you've convinced me that option 1 is preferable. It's actually the default expected behavior for many CLIs (including click) when they accept multiple options.

@jtcohen6

Shall we apply option 1 to each of the following as well? Or just --select and --exclude for now?

  • --output-keys
  • --resource-type

I tested both --output-keys and --resource-type out by hand, and they worked well.

I think those four are all the ones using our custom MultiOption class right now.

Even when specifying cls=MultiOption, multiple=True also needs to be specified in order to opt-in to allowing multiple instances of --{argument}, so we need to explicitly decide which parameters should have multiple=True added.

@jtcohen6
Contributor

@dbeatty10 I agree with applying this change consistently across the board, to all existing MultiOption params.

Going forward, it sounds like any new param with MultiOption should also have type=tuple and multiple=True.

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Mar 15, 2023
@dbeatty10 dbeatty10 marked this pull request as ready for review March 15, 2023 18:15
@dbeatty10 dbeatty10 requested a review from a team as a code owner March 15, 2023 18:15
Comment on lines +19 to +30
# validate that multiple=True
multiple = kwargs.pop("multiple", None)
msg = f"MultiOption named `{self.name}` must have multiple=True (rather than {multiple})"
assert multiple, msg

# validate that type=tuple or type=ChoiceTuple
option_type = kwargs.pop("type", None)
msg = f"MultiOption named `{self.name}` must be tuple or ChoiceTuple (rather than {option_type})"
if inspect.isclass(option_type):
    assert issubclass(option_type, tuple), msg
else:
    assert isinstance(option_type, ChoiceTuple), msg
Contributor Author

Very open to feedback here -- I just stuffed quick-n-dirty validation in the easiest place possible.

Comment on lines +73 to +74
if value:
    value = tuple(flatten(value))
Contributor Author

I don't know why, but the value can be None. We only want to flatten it when it has a non-None value.
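
A minimal sketch of the guarded flatten, assuming the value arrives as one tuple per occurrence of the flag. The `flatten` shown here is a hypothetical one-level helper built on itertools and may differ from the helper dbt-core actually uses; `merge_occurrences` is likewise an illustrative name.

```python
from itertools import chain


def flatten(groups):
    # Hypothetical one-level flatten; dbt's own helper may differ.
    return chain.from_iterable(groups)


def merge_occurrences(value):
    # `value` holds one tuple per occurrence of the flag, or None when the flag is absent.
    if value:
        value = tuple(flatten(value))
    return value


print(merge_occurrences((("users",), ("users_rollup", "orders"))))
# ('users', 'users_rollup', 'orders')
print(merge_occurrences(None))
# None
```

The truthiness check leaves both None and an empty tuple untouched, so callers can still tell "flag absent" apart from "flag supplied".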

@leahwicz leahwicz requested review from a team and aranke and removed request for VersusFacit and colin-rogers-dbt March 22, 2023 15:59
@aranke
Member

aranke commented Mar 22, 2023

Maybe it's just me, but should the behavior be set intersection instead of set union?

Using a Pandas data frame as an example, I know that:

df
  .filter(f1)
  .filter(f2)

will return me rows that are in the intersection of filters f1 and f2.
On a lower level, rows that don't pass filter f1 aren't even evaluated for filter f2.

It's definitely a less ergonomic option (imagining a use case like Slim CI), but feels more mathematically correct.

As I write this, I'm not very convinced of this line of thinking, but I'll leave it up to you to ponder 😄
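
The chaining intuition can be checked with plain Python iterators instead of a data frame. This is only an analogy: the predicates `is_even` and `is_large` and the sample rows are made up for illustration.

```python
rows = [1, 2, 3, 4, 5, 6]
seen_by_second = []


def is_even(x):
    return x % 2 == 0


def is_large(x):
    return x > 2


def is_large_logged(x):
    # Record which rows actually reach the second filter.
    seen_by_second.append(x)
    return is_large(x)


# Chained filters: the second filter only sees rows that passed the first.
chained = list(filter(is_large_logged, filter(is_even, rows)))

# The same result expressed as a set intersection of the two filters.
intersection = sorted({x for x in rows if is_even(x)} & {x for x in rows if is_large(x)})

print(chained)         # [4, 6]
print(intersection)    # [4, 6]
print(seen_by_second)  # [2, 4, 6]
```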

Contributor

@ChenyuLInx ChenyuLInx left a comment

The changes look good to me implementation-wise! I will leave the choice of union vs. intersection to you @dbeatty.

@dbeatty10
Contributor Author

dbeatty10 commented Mar 28, 2023

Maybe it's just me, but should the behavior be set intersection instead of set union?

Good eyes @aranke ! 🤩

Suppose a user expresses --select multiple times like:

dbt run --select users --select users_rollup

Key question

Options

In this case, we have two options of how to combine each of the arguments:

  1. set union
  2. set intersection

Explanations

  1. With a set union
    • chaining uses a logical OR
    • union operator in dbt selection syntax
    • dbt run --select users --select users_rollup would be equivalent to dbt run --select users users_rollup
  2. With a set intersection
    • chaining uses a logical AND
    • intersection operator in dbt selection syntax
    • dbt run --select users --select users_rollup would be equivalent to dbt run --select users,users_rollup
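
The two ways of combining can be sketched with Python sets. The `matches` mapping and its node names are hypothetical; in dbt the matched nodes would come from resolving each selector against the project graph.

```python
# Hypothetical mapping from a selector to the nodes it matches.
matches = {
    "tag:nightly": {"users", "orders"},
    "path:marts": {"orders", "users_rollup"},
}


def combine_union(selectors):
    # Logical OR: a node is selected if any selector matches it.
    result = set()
    for selector in selectors:
        result |= matches[selector]
    return result


def combine_intersection(selectors):
    # Logical AND: a node is selected only if every selector matches it.
    return set.intersection(*(matches[selector] for selector in selectors))


selectors = ["tag:nightly", "path:marts"]
print(sorted(combine_union(selectors)))         # ['orders', 'users', 'users_rollup']
print(sorted(combine_intersection(selectors)))  # ['orders']
```

When the selectors are two distinct model names, such as users and users_rollup, each matches a different single node, so an intersection would select nothing at all.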

Pros of set union

  • Aligns with the union within the dbt selection syntax
    • dbt run --select users --select users_rollup is equivalent to dbt run --select users users_rollup
  • Users can still achieve intersections with this syntax
  • Can express slightly more complex selection on the command line before needing YAML selectors

Cons of set union

  • Surprising if you are expecting a logical AND like method chaining/cascading with Pandas DataFrames or piping in Unix systems.

Pros of set intersection

  • Behaves similarly to method chaining/cascading using dataframes like:
    df
      .filter(f1)
      .filter(f2)
    

Cons of set intersection

  • Surprising if you are expecting a logical OR

Proposal

My belief is that a majority of dbt users will expect dbt list -s this that and dbt list -s this -s that to be equivalent, so I'm in favor of set union.

Member

@aranke aranke left a comment

LGTM, thank you!

@dbeatty10 dbeatty10 merged commit c3c2b27 into main Mar 28, 2023
@dbeatty10 dbeatty10 deleted the dbeatty/7158-multiple-multioption branch March 28, 2023 16:50
dbeatty10 added a commit to dbt-labs/docs.getdbt.com that referenced this pull request Apr 3, 2023
resolves #3003

## What are you changing in this pull request and why?
See dbt-labs/dbt-core#7169 in addition to the
issues this resolves.

## Pages updated

- [Upgrading to
v1.5](https://deploy-preview-3093--docs-getdbt-com.netlify.app/guides/migration/versions/upgrading-to-v1.5#breaking-changes)
- [YAML
Selectors](https://deploy-preview-3093--docs-getdbt-com.netlify.app/reference/node-selection/yaml-selectors#exclude)
- [How does selection
work?](https://deploy-preview-3093--docs-getdbt-com.netlify.app/reference/node-selection/syntax#how-does-selection-work)

## 🎩 

[Netlify
preview](https://deploy-preview-3093--docs-getdbt-com.netlify.app/reference/node-selection/yaml-selectors#exclude)

### 1.4

<img width="500" alt="image"
src="https://user-images.githubusercontent.com/44704949/228322779-8b2f4d69-17a5-4326-8ced-7add389ebea0.png">

### 1.5

<img width="500" alt="image"
src="https://user-images.githubusercontent.com/44704949/228322883-08443f70-ea22-43db-a190-728b6e2e3b3c.png">

## Checklist
Uncomment if you're publishing docs for a prerelease version of dbt
(delete if not applicable):
- [x] Add versioning components, as described in [Versioning
Docs](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/versioningdocs.md)
- [x] Add a note to the prerelease version [Migration
Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/guides/migration/versions)
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
and [About
versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
so my content adheres to these guidelines.
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
4 participants