Adds dbt bootstrap subcommand #1238

@mikekaminsky
Contributor

mikekaminsky commented Jan 12, 2019

Addresses #1082

  • Introduces new dependency on oyaml
  • Needs tests
  • Works on postgres
  • Question: What will happen with late-binding views?

Example:

$ dbt bootstrap --schemas test --print-only --profiles-dir .

Running with dbt=0.13.0-a1
Bootstrapping the following schemas:
- test
--------------------
Design for relation: test.test_2
--------------------
version: 2
models:
- name: test_2
  description: 'TODO: Replace me'
  columns:
  - name: col_alpha
  - name: col_beta

--------------------
Design for relation: test.test
--------------------
version: 2
models:
- name: test
  description: 'TODO: Replace me'
  columns:
  - name: col_a
  - name: col_b
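
For readers who want to see the shape of the generation step, here is a minimal Python sketch that renders the same layout as the output above from a relation name and a list of column names. It is not the code in this PR (which depends on oyaml); the function names build_model_spec and render_schema_yml are hypothetical, and the YAML is assembled by hand only to keep the sketch free of dependencies.

```python
def build_model_spec(relation_name, column_names):
    """Return a version-2 schema spec for one relation as a plain dict."""
    return {
        "version": 2,
        "models": [
            {
                "name": relation_name,
                "description": "TODO: Replace me",
                "columns": [{"name": name} for name in column_names],
            }
        ],
    }


def render_schema_yml(spec):
    """Render the spec as YAML text in the layout shown in the example output.

    Assumption: names and descriptions contain no characters that need YAML
    escaping (for instance, no single quotes in the description).
    """
    lines = [f"version: {spec['version']}", "models:"]
    for model in spec["models"]:
        lines.append(f"- name: {model['name']}")
        lines.append(f"  description: '{model['description']}'")
        lines.append("  columns:")
        for column in model["columns"]:
            lines.append(f"  - name: {column['name']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_schema_yml(build_model_spec("test", ["col_a", "col_b"])))
```

A real implementation would more likely hand the dict to a YAML library than format strings by hand; the dict-first structure is what makes that swap easy.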

@mikekaminsky
Contributor Author

mikekaminsky commented Jan 13, 2019

Started working on adding tests for this and am blocked because I'm not sure how to add the new dependency to the docker / tox build (so far I've tried to keep my nose out of that part of the dbt build world).

Currently can't get the tests to run at all because they die on ModuleNotFoundError: No module named 'oyaml'.

@drewbanin
Contributor

Hey @mikekaminsky - thanks for making this PR!

This is an interesting and novel feature for dbt. Whereas dbt typically only runs the code defined in a dbt project, this will make dbt generate code! That's cool, powerful, and oft-requested to be sure. At the same time, so much of the complexity involved here is not technical in nature, but instead is a reflection of all the variability in how dbt users write and run their code.

There are a lot of questions to answer about how this should work:

  1. if dbt can generate schema.yml files, shouldn't it also generate base models?
  2. what is a command structure that sensibly extends to generating these base models? (I think bootstrap isn't quite it)
  3. Where do these files go?
    a. it looks to me like you're building a path of models/schema.yml -- this would only work if run from the root of the project
    b. should dbt try to incorporate model schemas into an existing schema.yml? E.g. if you start working on a model, generate a schema, and then make changes to the model SQL?
  4. this looks like it runs for all of the tables in a schema -- is that a typical workflow for folks that use schema.yml files?

These are all important questions, and they're the kinds of things I consider when speccing out new features. The good news is, I think we're well suited to answer them :)

Can you pause on this PR so we can discuss some of the details? This is the first foray into a whole new class of functionality in dbt, so I want to be super sure that we get it right!

Finally, while I'm super sold on the benefit of functionality like this, I'm not fully convinced that the code should live inside of dbt. Another way for this to work is by exposing an API that external scripts can consume. That would make it possible/easy to build whole suites of tools that specialize in exactly this type of code generation. Further, those types of tools can be iterated on with much greater frequency than we intend to practice with dbt. For my part, I'll have a deep think about where code like this belongs.

Super happy to discuss when you have the time!

@mikekaminsky
Contributor Author

mikekaminsky commented Jan 14, 2019

@drewbanin some quick hits on the easy questions

> This is an interesting and novel feature for dbt. Whereas dbt typically only runs the code defined in a dbt project, this will make dbt generate code!

I'm not sure this is entirely true! dbt docs is definitely in the code-generating business :)

> if dbt can generate schema.yml files, shouldn't it also generate base models?

Maybe(?) I'm less convinced that this is super critical. One of the nice things about this feature is that it works with models you've created in dbt (for example, you're working on the code in your new analytics schema and you want to bootstrap the design files).

This isn't just for getting started with "upstream" source tables.

Edit (2019-01-14 5:55pm): To be clear, if we want dbt to be able to add the base models, I think that would go in a separate subcommand.

> what is a command structure that sensibly extends to generating these base models? (I think bootstrap isn't quite it)

I'm not sure what you're asking here -- do you just want to change the subcommand name from bootstrap to something else?

> Where do these files go?
> a. it looks to me like you're building a path of models/schema.yml -- this would only work if run from the root of the project

Fair point. I took a second to look around for a variable that would have the right file location in it but I couldn't find anything. Happy to update if we have a better way of identifying that location?
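
One possible approach, sketched below in Python, is to walk up from the current directory until a dbt_project.yml is found and build the path from there, so the command no longer depends on being run from the project root. The helper names find_project_root and schema_yml_path are hypothetical, and the sketch assumes the models live in a directory called models rather than reading the configured model paths from the project file.

```python
from pathlib import Path


def find_project_root(start):
    """Return the nearest ancestor of `start` (inclusive) containing dbt_project.yml.

    Returns None when no such directory exists.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / "dbt_project.yml").is_file():
            return candidate
    return None


def schema_yml_path(start, model_dir="models"):
    """Build the schema.yml path relative to the project root, not the cwd.

    Assumption: `model_dir` is the project's model directory; a fuller version
    would read it from dbt_project.yml instead of hard-coding a default.
    """
    root = find_project_root(start)
    if root is None:
        raise FileNotFoundError(f"no dbt_project.yml found above {start}")
    return root / model_dir / "schema.yml"
```

Calling schema_yml_path(Path.cwd()) from any subdirectory of a project would then resolve to the same file.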

> b. should dbt try to incorporate model schemas into an existing schema.yml? Eg. if you start working on a model, generate a schema, and then make changes to the model SQL?

I punted on this. Dealing with trying to update the yaml seemed like a PITA. I added the print-only feature so that you can at least get the output and then combine it with what's existing by hand.
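
To give a sense of what an automated merge might involve, here is a rough Python sketch that works on already-parsed model entries (lists of dicts) rather than on YAML text. It is only one possible policy, chosen for illustration: existing entries always win, and only missing models and columns are added. The function name merge_model_specs is hypothetical, and the sketch ignores comment and formatting preservation, which is the genuinely painful part.

```python
import copy


def merge_model_specs(existing_models, generated_models):
    """Merge generated model entries into existing ones without overwriting.

    Existing descriptions, tests, and column entries are kept untouched; only
    models and columns missing from the existing spec are added.
    """
    merged = copy.deepcopy(existing_models)
    by_name = {model["name"]: model for model in merged}
    for generated in generated_models:
        current = by_name.get(generated["name"])
        if current is None:
            # Brand-new model: take the generated entry as-is.
            new_model = copy.deepcopy(generated)
            merged.append(new_model)
            by_name[new_model["name"]] = new_model
            continue
        columns = current.setdefault("columns", [])
        known = {column["name"] for column in columns}
        for column in generated.get("columns", []):
            if column["name"] not in known:
                # New column on a known model: append it after the existing ones.
                columns.append(copy.deepcopy(column))
                known.add(column["name"])
    return merged
```

Columns that were dropped from the model SQL are left in place here; whether to prune them is exactly the kind of policy question raised above.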

> this looks like it runs for all of the tables in a schema -- is that a typical workflow for folks that use schema.yml files?

That's what I've wanted to do in the past (and that's how I wrote the GH issue). Would be really easy to add a --table selector if people want to do this one-by-one.
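
As a rough illustration of what such a selector could look like, the Python sketch below parses --schemas, --print-only, and an optional table filter with argparse, then narrows a list of (schema, table) pairs. The --tables flag, the parser, and the select_relations helper are all hypothetical and are not part of this PR or of dbt's CLI.

```python
import argparse


def build_parser():
    """Build a stand-alone parser mirroring the flags discussed in this thread."""
    parser = argparse.ArgumentParser(prog="bootstrap-sketch")
    parser.add_argument("--schemas", nargs="+", required=True,
                        help="schemas to inspect")
    parser.add_argument("--tables", nargs="+", default=None,
                        help="hypothetical: limit output to these tables")
    parser.add_argument("--print-only", action="store_true",
                        help="print the generated YAML instead of writing files")
    return parser


def select_relations(relations, schemas, tables=None):
    """Filter (schema, table) pairs by schema and, optionally, by table name.

    Matching is case-insensitive; when `tables` is None every table in the
    requested schemas is kept, which matches the all-tables behaviour.
    """
    wanted_schemas = {schema.lower() for schema in schemas}
    wanted_tables = None if tables is None else {table.lower() for table in tables}
    return [
        (schema, table)
        for schema, table in relations
        if schema.lower() in wanted_schemas
        and (wanted_tables is None or table.lower() in wanted_tables)
    ]
```

Leaving the table filter optional keeps the whole-schema workflow as the default while allowing the one-by-one workflow on request.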


As to whether or not this should live in dbt ... that's obviously a tough question. I think yes, because this functionality is really tightly coupled with the particulars of dbt and fits well into the dbt workflow.

Maybe what you're suggesting is that you want to factor the CLI / workflow components out of dbt and have "dbt" be only the core model-running / testing kernel. That's an interesting idea, but a pretty big departure from the paradigm today. If you wanted to move in that direction, you'd probably pull the dbt docs, dbt compile, and dbt compare commands out of dbt and into dbt-tools (those are the commands that seem mostly like "helper" or "admin" commands that aren't directly related to the running of dbt).

Without doing extensive reflection, I'm not sure that's the right way to go insofar as it requires analysts / DBAs to learn two different tools for working with the database. However, maybe it's better to pull this apart sooner rather than later in the interest of unix-style do-one-thing-really-well CLIs.

HMU on slack if you want to chat!

@drewbanin
Contributor

Thanks for the comments - it's super clear to me that you gave this a lot of thought! I didn't enumerate my questions to pick apart your PR -- I'm super aware that this is a WIP and happy to work together on the specifics once we get the overall design sorted. Rather, I wanted to give you a sense for how I evaluate the complexity of features and indicate that I think the hard part of this PR is around UX and not technical feasibility.

I think the only point I'd contest from your response is:

> I'm not sure this is entirely true! dbt docs is definitely in the code-generating business :)

I actually don't think this is true, and it's a super important distinction (and the basis for my current opinion on this feature!). The dbt docs generate command produces compiled assets -- these things are rendered into the target/ directory and are not typically version controlled. That's super different from a command which generates source code.

The operative difference is that people have strong opinions on how source code should be written, whereas they tend to be less opinionated on things like compiled assets. Whereas no one has strong feelings about the mess of compiled code that's in the rendered index.html file for docs, I imagine there are going to be tons of thoughts about how dbt should format these schema.yml files, where they should live, and what gets included (columns, tests, etc.).

It's good and reasonable that different teams have different preferences about how to structure these things, but I'm averse to the idea that dbt should 1) implement an opinion on the matter and 2) be the arbiter of that stylistic decision. This is in the class of PRs that I'm interested in reviewing exactly once, but I can totally imagine folks adding flags left and right for minor updates to this command's functionality. So: that's the big problem. I think that it's good and reasonable for folks to want to tweak this feature to their needs, but I'm opposed to having them implement those stylistic tweaks via dbt's PR process.

Ok, so how do we proceed from here? I want to paint a different picture of how this feature could be implemented that's more closely aligned with the core feature set of dbt. Imagine you created a macro that can generate a schema.yml spec for a given model, and also imagine that dbt had some mechanism to invoke macros dynamically from the command line, passing along any supplied CLI flags as macro arguments. That would make it possible to implement this feature, as well as tons of other related features, without actually modifying dbt code.

It would be pretty easy to extend this macro to sources or even to base models. Macros like these could be wrapped up in dbt packages, and folks could tweak them to suit their precise needs without needing to fork dbt.
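
To make the idea concrete, here is a toy Python sketch of the dispatch mechanism described above: a registry of named "macros" and an entry point that turns key=value command-line pairs into keyword arguments. It is purely illustrative. Real dbt macros are written in Jinja, and the names MACROS, macro, parse_cli_args, invoke_macro, and generate_model_yaml are all hypothetical rather than dbt APIs.

```python
MACROS = {}


def macro(func):
    """Register a callable under its own name so it can be looked up later."""
    MACROS[func.__name__] = func
    return func


@macro
def generate_model_yaml(model_name, columns=""):
    """Stand-in for a user-defined macro that emits a schema.yml spec."""
    column_names = [name.strip() for name in columns.split(",") if name.strip()]
    lines = [
        "version: 2",
        "models:",
        f"- name: {model_name}",
        "  description: 'TODO: Replace me'",
        "  columns:",
    ]
    lines.extend(f"  - name: {name}" for name in column_names)
    return "\n".join(lines) + "\n"


def parse_cli_args(pairs):
    """Turn ['model-name=test', ...] into {'model_name': 'test', ...}."""
    parsed = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key.replace("-", "_")] = value
    return parsed


def invoke_macro(name, cli_pairs):
    """Look up a registered macro by name and call it with the parsed arguments."""
    if name not in MACROS:
        raise KeyError(f"unknown macro: {name}")
    return MACROS[name](**parse_cli_args(cli_pairs))


if __name__ == "__main__":
    print(invoke_macro("generate_model_yaml", ["model-name=test", "columns=col_a,col_b"]))
```

The point of the design is that the dispatcher knows nothing about schema.yml formatting: a team that wants a different layout swaps the registered macro, not the tool.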

To be sure, there are still some challenges associated with making dbt do exactly the thing described here, but these changes are very much directionally aligned with the changes we want to make to dbt long-term. As such, I find them way more compelling than one-off features that accomplish the same end goal. Whereas the operation + macro approach represents a doubling down on dbt's position as a code compiler, a new top-level command essentially becomes a maintenance burden!

I just threw a whole lot at you. I'm very curious to hear what you think about all of this, both with regard to 1) the implementation of this particular feature and 2) the types of features that I think are well-suited to live in dbt. If you buy the approach, then I'm super happy to spec out the work that needs to happen for us to get there.

A meta-point

As dbt grows in popularity and complexity, it's going to be increasingly important for the Fishtown team to identify which issues we think can be picked up right away, and which ones should probably be discussed further before beginning implementation. I'm super glad you've been contributing code to dbt, and I definitely don't want to discourage that kind of behavior! For my part, I'll comb over our outstanding issues and tag the gnarlier ones with a "talk to us first" label or something like that. I'm interested in making community contributions a well-oiled machine, so definitely let me know if you have any thoughts on the matter!

@mikekaminsky
Contributor Author

Closing this (potentially temporarily) while thinking through the future of dbt compare and dbt bootstrap (#1217)
