[CT-1808] [Spike] "diff"-based entry point into partial parsing #6592

Closed
jtcohen6 opened this issue Jan 12, 2023 · 10 comments · Fixed by #6873
Labels
file_system (How dbt-core interoperates with file systems to read/write data), partial_parsing, python_api (Issues related to dbtRunner Python entry point), spike

Comments

@jtcohen6
Contributor

See dbt-labs/dbt-server#128

The gist of current partial parsing, as I understand it:

  1. Perform a number of checks to determine if a full re-parse is needed, or which files need to be re-parsed on account of changes to --vars, env vars, target/profile, etc (build_manifest_state_check)
  2. If partial parsing is possible: inspect the full relevant file system, using system timestamps + file hashes to detect changes to source files
  3. Based on files changed: schedule nodes defined in those files (+ other nodes related to those nodes + other files affecting those nodes) for re-parsing

Rather than requiring dbt-core to inspect the full relevant file system, could we provide some sort of interface for saying, "These files have been added / modified / deleted," and skip straight to step 3? (This assumes that the check in step 1 has already taken place, even if the logic for it needs to live elsewhere.)

The potential data structure for this "diff" could (but doesn't have to) match up closely with the file_diff object that partial parsing constructs + uses today. Three thoughts:

  • The current file_diff distinguishes between "schema files" (.yml) and other files
  • The current file_diff is just a set of file_id pointers, not the actual content of those files — that could be sufficient here too, but the most compelling version of this capability also enables us to pass in file contents programmatically
  • While it would be great to use dbt-core's internal file_id as the unique identifier here (project_name://subdir/path/file.ext), it would require an external process to know the project's name (defined in dbt_project.yml), which might be a faulty assumption.
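To make the file_id format concrete, here is a tiny sketch of constructing and splitting one (function names are illustrative, not dbt-core's API):

```python
def make_file_id(project_name: str, relative_path: str) -> str:
    """Build a file_id in the project_name://subdir/path/file.ext form."""
    return f"{project_name}://{relative_path}"


def split_file_id(file_id: str) -> tuple:
    """Recover (project_name, relative_path) from a file_id."""
    project, _, path = file_id.partition("://")
    return project, path
```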

Imagining something like:

{
  "added": [
    {
        "path": "models/new_model.sql",
        "content": "... lots of jinja + sql ...",
    },
    {
        "path": "models/docs.md",
        "content": "... some markdown ...",
    },
  ],
  "changed": [
    {
        "path": "models/my_py_model.py",
        "content": "... some python ...",
    },
  ],
  "deleted": [
    "models/other.sql",
    "models/other/schema.yml",
  ],
}
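The payload above could be modeled as a small set of dataclasses (the class names here are illustrative stand-ins, not dbt-core's actual types):

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """One added or changed file: a path plus its raw contents."""
    path: str
    content: str


@dataclass
class FileDiff:
    """Added/changed files carry contents; deleted files are just paths."""
    added: list = field(default_factory=list)
    changed: list = field(default_factory=list)
    deleted: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "FileDiff":
        return cls(
            added=[FileEntry(**e) for e in data.get("added", [])],
            changed=[FileEntry(**e) for e in data.get("changed", [])],
            deleted=list(data.get("deleted", [])),
        )
```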

Programmatically, we'd call a parse with pointers to both:

  1. existing manifest — although I think this would need to be partial_parse.msgpack, not just manifest.msgpack, given the need for all the between-file links that power partial parsing
  2. "diff," either in a file or passed in as a data structure from memory
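To illustrate what applying such a diff to in-memory state might look like, here is a sketch that updates a {path: content} mapping (as a stand-in for the file state held by the parsing manifest); the function name and shapes are illustrative only:

```python
def apply_diff(files: dict, diff: dict) -> dict:
    """Apply an added/changed/deleted diff payload to a {path: content}
    mapping, returning a new mapping and leaving the input untouched.
    Illustrative only — not dbt-core's actual entry point."""
    updated = dict(files)
    for entry in diff.get("added", []) + diff.get("changed", []):
        updated[entry["path"]] = entry["content"]
    for path in diff.get("deleted", []):
        updated.pop(path, None)
    return updated
```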
@github-actions github-actions bot changed the title [Spike] "diff"-based entry point into partial parsing [CT-1808] [Spike] "diff"-based entry point into partial parsing Jan 12, 2023
@ChenyuLInx
Contributor

@jtcohen6 I talked to Gerda the other day, and she mentioned that supporting a different entry point/interface (manifest + file updates = new manifest) can be done as a task instead of spiking ahead. I think we can just figure out exactly what the data structure of file updates/diffs is going to be, and start the task without the spike.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 20, 2023

@ChenyuLInx Sounds great!

I snuck this point in at the bottom, but I think it's pretty important, and worth clarifying with other teams:

existing manifest — although I think this would need to be partial_parse.msgpack, not just manifest.msgpack, given the need for all the between-file links that power partial parsing

Based on my current understanding, this would need to be the manifest used by partial parsing (partial_parse.msgpack), which has additional information for resolving inter-file dependencies. We'd need dbt-server to manage that state, in addition to the "execution" manifest. Put another way, the state input to (re-)parsing and the state input to execution might need to be slightly different.

Update: Chatted with @gshank, it sounds like these methods would indeed serialize the full manifest (including parsing/file-related methods) to msgpack. I was thinking about WriteableManifest.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 23, 2023

Another consideration: If there's a need to trigger a full re-parse, e.g. because of a change in dbt_project.yml or --vars passed in on a command, dbt-core will once again need access to the full set of source files.

If we only apply the diffs to the existing "parsing" manifest, but don't actually apply those diffs to a backing set of source files, how would we be able to fall "back" to the full set of source files for a full re-parse? In those cases, would it be acceptable to trigger a full file re-sync?

Update: Chatted with @gshank. We have the full raw file contents in the serialized manifest (and the dictionary version of .yml files). If there's a change that requires a full re-parse (e.g. dbt_project.yml), we'd simply schedule every "file" for re-parsing, but we wouldn't need to actually go back to the source file system.
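The fallback described above can be sketched as a simple scheduling decision: if a change invalidates partial parsing, schedule everything already held in the manifest; otherwise schedule only the changed files. The function name and trigger set below are illustrative assumptions, not dbt-core's actual logic:

```python
# Hypothetical set of files whose change forces a full re-parse.
FULL_REPARSE_TRIGGERS = {"dbt_project.yml", "profiles.yml"}


def files_to_reparse(manifest_files: dict, changed: list) -> list:
    """Decide which files to re-parse. On a full re-parse, every file the
    manifest already holds is scheduled — contents are stored in the
    manifest, so no trip back to the source file system is needed."""
    if FULL_REPARSE_TRIGGERS & set(changed):
        return sorted(manifest_files)
    return sorted(set(changed) & set(manifest_files))
```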

Related to dbt-labs/dbt-server#128 (comment), in that we might still want a way to expose/share the logic about when a full re-parse is necessary.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 26, 2023

Would need some special cases for changes to dbt_project.yml and profiles.yml (or when their values depend on --vars). Let's open a separate issue to track that work, since it would need to be resolved before we get into parsing.

We also don't store seed contents in the manifest today. We could start storing seeds below a certain size limit; we generally document <1 MB as the officially supported size. Or, the service wrapping dbt-core needs to store & sync seed contents by other means.

Scope for this issue: accept a programmatic payload (like the one above) for project file contents (excluding seeds), and plumb it into partial parsing.

New issues:

  • changes to dbt_project.yml + profiles.yml (whether edited explicitly, or because they depend on --vars or env vars that have changed)
  • changes to seeds

@jtcohen6
Contributor Author

Opened #6777 to track known edge cases, which will be out of scope for the "happy path" that we're pursuing in this first round of work

@gshank
Contributor

gshank commented Feb 3, 2023

Note: we need to know the project name for the files passed in, because we parse all projects including dependencies, not just the base project.

@gshank
Contributor

gshank commented Feb 3, 2023

I guess it would be possible to infer the project name from the complete path from the root of the project, i.e. from the directory name underneath the "dbt_packages" directory.
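That inference could look something like the sketch below: files under dbt_packages/<name>/... belong to <name>, and everything else to the root project (the function is illustrative, not dbt-core's actual code):

```python
from pathlib import PurePosixPath


def infer_project_name(path: str, root_project: str) -> str:
    """Infer which project a file belongs to from its path relative to the
    project root, using the directory name under dbt_packages."""
    parts = PurePosixPath(path).parts
    if len(parts) >= 2 and parts[0] == "dbt_packages":
        return parts[1]
    return root_project
```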

@jtcohen6
Contributor Author

jtcohen6 commented Feb 4, 2023

@gshank This is because we use the project name + relative path to construct the file_id, right?

In addition to the work we've scoped in this issue, we might think about a slim subset of dbt-core functionality that, given resolved config for the root-level project and dependencies, can:

  • "discover" the file system, and detect which source files dbt needs
  • (possibly) for those relevant source files, return a mapping of file_id → raw contents

That would allow us to establish a clear interface between:

  1. Gathering project contents (file system → data)
  2. Receiving project contents (data, not files) → parsing → manifest

It should be the responsibility of another application to detect source file changes (e.g. using git), and to identify the overlap between (a) changed files & (b) actual dbt source files. Ultimately, the sequence for updating project state could look like:

  • (not dbt-core) detect changed files
  • (dbt-core) discover which files are relevant to dbt
  • (dbt-core) assemble "diff" (data structure)
  • (not dbt-core) send "diff" (data)
  • (not dbt-core) receive "diff" (data), access previous manifest
  • (dbt-core) given "diff" + previous manifest, parse + return new manifest

For each of the "dbt-core" line items above, we would want a standalone library function with clear inputs & outputs, expecting that each would be called & coordinated by another application/service.
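The two "dbt-core" steps on the project-contents side could be sketched as standalone functions with plain data in and out; none of these names exist in dbt-core, they only illustrate the proposed interface:

```python
from typing import Callable


def discover_relevant_files(changed: list, source_paths: list) -> list:
    """(dbt-core) Keep only changed files that live under a configured
    source path (model_paths, seed_paths, etc.)."""
    return [
        f for f in changed
        if any(f == p or f.startswith(p + "/") for p in source_paths)
    ]


def assemble_diff(relevant: list, read: Callable) -> dict:
    """(dbt-core) Package the relevant files + contents as a diff payload,
    ready to be sent to wherever the previous manifest lives."""
    return {"changed": [{"path": f, "content": read(f)} for f in relevant]}
```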

@jtcohen6 jtcohen6 added the python_api Issues related to dbtRunner Python entry point label Feb 6, 2023
@gshank
Contributor

gshank commented Feb 6, 2023

Right now I'm inferring the parsing type (models/seeds/macros etc.) from the path of the file, so it would be somewhat consistent to infer the package too. Whether that's the right way to do it is certainly an open question. The alternative would be to separate out the information necessary to determine how a particular file slots into a dbt project, which is information that might in the end be very useful for other reasons. That would include all of the various "paths" from the project (model_paths, seed_paths, etc.), plus presumably information about where additional projects are stored.
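The path-based inference described above could be sketched as a lookup against the project's configured paths (illustrative of the idea, not dbt-core's actual implementation):

```python
from typing import Optional


def infer_parse_type(path: str, project_paths: dict) -> Optional[str]:
    """Map a file path to a parser category using the project's configured
    paths (model_paths, seed_paths, macro_paths, ...). Returns None when
    the file doesn't fall under any configured path."""
    for parse_type, dirs in project_paths.items():
        if any(path == d or path.startswith(d + "/") for d in dirs):
            return parse_type
    return None
```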

@jtcohen6
Contributor Author

jtcohen6 commented Feb 6, 2023

The alternative would be to separate out the information necessary to determine how a particular file slots into a dbt project, which is information that might in the end be very useful for other reasons. That would include all of the various "paths" from the project (model_paths, seed_paths, etc). Plus presumably information about where additional project are stored.

I agree, this could be useful for many reasons. Another one could be (eventually) the ability to slim down the package contents installed by deps; today it's always a full git repo, including a lot of extraneous non-dbt files.

I think this would require us to formally split out Project config, as a thing that can be resolved first, and then use that as the input to "file discovery." Thanks to API-ification, we already have a clearer demarcation for this: @requires.project returns the Project config, and is then used as an input to @requires.manifest (requires.py).

Here's what I'm envisioning:

  1. First, resolve Project + Profile from dbt_project.yml + profiles.yml — either by running dbt-core directly inside the file system, or by sending raw file contents to wherever dbt-core is running, and sending back the resolved configs as data
  2. Use those resolved configs to "discover" the file system, and create FileDiff (= data), including information about resource types and project_name.
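The two steps above can be sketched end to end with stand-in data structures (ResolvedProject and both function names are illustrative assumptions, not dbt-core's types):

```python
from dataclasses import dataclass


@dataclass
class ResolvedProject:
    """Minimal stand-in for a resolved Project config."""
    name: str
    model_paths: list


def resolve_project(dbt_project: dict) -> ResolvedProject:
    """Step 1: resolve Project config from dbt_project.yml contents."""
    return ResolvedProject(
        name=dbt_project["name"],
        model_paths=dbt_project.get("model-paths", ["models"]),
    )


def build_file_diff(project: ResolvedProject, files: dict) -> dict:
    """Step 2: use the resolved config to package discovered files as a
    FileDiff-style payload tagged with the project name."""
    return {
        "project_name": project.name,
        "added": [
            {"path": p, "content": c}
            for p, c in sorted(files.items())
            if any(p.startswith(d + "/") for d in project.model_paths)
        ],
    }
```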

cc @ChenyuLInx @MichelleArk - this is all good stuff to talk about more with our friends

@jtcohen6 jtcohen6 added the file_system How dbt-core interoperates with file systems to read/write data label Feb 10, 2023