[CT-1808] [Spike] "diff"-based entry point into partial parsing #6592

Closed
jtcohen6 opened this issue Jan 12, 2023 · 10 comments · Fixed by #6873
Labels
file_system (How dbt-core interoperates with file systems to read/write data), partial_parsing, python_api (Issues related to dbtRunner Python entry point), spike

Comments

@jtcohen6
Contributor

See dbt-labs/dbt-server#128

The gist of current partial parsing, as I understand it:

  1. Perform a number of checks to determine if a full re-parse is needed, or which files need to be re-parsed on account of changes to --vars, env vars, target/profile, etc (build_manifest_state_check)
  2. If partial parsing is possible: inspect the full relevant file system, using system timestamps + file hashes to detect changes to source files
  3. Based on files changed: schedule nodes defined in those files (+ other nodes related to those nodes + other files affecting those nodes) for re-parsing

Rather than requiring dbt-core to inspect the full relevant file system, could we provide some sort of interface for saying, "These files have been added / modified / deleted," and skip straight to step 3? (This assumes that the check in step 1 has already taken place, even if the logic for it needs to live elsewhere.)

The potential data structure for this "diff" could (but doesn't have to) match up closely with the file_diff object that partial parsing constructs + uses today. Three thoughts:

  • The current file_diff distinguishes between "schema files" (.yml) and other files
  • The current file_diff is just a set of file_id pointers, not the actual content of those files — that could be sufficient here too, but the most compelling version of this capability also enables us to pass in file contents programmatically
  • While it would be great to use dbt-core's internal file_id as the unique identifier here (project_name://subdir/path/file.ext), it would require an external process to know the project's name (defined in dbt_project.yml), which might be a faulty assumption.
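To make the file_id format concrete, here is a tiny sketch of constructing and splitting one (function names are illustrative, not dbt-core's API):

```python
def make_file_id(project_name: str, relative_path: str) -> str:
    """Build a file_id in the project_name://subdir/path/file.ext form."""
    return f"{project_name}://{relative_path}"


def split_file_id(file_id: str) -> tuple:
    """Recover (project_name, relative_path) from a file_id."""
    project, _, path = file_id.partition("://")
    return project, path
```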

Imagining something like:

{
  "added": [
    {
        "path": "models/new_model.sql",
        "content": "... lots of jinja + sql ...",
    },
    {
        "path": "models/docs.md",
        "content": "... some markdown ...",
    },
  ],
  "changed": [
    {
        "path": "models/my_py_model.py",
        "content": "... some python ...",
    },
  ],
  "deleted": [
    "models/other.sql",
    "models/other/schema.yml",
  ],
}
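The payload above could be modeled as a small set of dataclasses (the class names here are illustrative stand-ins, not dbt-core's actual types):

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """One added or changed file: a path plus its raw contents."""
    path: str
    content: str


@dataclass
class FileDiff:
    """Added/changed files carry contents; deleted files are just paths."""
    added: list = field(default_factory=list)
    changed: list = field(default_factory=list)
    deleted: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "FileDiff":
        return cls(
            added=[FileEntry(**e) for e in data.get("added", [])],
            changed=[FileEntry(**e) for e in data.get("changed", [])],
            deleted=list(data.get("deleted", [])),
        )
```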

Programmatically, we'd call a parse with pointers to both:

  1. existing manifest — although I think this would need to be partial_parse.msgpack, not just manifest.msgpack, given the need for all the between-file links that power partial parsing
  2. "diff," either in a file or passed in as a data structure from memory
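To illustrate what applying such a diff to in-memory state might look like, here is a sketch that updates a {path: content} mapping (as a stand-in for the file state held by the parsing manifest); the function name and shapes are illustrative only:

```python
def apply_diff(files: dict, diff: dict) -> dict:
    """Apply an added/changed/deleted diff payload to a {path: content}
    mapping, returning a new mapping and leaving the input untouched.
    Illustrative only — not dbt-core's actual entry point."""
    updated = dict(files)
    for entry in diff.get("added", []) + diff.get("changed", []):
        updated[entry["path"]] = entry["content"]
    for path in diff.get("deleted", []):
        updated.pop(path, None)
    return updated
```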
@github-actions github-actions bot changed the title [Spike] "diff"-based entry point into partial parsing [CT-1808] [Spike] "diff"-based entry point into partial parsing Jan 12, 2023
@ChenyuLInx
Contributor

@jtcohen6 I talked to Gerda the other day, and she mentioned that supporting a different entry point/interface (manifest + file updates = new manifest) can be done as a task instead of spiking ahead. I think we can just figure out exactly what the data structure of file updates/diffs is going to be, and start the task without the spike.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 20, 2023

@ChenyuLInx Sounds great!

I snuck this point in at the bottom, but I think it's pretty important, and worth clarifying with other teams:

existing manifest — although I think this would need to be partial_parse.msgpack, not just manifest.msgpack, given the need for all the between-file links that power partial parsing

Based on my current understanding, this would need to be the manifest used by partial parsing (partial_parse.msgpack), which has additional information for resolving inter-file dependencies. We'd need dbt-server to manage that state, in addition to the "execution" manifest. Put another way, the state input to (re-)parsing and the state input to execution might need to be slightly different.

Update: Chatted with @gshank, it sounds like these methods would indeed serialize the full manifest (including parsing/file-related methods) to msgpack. I was thinking about WriteableManifest.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 23, 2023

Another consideration: If there's a need to trigger a full re-parse, e.g. because of a change in dbt_project.yml or --vars passed in on a command, dbt-core will once again need access to the full set of source files.

If we only apply the diffs to the existing "parsing" manifest, but don't actually apply those diffs to a backing set of source files, how would we be able to fall "back" to the full set of source files for a full re-parse? In those cases, would it be acceptable to trigger a full file re-sync?

Update: Chatted with @gshank. We have the full raw file contents in the serialized manifest (and the dictionary version of .yml files). If there's a change that requires a full re-parse (e.g. dbt_project.yml), we'd simply schedule every "file" for re-parsing, but we wouldn't need to actually go back to the source file system.
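The fallback described above can be sketched as a simple scheduling decision: if a change invalidates partial parsing, schedule everything already held in the manifest; otherwise schedule only the changed files. The function name and trigger set below are illustrative assumptions, not dbt-core's actual logic:

```python
# Hypothetical set of files whose change forces a full re-parse.
FULL_REPARSE_TRIGGERS = {"dbt_project.yml", "profiles.yml"}


def files_to_reparse(manifest_files: dict, changed: list) -> list:
    """Decide which files to re-parse. On a full re-parse, every file the
    manifest already holds is scheduled — contents are stored in the
    manifest, so no trip back to the source file system is needed."""
    if FULL_REPARSE_TRIGGERS & set(changed):
        return sorted(manifest_files)
    return sorted(set(changed) & set(manifest_files))
```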

Related to dbt-labs/dbt-server#128 (comment), in that we might still want a way to expose/share the logic about when a full re-parse is necessary.

@jtcohen6
Contributor Author

jtcohen6 commented Jan 26, 2023

Would need some special cases for changes to dbt_project.yml and profiles.yml (or when their values depend on --vars). Let's open a separate issue to track that work, since it would need to be resolved before we get into parsing.

We also don't store seed contents in the manifest today. We could start storing seeds below a certain size limit; we generally document <1 MB as the officially supported size. Or, the service wrapping dbt-core needs to store & sync seed contents by other means.

Scope for this issue: accept a programmatic payload (like the one above) for project file contents (excluding seeds), and plumb it into partial parsing.

New issues:

  • changes to dbt_project.yml + profiles.yml (whether edited explicitly, or because they depend on --vars or env vars that have changed)
  • changes to seeds

@jtcohen6
Contributor Author

Opened #6777 to track known edge cases, which will be out of scope for the "happy path" that we're pursuing in this first round of work

@gshank
Contributor

gshank commented Feb 3, 2023

Note: we need to know the project name for the files passed in, because we parse all projects including dependencies, not just the base project.

@gshank
Contributor

gshank commented Feb 3, 2023

I guess it would be possible to infer the project name from the complete path from the root of the project, i.e. from the directory name underneath the "dbt_packages" directory.
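That inference could look something like the sketch below: files under dbt_packages/<name>/... belong to <name>, and everything else to the root project (the function is illustrative, not dbt-core's actual code):

```python
from pathlib import PurePosixPath


def infer_project_name(path: str, root_project: str) -> str:
    """Infer which project a file belongs to from its path relative to the
    project root, using the directory name under dbt_packages."""
    parts = PurePosixPath(path).parts
    if len(parts) >= 2 and parts[0] == "dbt_packages":
        return parts[1]
    return root_project
```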

@jtcohen6
Contributor Author

jtcohen6 commented Feb 4, 2023

@gshank This is because we use the project name + relative path to construct the file_id, right?

In addition to the work we've scoped in this issue, we might think about a slim subset of dbt-core functionality that, given resolved config for the root-level project and dependencies, can:

  • "discover" the file system, and detect which source files dbt needs
  • (possibly) for those relevant source files, return a mapping of file_id → raw contents

That would allow us to establish a clear interface between:

  1. Gathering project contents (file system → data)
  2. Receiving project contents (data, not files) → parsing → manifest

It should be the responsibility of another application to detect source file changes (e.g. using git), and to identify the overlap between (a) changed files & (b) actual dbt source files. Ultimately, the sequence for updating project state could look like:

  • (not dbt-core) detect changed files
  • (dbt-core) discover which files are relevant to dbt
  • (dbt-core) assemble "diff" (data structure)
  • (not dbt-core) send "diff" (data)
  • (not dbt-core) receive "diff" (data), access previous manifest
  • (dbt-core) given "diff" + previous manifest, parse + return new manifest

For each of the "dbt-core" line items above, we would want a standalone library function with clear inputs & outputs, expecting that each would be called & coordinated by another application/service.
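The two "dbt-core" steps on the project-contents side could be sketched as standalone functions with plain data in and out; none of these names exist in dbt-core, they only illustrate the proposed interface:

```python
from typing import Callable


def discover_relevant_files(changed: list, source_paths: list) -> list:
    """(dbt-core) Keep only changed files that live under a configured
    source path (model_paths, seed_paths, etc.)."""
    return [
        f for f in changed
        if any(f == p or f.startswith(p + "/") for p in source_paths)
    ]


def assemble_diff(relevant: list, read: Callable) -> dict:
    """(dbt-core) Package the relevant files + contents as a diff payload,
    ready to be sent to wherever the previous manifest lives."""
    return {"changed": [{"path": f, "content": read(f)} for f in relevant]}
```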

@jtcohen6 jtcohen6 added the python_api Issues related to dbtRunner Python entry point label Feb 6, 2023
@gshank
Contributor

gshank commented Feb 6, 2023

Right now I'm inferring the parsing type (models/seeds/macros etc.) from the path of the file, so it would be somewhat consistent to infer the package too. Whether that's the right way to do it is certainly an open question. The alternative would be to separate out the information necessary to determine how a particular file slots into a dbt project, which is information that might in the end be very useful for other reasons. That would include all of the various "paths" from the project (model_paths, seed_paths, etc.), plus presumably information about where additional projects are stored.
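The path-based inference described above could be sketched as a lookup against the project's configured paths (illustrative of the idea, not dbt-core's actual implementation):

```python
from typing import Optional


def infer_parse_type(path: str, project_paths: dict) -> Optional[str]:
    """Map a file path to a parser category using the project's configured
    paths (model_paths, seed_paths, macro_paths, ...). Returns None when
    the file doesn't fall under any configured path."""
    for parse_type, dirs in project_paths.items():
        if any(path == d or path.startswith(d + "/") for d in dirs):
            return parse_type
    return None
```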

@jtcohen6
Contributor Author

jtcohen6 commented Feb 6, 2023

The alternative would be to separate out the information necessary to determine how a particular file slots into a dbt project, which is information that might in the end be very useful for other reasons. That would include all of the various "paths" from the project (model_paths, seed_paths, etc). Plus presumably information about where additional project are stored.

I agree, this could be useful for many reasons. Another one could be (eventually) the ability to slim down the package contents installed by deps; today it's always a full git repo, including a lot of extraneous non-dbt files.

I think this would require us to formally split out Project config, as a thing that can be resolved first, and then use that as the input to "file discovery." Thanks to API-ification, we already have a clearer demarcation for this: @requires.project returns the Project config, and is then used as an input to @requires.manifest (requires.py).

Here's what I'm envisioning:

  1. First, resolve Project + Profile from dbt_project.yml + profiles.yml — either by running dbt-core directly inside the file system, or by sending raw file contents to wherever dbt-core is running, and sending back the resolved configs as data
  2. Use those resolved configs to "discover" the file system, and create FileDiff (= data), including information about resource types and project_name.
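The two steps above can be sketched end to end with stand-in data structures (ResolvedProject and both function names are illustrative assumptions, not dbt-core's types):

```python
from dataclasses import dataclass


@dataclass
class ResolvedProject:
    """Minimal stand-in for a resolved Project config."""
    name: str
    model_paths: list


def resolve_project(dbt_project: dict) -> ResolvedProject:
    """Step 1: resolve Project config from dbt_project.yml contents."""
    return ResolvedProject(
        name=dbt_project["name"],
        model_paths=dbt_project.get("model-paths", ["models"]),
    )


def build_file_diff(project: ResolvedProject, files: dict) -> dict:
    """Step 2: use the resolved config to package discovered files as a
    FileDiff-style payload tagged with the project name."""
    return {
        "project_name": project.name,
        "added": [
            {"path": p, "content": c}
            for p, c in sorted(files.items())
            if any(p.startswith(d + "/") for d in project.model_paths)
        ],
    }
```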

cc @ChenyuLInx @MichelleArk - this is all good stuff to talk about more with our friends

@jtcohen6 jtcohen6 added the file_system How dbt-core interoperates with file systems to read/write data label Feb 10, 2023