[CT-1808] [Spike] "diff"-based entry point into partial parsing #6592
Comments
@jtcohen6 I talked to Gerda the other day and she mentioned the task to support a different entry point/interface to make
@ChenyuLInx Sounds great! I snuck this point in at the bottom, but I think it's pretty important, and worth clarifying with other teams:
Update: Chatted with @gshank, it sounds like these methods would indeed serialize the full manifest (including parsing/file-related methods) to msgpack. I was thinking about
Another consideration: If there's a need to trigger a full re-parse, e.g. because of a change in
Update: Chatted with @gshank. We have the full raw file contents in the serialized manifest (and the dictionary version of
Related to dbt-labs/dbt-server#128 (comment), in that we might still want a way to expose/share the logic about when a full re-parse is necessary.
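To make the "when is a full re-parse necessary" logic easier to discuss, here is a minimal sketch of what a shareable check could look like. It is not dbt-core's implementation: `StateCheck`, `full_reparse_reasons`, and the three hashed inputs are hypothetical names, chosen only to mirror the inputs mentioned in this issue (`--vars`, env vars, target/profile).

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, List


def _digest(value: Any) -> str:
    """Stable hash of a JSON-serializable value (keys sorted for determinism)."""
    payload = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StateCheck:
    """Hypothetical snapshot of the inputs that invalidate partial parsing."""
    vars_hash: str
    profile_hash: str
    env_vars_hash: str

    @classmethod
    def from_inputs(cls, cli_vars: dict, profile: dict, env_vars: dict) -> "StateCheck":
        return cls(_digest(cli_vars), _digest(profile), _digest(env_vars))


def full_reparse_reasons(saved: StateCheck, current: StateCheck) -> List[str]:
    """Return which inputs changed; an empty list means partial parsing can proceed."""
    reasons = []
    if saved.vars_hash != current.vars_hash:
        reasons.append("vars_changed")
    if saved.profile_hash != current.profile_hash:
        reasons.append("profile_changed")
    if saved.env_vars_hash != current.env_vars_hash:
        reasons.append("env_vars_changed")
    return reasons
```

A wrapping service could store the `StateCheck` alongside the serialized manifest and call `full_reparse_reasons` before deciding whether to send a diff or request a full parse.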
Would need some special cases for changes to
We also don't store seed contents in the manifest today. We could start storing seeds below a certain size limit. We generally document <1 MB as the officially supported size. Or, the service wrapping
Scope for this issue: accept a programmatic payload (like the one above) for project file contents (excluding seeds), and plumb it into partial parsing.
New issues:
Opened #6777 to track known edge cases, which will be out of scope for the "happy path" that we're pursuing in this first round of work.
Note: we need to know the project name for the files passed in, because we parse all projects including dependencies, not just the base project.
I guess it would be possible to infer the project name from the complete path from the root of the project, i.e. from the directory name underneath the "dbt_packages" directory.
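The inference described above could be sketched roughly as follows. This is only an illustration: `split_project_path` and `make_file_id` are hypothetical helpers, and the sketch assumes that the directory name under `dbt_packages` matches the package's project name, which this thread does not confirm. The resulting identifier follows the `project_name://subdir/path/file.ext` shape mentioned in the issue body.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_project_path(
    path_from_root: str,
    root_project_name: str,
    packages_dir: str = "dbt_packages",
) -> Tuple[str, str]:
    """Return (project_name, path relative to that project).

    Assumption: files under <packages_dir>/<dir>/ belong to a project
    named <dir>; everything else belongs to the root project.
    """
    parts = PurePosixPath(path_from_root).parts
    if len(parts) > 2 and parts[0] == packages_dir:
        return parts[1], str(PurePosixPath(*parts[2:]))
    return root_project_name, str(PurePosixPath(*parts))


def make_file_id(path_from_root: str, root_project_name: str) -> str:
    """Build an identifier shaped like project_name://subdir/path/file.ext."""
    project, relative = split_project_path(path_from_root, root_project_name)
    return f"{project}://{relative}"
```

For example, `make_file_id("dbt_packages/dbt_utils/macros/sql/star.sql", "my_project")` yields an identifier in the `dbt_utils` project, while a path under `models/` stays in `my_project`.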
@gshank This is because we use the project name + relative path to construct the
In addition to the work we've scoped in this issue, we might think about a slim subset of
That would allow us to establish a clear interface between:
It should be the responsibility of another application to detect source file changes (e.g. using
For each of the "dbt-core" line items above, we would want a standalone library function with clear inputs & outputs, expecting that each would be called & coordinated by another application/service.
Right now I'm inferring the parsing type (models/seeds/macros, etc.) from the path of the file, so it would be somewhat consistent to infer the package the same way. Whether that's the right way to do it is certainly an open question. The alternative would be to separate out the information necessary to determine how a particular file slots into a dbt project, which might in the end be very useful for other reasons. That would include all of the various "paths" from the project (model_paths, seed_paths, etc.), plus presumably information about where additional projects are stored.
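As a rough illustration of the alternative, the sketch below classifies a file using an explicit mapping of project paths rather than hard-coded directory names. `classify_file` and `EXAMPLE_PATHS` are hypothetical; the mapping stands in for whatever subset of the project's path settings (model_paths, seed_paths, etc.) would be passed along, and treating `.yml`/`.yaml` files as a separate "schema" type is an assumption borrowed from the schema-file distinction described in the issue body.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional

# Example values only; a real project would supply its own configured paths.
EXAMPLE_PATHS: Dict[str, List[str]] = {
    "model": ["models"],
    "seed": ["seeds"],
    "macro": ["macros"],
}

SCHEMA_SUFFIXES = (".yml", ".yaml")


def classify_file(relative_path: str, paths: Dict[str, List[str]]) -> Optional[str]:
    """Return the parse type for a project-relative path, or None if unmatched."""
    path = PurePosixPath(relative_path)
    for parse_type, roots in paths.items():
        for root in roots:
            root_parts = PurePosixPath(root).parts
            # The file must sit strictly below one of the configured roots.
            is_under_root = (
                len(path.parts) > len(root_parts)
                and path.parts[: len(root_parts)] == root_parts
            )
            if is_under_root:
                if path.suffix in SCHEMA_SUFFIXES:
                    return "schema"
                return parse_type
    return None
```

Keeping the path configuration as an input means the same function could be called by a service that has never read `dbt_project.yml` itself, provided something upstream supplies the mapping.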
I agree, this could be useful for many reasons. Another one could be (eventually) the ability to slim down the package contents installed by
I think this would require us to formally split out
Here's what I'm envisioning:
cc @ChenyuLInx @MichelleArk - this is all good stuff to talk about more with our friends
See dbt-labs/dbt-server#128
The gist of current partial parsing, as I understand it:
- `--vars`, env vars, target/profile, etc. (`build_manifest_state_check`)

Rather than requiring `dbt-core` to inspect the full relevant file system, could we provide some sort of interface for saying, "These files have been added / modified / deleted," and skip straight to step 3? (This assumes that the check in step 1 has already taken place, even if the logic for it needs to live elsewhere.)

The potential data structure for this "diff" could (but doesn't have to) match up closely with the `file_diff` object that partial parsing constructs + uses today. Three thoughts:
- `file_diff` distinguishes between "schema files" (`.yml`) and other files
- `file_diff` is just a set of `file_id` pointers, not the actual content of those files. That could be sufficient here too, but the most compelling version of this capability also enables us to pass in file contents programmatically
- If we use `dbt-core`'s internal `file_id` as the unique identifier here (`project_name://subdir/path/file.ext`), it would require an external process to know the project's `name` (defined in `dbt_project.yml`), which might be a faulty assumption.

Imagining something like:

Programmatically, we'd call a `parse` with pointers to both:
- `partial_parse.msgpack`, not just `manifest.msgpack`, given the need for all the between-file links that power partial parsing