CT 1808 diff based partial parsing #6873
Conversation
Great start to discussion! Next step: pull together a group of colleagues to talk much more about this :)
# For now, we assume that all files are in the root_project, until
# we've determined whether project name will be provided or deduced
# from the directory.
Ultimately, I think it would make sense to have a separate previous step where we resolve Project config, for the root project and all installed packages, and use that to determine the project name. For now, it's good to call this out as a known limitation of the proof-of-concept approach.
As it turns out, the only other local files besides the root project are local dependencies, and there is currently no way to connect a local package with its directory; in addition, local dependency "files" are generated from wherever they're copied by deps. This issue has not been resolved, because it's tied up in the question of what to do about deps.
if dfy:
    validate_yaml(source_file.path.original_file_path, dfy)
    source_file.dfy = dfy
# TODO: ensure we have a file object even for empty files, such as schema files
For my understanding - this is defensive code in the case that we've been handed an empty yaml file?
The concern with empty files is whether those providing the file would consider it a new or a changed file when an empty file becomes non-empty. It feels more consistent to me that it would be a changed file, in which case we should change the dbt-core behavior to include empty files in the files dictionary, and handle the emptiness in parsing.
What if we leave the empty part to where we figure out what the file diffs are? Would that lead to simpler code on the parsing side?
I think it is reasonable to think the input to parsing could be a manifest + file diff in the following format
{
  "updated": {
    "file1": "full content of file1",
    "file2": "full content of file2",
    ...
  },
  "added": {
    "file3": "full content of file3",
    "file4": "full content of file4",
    ...
  },
  "deleted": ["file4", "file5"]
}
updated and added could also be merged into one dictionary.
From dbt-core's perspective, an empty file would be the same as that file not existing.
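The diff format described above can be sketched as a small data structure. This is a hypothetical illustration of the proposed payload, not dbt-core's actual interface; the names `FileDiff` and `apply_diff` are invented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the file-diff payload described above.
@dataclass
class FileDiff:
    updated: Dict[str, str] = field(default_factory=dict)  # path -> full content
    added: Dict[str, str] = field(default_factory=dict)    # path -> full content
    deleted: List[str] = field(default_factory=list)       # removed paths

def apply_diff(files: Dict[str, str], diff: FileDiff) -> Dict[str, str]:
    """Apply a diff to an in-memory path -> content mapping."""
    result = dict(files)
    for path in diff.deleted:
        result.pop(path, None)
    result.update(diff.added)
    result.update(diff.updated)
    return result

files = {"file1": "old", "file5": "gone"}
diff = FileDiff(updated={"file1": "new"}, added={"file3": "fresh"}, deleted=["file5"])
print(apply_diff(files, diff))  # {'file1': 'new', 'file3': 'fresh'}
```

Merging `updated` and `added` would collapse the two `result.update(...)` calls into one, at the cost of no longer being able to distinguish the two cases downstream.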
This is the format that we're using now -- are you suggesting something different?
The empty file thing only applies to schema files, and we can't depend on it being simply an empty string -- it might have comments or whitespace, etc. I've changed it to not store a schema file if it doesn't reduce to a real dictionary.
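The "doesn't reduce to a real dictionary" check can be illustrated with a crude sketch. This simplified version only skips blank and comment lines; the real check would parse the yaml and test whether the result is a dictionary.

```python
def reduces_to_content(text: str) -> bool:
    """Crude sketch: a schema file is 'empty' if every line is blank or a
    comment. The actual check parses the yaml and tests whether the parsed
    result is a dict, which also handles e.g. a bare scalar."""
    return any(
        line.strip() and not line.strip().startswith("#")
        for line in text.splitlines()
    )

# Whitespace- or comment-only files count as empty, not as schema content.
assert not reduces_to_content("\n  # just a comment\n\n")
assert reduces_to_content("version: 2\nmodels: []\n")
```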
I think I missed the fact that this is just for schema files. And the comments/whitespace are a good callout!
project_root=self.all_projects[project_name].project_root,
)

# Now use the extension and "searched_path" to determine which file_type
Good callout here as well. The file extension feels like a perfectly reasonable way of determining file type, but it does start feeling a bit odd in a world without files. We might want the option of passing extension/language as an explicit argument.
A minor nit: currently we do a checksum on the content of the file, then do a "strip" and store it. To keep the checksums from changing, we should either do the checksum after the strip or not do the strip at all. My inclination is to do the checksum after the strip. If we start storing empty files, that would be slightly easier.
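The "checksum after the strip" ordering can be sketched as follows. This is illustrative only: dbt's `FileHash.from_contents` has its own implementation, and plain sha256 stands in for it here. The point is that the stored contents can always reproduce the stored checksum.

```python
import hashlib

def checksum_after_strip(file_contents: str) -> "tuple[str, str]":
    """Strip first, then checksum, so the checksum is reproducible from
    the stored (stripped) contents. Sketch only; dbt's FileHash helper
    is stood in for by plain sha256."""
    stripped = file_contents.strip()
    digest = hashlib.sha256(stripped.encode("utf-8")).hexdigest()
    return stripped, digest

contents, digest = checksum_after_strip("select 1\n\n")
# Re-hashing the stored contents reproduces the stored checksum,
# which checksum-before-strip cannot guarantee.
assert hashlib.sha256(contents.encode("utf-8")).hexdigest() == digest
```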
directories (not done)
I put a little thought into how we would handle non-base project updates, and I think we have some open questions about how this will work with regard to "deps". The "deps" command takes both remote and local dependencies and copies them into the "dbt_packages" directory. Would these files have to be reflected back to the UI server which is providing the file_diff? Or would they be "installed" only on the remote server which is accepting the file diffs? That is, would dependencies be local to the UI server or local to the remote dbt server? If the "deps" command is processing a local dependency, would the UI server need to know that and provide all of those files through a different kind of interface?
There is currently nothing to connect a local dependency directory to the project that it produces, so you can't go from the PackageSpec (the only place the local dependency shows up) to the Project, or from the Project to the local dependency. This would be fairly easy to fix by copying either the package name into the package spec or the source directory into the project. But it does point out that the actual file/node objects in the manifest for a local dependency all point to the files that have been copied into dbt_packages.

It would certainly be possible to parse a local dependency in place instead; that way updates to those files in the remote repo could be synced, and we could even do a single-project deps update when detecting those changes. Whether any of those possibilities are a good idea is an open question.

It would also be possible, once the package spec is connected to the project (or at least the project name), to connect file updates by path to the existing file objects in the dbt_packages directory. So an update to "local_dependency/models/my_model.sql" would actually update a file "local_dep://models/my_model.sql". That would automatically cause those parts of the local dependency to update, but would not capture changes to the dbt_project.yml for the dependency, which would require loading and rebuilding the Project for that dependency and, if it has changed, forcing a reparse.
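The path remapping described above can be sketched in a few lines. Everything here is hypothetical: the "local_dep://" scheme follows the example in the comment, and `remap_local_dep_path` is an invented name, not a dbt-core function.

```python
# Hypothetical remapping: an incoming update to a local dependency's
# source path is translated to the corresponding file id under
# dbt_packages. The "local_dep://" scheme is illustrative only.
def remap_local_dep_path(path: str, dep_dir: str, package_name: str) -> str:
    prefix = dep_dir.rstrip("/") + "/"
    if path.startswith(prefix):
        return f"{package_name}://{path[len(prefix):]}"
    return path  # not under the local dependency; leave unchanged

print(remap_local_dep_path("local_dependency/models/my_model.sql",
                           "local_dependency", "local_dep"))
# local_dep://models/my_model.sql
```

As the comment notes, this only covers files under the dependency's source tree; a change to that dependency's dbt_project.yml would still require rebuilding its Project.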
@@ -73,7 +73,6 @@ def __init__(self, saved_manifest: Manifest, new_files: MutableMapping[str, AnyS
    self.project_parser_files: Dict = {}
    self.saved_files = self.saved_manifest.files
    self.project_parser_files = {}
    self.deleted_manifest = Manifest()
I removed the deleted_manifest because it's not necessary. Originally I thought it might be helpful for diagnosing, but it wasn't.
file_contents = load_file_contents(path.absolute_path, strip=False)
source_file.checksum = FileHash.from_contents(file_contents)
source_file.contents = file_contents.strip()
# We strip the file_contents before generating the checksum because we want
This is a change in behavior. It felt really wrong that we couldn't actually generate the checksum from the stored file contents, since it was done on the file contents prior to the contents being stripped and stored.
The alternative would have been to store the unstripped version of the contents.
Makes sense to me.
Although we aren't moving forward with more complete plans for file diff parsing, this pull request contains some refactoring which we don't want to lose, and it can serve as a base for using file_diff parsing in our tests, as a possible performance improvement and additional proof of concept.
Generally looks good and makes sense as far as I can tell, but I think we should clean up the unused event.
def message(self) -> str:
    return f"Error retrieving modification time for file {self.path}"
# Skipped Z004
It doesn't look like you removed this event from types.proto.
pfft. Good catch. Have removed.
dbt_ignore_spec,
)
@dataclass
class ReadFilesFromFileSystem:
Just generally, more typing would have helped me understand this file more quickly. That said, I think the best way to capture shared behavior between classes like ReadFilesFromFileSystem and ReadFilesFromDiff is by defining a Protocol. But Protocols were introduced in Python 3.8, so we have a few more months without them.
We do use protocols in a few places already, so I think there are compatibility shims for versions before 3.8 (typing_extensions backports Protocol).
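A minimal sketch of the Protocol idea discussed above, with illustrative method and attribute names (the real ReadFilesFromFileSystem/ReadFilesFromDiff classes may expose a different interface):

```python
from typing import Any, Dict, Protocol  # Protocol: Python 3.8+, or typing_extensions

# Hypothetical shared interface for the two file readers; any class with a
# matching read_files() and files attribute satisfies it structurally,
# without inheriting from a common base class.
class ReadFiles(Protocol):
    files: Dict[str, Any]

    def read_files(self) -> None:
        ...

def load(reader: ReadFiles) -> Dict[str, Any]:
    """Code like the manifest loader can depend only on the Protocol."""
    reader.read_files()
    return reader.files

# An in-memory stand-in for ReadFilesFromDiff; note it never mentions ReadFiles.
class ReadFilesFromMemory:
    def __init__(self, source: Dict[str, Any]) -> None:
        self.source = source
        self.files: Dict[str, Any] = {}

    def read_files(self) -> None:
        self.files = dict(self.source)

assert load(ReadFilesFromMemory({"a": 1})) == {"a": 1}
```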
* ct-2198: clean up some type names and uses
* CT-2198: Unify constraints and constraints_check properties on columns
* Make mypy version consistently 0.981 (#7134)
* CT 1808 diff based partial parsing (#6873)
* model contracts on models materialized as views (#7120)
* first pass
* rename tests
* fix failing test
* changelog
* fix functional test
* Update core/dbt/parser/base.py
* Update core/dbt/parser/schemas.py
* Create method for env var deprecation (#7086)
* update to allow adapters to change model name resolution in py models (#7115)
* update to allow adapters to change model name resolution in py models
* add changie
* fix newline adds
* move quoting into macro
* use single quotes
* add env DBT_PROJECT_DIR support #6078 (#6659) Co-authored-by: Jeremy Cohen <[email protected]>
* Add new index.html and changelog yaml files from dbt-docs (#7141)
* Make version configs optional (#7060)
* [CT-1584] New top level commands: interactive compile (#7008) Co-authored-by: Github Build Bot <[email protected]>
* CT-2198: Add changelog entry
* CT-2198: Fix tests which broke after merge
* CT-2198: Add explicit validation of constraint types w/ unit test
* CT-2198: Move access property, per code review
* CT-2198: Remove a redundant macro
* CT-1298: Rework constraints to be adapter-generated in Python code
* CT-2198: Clarify function name per review

---------

Co-authored-by: Gerda Shank <[email protected]>
Co-authored-by: Emily Rockman <[email protected]>
Co-authored-by: Stu Kilgore <[email protected]>
Co-authored-by: colin-rogers-dbt <[email protected]>
Co-authored-by: Leo Schick <[email protected]>
Co-authored-by: Jeremy Cohen <[email protected]>
Co-authored-by: FishtownBuildBot <[email protected]>
Co-authored-by: dave-connors-3 <[email protected]>
Co-authored-by: Kshitij Aranke <[email protected]>
Co-authored-by: Github Build Bot <[email protected]>
resolves #6592
Description
This is at the proof of concept stage, for discussion purposes. There are a number of edge cases and questions to resolve before being ready to officially review.
Checklist
- changie new to create a changelog entry