Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: get rid of the path for fully qualified names. #912

Conversation

eliax1996
Copy link
Contributor

@eliax1996 eliax1996 commented Jul 3, 2024

This is not complete normalization feature but rather more a fix for the fully qualified paths vs simple name references. This fix not manage the various way a type can be expressed with a partial path.

Long explanation below:

If we accept the specifications from buf as correct, we can look at how a type reference is defined here.

The main problem with the previous implementation is that the current parser can't tell if a fully-qualified type reference in dot notation is the same as one identified by name alone aka simple name notation (fully qualified reference).

The fix do not consider all the different ways users can define a relative reference, schemas
with different way of expressing a relative reference even if normalized,
for now will keep being considered different.

Right now, our logic removes the .package_name (+ the Message scope) part
before comparing field modifications (in fact it re-writes the schema using the simple name notation).

Even though the TypeTree (trie data structure) could help resolve relative references, we don't want to add this feature in the python implementation now because it might cause new bugs due to the non-trivial behaviour of the protoc compiler.

We plan to properly normalize the protobuf schemas later.

We'll use protobuf descriptors (after the compilation and linking step) to gather type references already resolved, and we will threaten all the protobuf using always the fully qualified names.
Properly handling all path variations means reimplementing the protoc compiler behavior, we prefer relying on the already processed proto descriptor.

So, for now, we'll only implement a normalization for the fully-qualified references in dot notation and by simple name alone.

NB: This is not changing the semantics of the message since the local scope its always the one
with max priority, so if you get rid of the fully-qualified reference protoc will resolve
the reference with the one specified in the package scope.

About this change - What it does

References: #xxxxx

Why this way

@eliax1996 eliax1996 force-pushed the eliax1996/add-strip-full-path-names-and-normalization-of-relative-names branch 3 times, most recently from 3e7ed83 to 5c87e1c Compare July 3, 2024 10:24
Comment on lines 98 to 100
field_type_normalized = type_field.element_type.removeprefix(
"."
).removeprefix( # getting rid of the optional `.` before the package name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the fully qualified references always start with a ., like for the example of: .google.protobuf.Timestamp that its referring to the package google.protobuf.
In this case normalizing it by removing the . can lead to shadowing if the import of the google message its after another import that defines a scope that contains google, like for e.g.:

package my.awesome.tricky.company.google.protobuf;

message Timestamp {
   // definition...
}

Even if this is possible its unlikely to happen, we may even reduce that removal of the . just for types that do not have other anchestors in type type trie

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented it by the type tree and implemented a test

@eliax1996 eliax1996 force-pushed the eliax1996/add-strip-full-path-names-and-normalization-of-relative-names branch 3 times, most recently from a469d30 to 0faa236 Compare July 4, 2024 12:13
@eliax1996 eliax1996 marked this pull request as ready for review July 4, 2024 12:13
@eliax1996 eliax1996 requested review from a team as code owners July 4, 2024 12:13
Copy link

github-actions bot commented Jul 5, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  karapace
  utils.py
  karapace/protobuf
  field_element.py
  proto_file_element.py
  proto_normalizations.py 211, 222
  schema.py
  type_tree.py 66, 73
Project Total  

This report was generated by python-coverage-comment-action

This is not complete normalization feature but rather more a fix for the fully qualified paths vs simple name references. This fix not manage the various way a type can be expressed with a partial path.

Long explanation below:

If we accept the specifications from buf as correct, we can look at how a type reference is defined [here](https://protobuf.com/docs/language-spec#type-references).

The main problem with the previous implementation is that the current parser can't tell if a fully-qualified type reference in dot notation is the same as one identified by name alone aka simple name notation ([fully qualified reference](https://protobuf.com/docs/language-spec#fully-qualified-references)).

The fix do not consider all the different ways users can define a relative reference, schemas
 with different way of expressing a [relative reference](https://protobuf.com/docs/language-spec#relative-references) even if normalized,
 for now will keep being considered different.

Right now, our logic removes the `.package_name` (+ the Message scope) part
 before comparing field modifications (in fact it re-writes the schema using the simple name notation).

Even though the TypeTree (trie data structure) could help resolve [relative references](https://protobuf.com/docs/language-spec#relative-references), we don't want to add this feature in the python implementation now because it might cause new bugs due to the non-trivial behaviour of the protoc compiler.

We plan to properly normalize the protobuf schemas later.

We'll use protobuf descriptors (after the compilation and linking step) to gather type references already resolved, and we will threaten all the protobuf using always the fully qualified names.
Properly handling all path variations means reimplementing the protoc compiler behavior, we prefer relying on the already processed proto descriptor.

So, for now, we'll only implement a normalization for the fully-qualified references in dot notation and by simple name alone.

**NB:** This is not changing the semantics of the message since the local scope its always the one
 with max priority, so if you get rid of the fully-qualified reference protoc will resolve
 the reference with the one specified in the package scope.
@eliax1996 eliax1996 force-pushed the eliax1996/add-strip-full-path-names-and-normalization-of-relative-names branch from 0faa236 to d229909 Compare July 5, 2024 14:50
Copy link
Contributor

@keejon keejon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@keejon keejon merged commit e5a6e41 into main Jul 15, 2024
9 checks passed
@keejon keejon deleted the eliax1996/add-strip-full-path-names-and-normalization-of-relative-names branch July 15, 2024 06:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants