Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proto: revert UTF-8 validation for proto2 #628

Merged
merged 1 commit into from
Jun 6, 2018
Merged

Conversation

dsnet
Copy link
Member

@dsnet dsnet commented Jun 5, 2018

The proto specification officially says that proto2 and proto3 strings should be
validated, but pragmatically, compliance with the spec has been poor.
For example, the Go implementation did not validate either and added strict
validation recently to be compliant. However, this caused signficant breakage.

Cases of breakage should change the proto field type from string to the bytes type.
However, this is not always possible, when the field is part of the exposed API.
This tends to be the case for proto2, where some other notable language
implementations (like C++) do not validate proto2 for valid UTF-8.
However, since most language implementations do validate for UTF-8 in proto3,
we keep that behavior.

Making this change for Go is a little tricky since each field does not necessarily
know whether it is operating under the proto2 or proto3 syntax. Thus, we modify
the generator to emit a "proto3" struct field tag for all fields in proto3.
The implications of this change is that people will need to regenerate their
proto files to have UTF-8 validation.

We expand UTF-8 validation tests to ensure this works for the cross-product of
(proto2, proto3) and (scalar, vector, oneof, and maps) fields with strings.

Fixes #622

@dsnet dsnet requested a review from neild June 5, 2018 23:42
@dsnet dsnet force-pushed the invalid-utf8-error2 branch 2 times, most recently from 3c62020 to 582a39e Compare June 5, 2018 23:50
Copy link
Contributor

@neild neild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we export errInvalidUTF8?

@@ -1573,7 +1573,9 @@ func (g *Generator) goTag(message *Descriptor, field *descriptor.FieldDescriptor
if *field.Type == descriptor.FieldDescriptorProto_TYPE_BYTES {
name += ",proto3"
}

if *field.Type == descriptor.FieldDescriptorProto_TYPE_STRING {
name += ",utf8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the "proto3" tag instead? We may discover other cases in the future where proto2/proto3 fields need different handling, and it'll be more efficient to have a single tag in that case.

If so, should we preemptively tag all proto3 fields, even if they currently have no semantic difference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proto3 tag was already added to distinguish the semantic difference for []byte{} vs []byte(nil) when it comes to the bytes type.

Looking through the code history, the attribute was only generated on the bytes type out of fear for binary size increase. If we want to go with proto3, should we just generate it for all fields? That's an increase of 7 bytes for every field. Alternatively, we can only generate it for the string field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're worried about binary size increase that argues for a proto3 tag rather than utf8, to avoid the possibility of redundant tags in the future.

I don't really have a strong opinion on every field vs. string, although there are a couple arguments for the former: It avoids any possibility of needing to do this again in the future, and I think all the cases where 7 bytes per field might conceivably be an issue are proto2 anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enabled it unilaterally for all fields.

@dsnet
Copy link
Member Author

dsnet commented Jun 6, 2018

Let's avoid exporting the error for now.

Several thoughts:

  • As it stands the current logic fails-fast when there is invalid UTF-8. In order for a distinguishable error to be useful, they would still need to fully marshal or unmarshal the message.
  • We probably wouldn't want a sentinel error. A useful piece of information is the exact field that had UTF-8 error.
  • If we exported a new type, then we have to think carefully about what happens message has both required fields not set and invalid UTF-8. You would probably want to report that both errors occurred. Here's a case where your error tagging idea would be really useful.

The proto specification officially says that proto2 and proto3 strings should be
validated, but pragmatically, compliance with the spec has been poor.
For example, the Go implementation did not validate either and added strict
validation recently to be compliant. However, this caused signficant breakage.

Cases of breakage should change the proto field type from string to the bytes type.
However, this is not always possible, when the field is part of the exposed API.
This tends to be the case for proto2, where some other notable language
implementations (like C++) do not validate proto2 for valid UTF-8.
However, since most language implementations do validate for UTF-8 in proto3,
we keep that behavior.

Making this change for Go is a little tricky since each field does not necessarily
know whether it is operating under the proto2 or proto3 syntax. Thus, we modify
the generator to emit a "proto3" struct field tag for all fields in proto3.
The implications of this change is that people will need to regenerate their
proto files to have UTF-8 validation.

We expand UTF-8 validation tests to ensure this works for the cross-product of
(proto2, proto3) and (scalar, vector, oneof, and maps) fields with strings.

Fixes #622
@dsnet dsnet merged commit 05f48f4 into master Jun 6, 2018
@dsnet dsnet deleted the invalid-utf8-error2 branch June 6, 2018 20:26
jeanbza added a commit to jeanbza/go-genproto that referenced this pull request Jun 8, 2018
This update uses the latest version of the protoc Go plugin, which
reverted a change in UTF-8 validation for proto2 (and added fields
for proto3). This change removed validation of UTF-8 in proto2
and added a new field in proto3 that signals validation.

Relevant PR: golang/protobuf#628
jeanbza added a commit to googleapis/go-genproto that referenced this pull request Jun 8, 2018
This update uses the latest version of the protoc Go plugin, which
reverted a change in UTF-8 validation for proto2 (and added fields
for proto3). This change removed validation of UTF-8 in proto2
and added a new field in proto3 that signals validation.

Relevant PR: golang/protobuf#628
@sbuss sbuss mentioned this pull request Jul 26, 2018
hsanjuan pushed a commit to gxed/go-genproto-googleapis-rpc that referenced this pull request Jan 30, 2019
This update uses the latest version of the protoc Go plugin, which
reverted a change in UTF-8 validation for proto2 (and added fields
for proto3). This change removed validation of UTF-8 in proto2
and added a new field in proto3 that signals validation.

Relevant PR: golang/protobuf#628
@golang golang locked and limited conversation to collaborators Jun 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proto: revert proto2 string UTF-8 validation
2 participants