Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

codec: implement protobuf unknown fields checker #6557

Merged
merged 17 commits into from
Jul 29, 2020
Merged

codec: implement protobuf unknown fields checker #6557

merged 17 commits into from
Jul 29, 2020

Conversation

odeke-em
Copy link
Collaborator

@odeke-em odeke-em commented Jul 1, 2020

Adds a pass that if given a protobuf serialized byte sequence and a proto.Message that
would have been unmarshaled, the pass will traverse the proto.Message and byte sequence
and report mismatches and extraneous fields that were included in the byte sequence.

For performance numbers, I've included benchmarks that show that the pass is dirt cheap
to use even in the path of invoking proto.Unmarshal (which doesn't report extraneous fields)

$ benchstat proto.txt check.txt 
name           old time/op    new time/op    delta
Do_serial-8       253ns ± 2%     179ns ± 1%  -29.28%  (p=0.000 n=10+10)
Do_parallel-8     158ns ± 4%     145ns ± 3%   -7.80%  (p=0.000 n=10+10)

name           old speed      new speed      delta
Do_serial-8     371MB/s ± 2%   525MB/s ± 1%  +41.37%  (p=0.000 n=10+10)
Do_parallel-8   596MB/s ± 4%   647MB/s ± 3%   +8.56%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
Do_serial-8        112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)
Do_parallel-8      112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Do_serial-8        4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
Do_parallel-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)

ref: #6192

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (see CONTRIBUTING.md)
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the module structure standards.
  • Wrote unit and integration tests
  • Updated relevant documentation (docs/) or specification (x/<module>/spec/)
  • Added relevant godoc comments.
  • Added a relevant changelog entry to the Unreleased section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer
  • Review Codecov Report in the comment section below once CI passes

@codecov
Copy link

codecov bot commented Jul 1, 2020

Codecov Report

Merging #6557 into master will increase coverage by 0.05%.
The diff coverage is 73.61%.

@@            Coverage Diff             @@
##           master    #6557      +/-   ##
==========================================
+ Coverage   61.29%   61.35%   +0.05%     
==========================================
  Files         509      510       +1     
  Lines       31625    31769     +144     
==========================================
+ Hits        19386    19492     +106     
- Misses      10725    10750      +25     
- Partials     1514     1527      +13     

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start @odeke-em! Thank you.

I left several comments. At a high-level:

  • we don't need to support golang proto, just gogo proto
  • we need to handle nested message including nesting via Any and oneof - it seems like you may have the Any case sufficiently covered, but it needs to be tested to make sure

codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields_test.go Outdated Show resolved Hide resolved
codec/enforceproto/testdata/svc.pb.go Outdated Show resolved Hide resolved
codec/enforceproto/testdata/svc.proto Outdated Show resolved Hide resolved
codec/enforceproto/testdata/svc.proto Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields_test.go Outdated Show resolved Hide resolved
@aaronc
Copy link
Member

aaronc commented Jul 1, 2020

@odeke-em FYI currently make proto-gen for codec/testdata is broken but I'm fixing it in #6568.

@odeke-em
Copy link
Collaborator Author

odeke-em commented Jul 2, 2020

Thank you for round 1 of reviews @aaronc, I've addressed your feedback and added an extra check to ensure that even if messages have the same field numbers, if their wireTypes don't match, we should trip out.

codec/testdata/proto.proto Outdated Show resolved Hide resolved
Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like good progress @odeke-em. Left some more comments.

I gave some more guidance on how to check for nested fields. Basically we need to have data in nested structs that fails validation which means that CheckMismatchedProtoFields needs to recusively parse messages. Let me know if it would be helpful to chat through how to approach that.

@odeke-em
Copy link
Collaborator Author

odeke-em commented Jul 3, 2020

I gave some more guidance on how to check for nested fields. Basically we need to have data in nested structs that fails validation which means that CheckMismatchedProtoFields needs to recusively parse messages. Let me know if it would be helpful to chat through how to approach that.

Thanks for that, my mind was coming from a different direction just basing off proto definitions getting written but I've crafted up nested structs and am working on that. Updates coming up.

Copy link
Member

@aaronc aaronc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are some things that still need to be covered here:

  • Nesting is insufficiently covered. The one test for nesting just checks one field's wire type, but our primary aim is not to detect wire type changes (which are covered by the core Unmarshal pass) but to test for unknown fields - both critical ones and non-critical ones. Also as far as I can tell, there is no coverage of nesting via:
    • Any
    • oneofs
    • repeated fields that nest simple messages, oneofs and Anys
  • It seems like that are some potential nesting complex cases that need to be handled via reflection that aren't which include pointers, arrays of structs and pointers, oneof's, arrays of oneof's, Any, etc. The TODO` that is in the code I believe is actually larger than it seems... I wonder if we should maybe consider a schema based approach as an alternative to reflection to handle all these cases. Obviously I understand changing tactics could be frustrated as I know you've spent a lot of work on this approach. But I do think it's worth having a conversation about how to handle all the additional cases and the best way to approach that.

Maybe let's see if we can find a time to chat live?

codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/extraneous_fields.go Outdated Show resolved Hide resolved
codec/enforceproto/unitest_helpers_test.go Outdated Show resolved Hide resolved
@odeke-em
Copy link
Collaborator Author

odeke-em commented Jul 8, 2020

Thanks for the round of feedback @aaronc I am going to work on in about 1.5hr :)

@aaronc
Copy link
Member

aaronc commented Jul 8, 2020

So before you move forward can we find a time to sync? Ping me on discord or email. I've thought through it a bit more and unfortunately, I'm pretty sure the reflection approach won't actually work to cover all the cases. I can explain more but I think it would be easier if we sync live.

codec/testdata/proto.proto Outdated Show resolved Hide resolved
@odeke-em
Copy link
Collaborator Author

odeke-em commented Jul 14, 2020 via email

@aaronc aaronc mentioned this pull request Jul 15, 2020
27 tasks
@odeke-em
Copy link
Collaborator Author

odeke-em commented Jul 15, 2020 via email

@aaronc
Copy link
Member

aaronc commented Jul 15, 2020

If you take a look at the generated proto, the wrappers aren’t gogoproto.Message nor are there FieldDescriptorProto. The only way to walk through their definitions is to use the OneofIndex, to get the definition and then walk that struct, we have to us reflection at this point. I confirmed that this is also what gogoproto does during Unmarshal.

Using FieldDescriptor I don't actually think you need to worry about oneof's as a special case at all. You can just use FieldDescriptor.number and FieldDescriptor.type. If FieldDescriptor.type is TYPE_MESSAGE then you look at FieldDescriptor.type_name.

You should be able to retrieve the Descriptor for that type_name either by

Basically my read on this is that there really isn't much reason to pull in the go field types at all to do this. We just need the proto.Messages in order to pull off the Descriptors but then everything else is in those descriptors.

@odeke-em
Copy link
Collaborator Author

@aaronc you were right, thank you for the advocacy and calling out the fact that FieldDescriptorProto always gives us the TypeName and we can use that to retrieve the found messageTypes from gogoproto.MessageType.

@odeke-em
Copy link
Collaborator Author

For updated performance numbers:

$ benchstat proto_unmarshal.txt check_mismatchedfields.txt 
name           old time/op    new time/op    delta
Do_serial-8       263ns ± 3%     269ns ± 3%   +2.32%  (p=0.010 n=10+10)
Do_parallel-8     147ns ± 3%     153ns ± 7%   +4.02%  (p=0.007 n=10+10)

name           old speed      new speed      delta
Do_serial-8     358MB/s ± 4%   349MB/s ± 3%   -2.38%  (p=0.009 n=10+10)
Do_parallel-8   642MB/s ± 3%   616MB/s ± 7%   -4.03%  (p=0.007 n=10+10)

name           old alloc/op   new alloc/op   delta
Do_serial-8        112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)
Do_parallel-8      112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Do_serial-8        4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
Do_parallel-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)

We have:

  • Less allocations per operation (allocs/op)
  • Less allocations overall (# allocs/op)
  • Similar speed as gogoproto.Unmarshal (ns/op)
  • Similar throughput for gogoproto.Unmarshal (B/s)

@odeke-em
Copy link
Collaborator Author

I just did an algorithmic analysis of the program and noticed a path that is O(n) and sped that up by using a dense and constant time look up in fd2e329, and now this pass is faster in every dimension than invoking proto.Unmarshal, which was the behavior that I had postulated I could achieve even before this project began

$ benchstat proto.txt check.txt 
name           old time/op    new time/op    delta
Do_serial-8       253ns ± 2%     179ns ± 1%  -29.28%  (p=0.000 n=10+10)
Do_parallel-8     158ns ± 4%     145ns ± 3%   -7.80%  (p=0.000 n=10+10)

name           old speed      new speed      delta
Do_serial-8     371MB/s ± 2%   525MB/s ± 1%  +41.37%  (p=0.000 n=10+10)
Do_parallel-8   596MB/s ± 4%   647MB/s ± 3%   +8.56%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
Do_serial-8        112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)
Do_parallel-8      112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Do_serial-8        4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
Do_parallel-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)

@odeke-em odeke-em requested a review from aaronc July 22, 2020 09:24
… of O(n)

Noticed from reading through the code and algorithmically analyzing
the costs per switch when checking if a wireType matches a descriptor
type. The appropriate change is by noticing that wireTypes are very
small i.e. 0 to 5 but also we can directly know which descType matches
up, thus mapping a dense slice indexed by protowire.Type, and to
a map of descriptorField_Type to boolean, hence a constant running
time and this change now makes this pass faster than doing
proto.Unmarshal in every dimension.

* After:
```shell
$ benchstat proto.txt check.txt
name           old time/op    new time/op    delta
Do_serial-8       253ns ± 2%     179ns ± 1%  -29.28%  (p=0.000 n=10+10)
Do_parallel-8     158ns ± 4%     145ns ± 3%   -7.80%  (p=0.000 n=10+10)

name           old speed      new speed      delta
Do_serial-8     371MB/s ± 2%   525MB/s ± 1%  +41.37%  (p=0.000 n=10+10)
Do_parallel-8   596MB/s ± 4%   647MB/s ± 3%   +8.56%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
Do_serial-8        112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)
Do_parallel-8      112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Do_serial-8        4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
Do_parallel-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
```

* Before:
```
$ benchstat proto.txt check.txt
name           old time/op    new time/op    delta
Do_serial-8       263ns ± 3%     269ns ± 3%   +2.32%  (p=0.010 n=10+10)
Do_parallel-8     147ns ± 3%     153ns ± 7%   +4.02%  (p=0.007 n=10+10)

name           old speed      new speed      delta
Do_serial-8     358MB/s ± 4%   349MB/s ± 3%   -2.38%  (p=0.009 n=10+10)
Do_parallel-8   642MB/s ± 3%   616MB/s ± 7%   -4.03%  (p=0.007 n=10+10)

name           old alloc/op   new alloc/op   delta
Do_serial-8        112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)
Do_parallel-8      112B ± 0%       48B ± 0%  -57.14%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Do_serial-8        4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
Do_parallel-8      4.00 ± 0%      2.00 ± 0%  -50.00%  (p=0.000 n=10+10)
```

which is a big speed up in every dimenion by a considerable amount.
…ng indices

Improve the context for the next reader to describe the mapping
between indices and their various elements.
Brings back comments that previously existed when we had the switch.
Also while here, add an equivalence test for walking
slices.
Address @fedekunze's feedback about add docs to the various types.
While here, also added a doc.go file which will be our entry point
when programmers reference this code; also unexported the errors
so that our external API now solely consists of the function:

    CheckMismatchedProtoFields
Addressing points raised by @aaronc

Also added a check for scalarity of TYPE_BYTES which would
be equivalent to pretty much TYPE_STRING, whose scalarity is
special cases when we can't find the appropriate typename.
Addressing @aaronc's feedback on the name bikeshed.
…customizer

For security, by default, we'll report every single unknown field
regardless of whether it is critical or non-critical. However, provide
a bulky option to customize this behavior by providing a Checker type
whose AllowUnknownNonCriticals value can be toggled, but that'll
be opt-in.
* Renmaed CheckMismatchedFields to RejectUnknownFields
* Removed the global default in favor of manually creating a checker
and invoking RejectUnkownFields on a receiver
* Added tests to check for unknown fields in the midst of repeat fields
* Updated docs and some more tests
@odeke-em
Copy link
Collaborator Author

Btw no need to rebase and force push because we squash anyway. You can just merge master in.

Kinda tricky for my fork with the new history being applied to the head of my branch and commits, but this is the last commit anyways given the change from master, so I'll still rebase :)

@odeke-em odeke-em requested a review from anilcse July 29, 2020 00:13
Copy link
Collaborator

@anilcse anilcse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, may be a followup PR for cleaning up tests would be great!

Comment on lines +179 to +181
// google.protobuf.Timestamp i = 10;
// google.protobuf.Timestamp j = 11; // [(gogoproto.stdtime) = true];
Customer1 k = 12 [(gogoproto.embed) = true];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires cleanup?

Suggested change
// google.protobuf.Timestamp i = 10;
// google.protobuf.Timestamp j = 11; // [(gogoproto.stdtime) = true];
Customer1 k = 12 [(gogoproto.embed) = true];
Customer1 k = 10 [(gogoproto.embed) = true];

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we perhaps do this cleanup in a follow-on PR? I ask because in our discussion we wanted some of these fields for test cases but they couldn't work out quite as planned. Important to keep them first so they don't get lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's address in a follow-up PR. I'm adding a note in #6192

@aaronc aaronc mentioned this pull request Jul 29, 2020
3 tasks
@aaronc
Copy link
Member

aaronc commented Jul 29, 2020

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Jul 29, 2020

Command update: failure

Branch update failed
user doesn't have permission to update head repository
err-code: C19E9

@aaronc
Copy link
Member

aaronc commented Jul 29, 2020

@odeke-em please merge master into this branch, we don't seem to have permissions. (Again no need to rebase - just git pull origin master - it'll be much simpler)

@odeke-em
Copy link
Collaborator Author

Done with the merge, please take a look!

@aaronc
Copy link
Member

aaronc commented Jul 29, 2020

@odeke-em you should see an "Update branch" button in the PR, you're going to need to keep hitting it everytime there's a commit until this is merged.

@odeke-em
Copy link
Collaborator Author

@odeke-em you should see an "Update branch" button in the PR, you're going to need to keep hitting it everytime there's a commit until this is merged.

Gotcha, thanks for the heads up!

@aaronc aaronc merged commit b0c73ae into cosmos:master Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:Encoding T: Security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants