-
Notifications
You must be signed in to change notification settings - Fork 15.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Produce a complete spec for proto2 and proto3 #6188
Comments
I wholeheartedly agree that a published official spec would be hugely valuable to the community. A spec should be a lot more than just the grammar. The grammar defines the syntax (and is currently available on the developer site, albeit old/wrong), but a spec should also be clear about semantics. Formal language specs usually interleave the two. The Go spec is a relatively simple one and I think a good example: https://golang.org/ref/spec. It is typical in that it describes the type system and other language constraints with prose. (While there are actually formal ways to describe semantics and type systems, but I don't think I've ever seen them used outside of a compilers course.) The developers site has a lot of information on the semantics. But it would be nice if it were wound up with the syntax in a real language spec. Also, there are some places where documentation is quite light (especially around some of the more complicated aspects, such as how options are validated/interpreted). I believe it is possible to produce a solid spec without adding an experimental parser. The downside is that they can get out of sync (changes in the parser not properly reflected in the spec and vice versa). But, from what I understand, most significant compilers use hand-written parsers for (1) performance or (2) ability to construct more user-friendly error messages when syntax issues are encountered (or both). As far as making sure a hand-written parser stays in sync w/ the published grammar, perhaps a fuzzer could handle that job. I'm thinking something that can examine the spec and from it produce all kinds of valid and invalid inputs. It could then feed those through the hand-written parser to make sure it accepts valid programs and rejects invalid ones, per the grammar. (I guess that could very well be a significant effort as well... TBH, I think most languages don't even try to have that sort of validation and instead use careful code review guidelines to make sure the two don't get out of sync...) |
Yes, this is a very big ask indeed. It is a problem that we are aware of, and would like to address, but it is not something that we have bandwidth to prioritize at this moment. That said, we have created https://github.com/protocolbuffers/protobuf-grammar repo, and would love to get community contribution to it! |
@haon4 good news , I also found that repo few days ago but it seems there is nothing in it |
This as a community project defeats the purpose of creating a definitive grammar/spec that protocolbuffers/protobuf holds itself to - without that component, having a community-written spec is no improvement over the current situation from what I can see, and in fact could create even more fragmentation and confusion, just my two cents. |
It's not so much a "community-written spec" that I see as an issue, but that there needs to be a formal statement that the spec is the authority on correctness, not the C++ In other words, once a spec is written, any divergence from the spec in an implementation should be a considered a bug and the implementation fixed rather than the spec (except for exceptional cases). For example, in the early days of Go there were two implementations of Go, the original |
I'd like to submit a PR to Google's hiring staff to merge @pedgeio onto the team to work on said spec!! 😄 |
To be honest, I highly doubt anyone within Google can come up with a spec and say that it correctly matches the
Once (1) and (2) are done, and we have good confidence that they match what we want, then we can declare that the specification is finalized, and any divergence from the spec should be fixed. If we do need to update the spec in the future, we will need to come up with a process to notify all parser implementations, as well as providing new conformance tests to include the changes in the revision. If you would like to contribute, please send PRs to https://github.com/protocolbuffers/protobuf-grammar. We can help review them. |
There's now a published spec for text format, which is also used in the https://developers.google.com/protocol-buffers/docs/text-format-spec |
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment. This issue is labeled |
Last night, I read Ian Lance Taylor's commentary on Golang being Google's language, and it has a very poignant observation:
I've been using various OSS Protobuf parsers for the better part of 18 months now, and this really hit home. Protobuf is an example of the latter, namely that
protoc
that serves as the definition of what the Protobuf language is. This is fine for most cases, but there is a lot of unlocked potential as to what can be built for Protobuf that is held back by this fact. The easiest example is for linters and formatters, but there are many more, and not just in the static analysis realm. The only way to be 100% certain you have a valid parsing of Protobuf is to shell out toprotoc --descriptor_set_out --include_imports --include_source_info
and use the generatedFileDescriptorSet
, which in itself is unwieldy to use for such a purpose.As illustrative examples, I've been working in Golang primarily for a while, and there are 10-20 Protobuf parsing libraries on GitHub that attempt to handle Protobuf parsing natively in Golang, however there are only two I've found that come close to actually handling the Protobuf spec.
The first library is emicklei/proto. This is an attempt at doing a parse of Protobuf that does not deal with
FileDescriptorSets
, and is relatively easy to use for static analysis since it produces file:line:column information and parses comments. However, even given the extensive effort of the maintainer (Extensive with a capital "E"), it's not even close. First,emicklei/proto
doesn't purport to decide whether your files are valid Protobuf, it will produce a parsed result for files that aren't valid, for example missing semicolons. This means that for any actual use of it, you first need to verify that files are valid Protobuf usingprotoc
. Second, due to all the various undocumented edge cases, especially around options, it took a lot of work to identify and fix many parsing failures. I usedemicklei/proto
to build Prototool's linting and formatting functionality, and through Prototool's adoption, Prototool effectively became a massive integration test onemicklei/proto
, which led us to finding a lot more issues. This trial-and-error method was the only practical way to find all the issues. While most have been found, to this day new things come up. For example, I ranemicklei/proto
through a test yesterday outside of Prototool against googleapis, and just recently a multi-line field declaration was added for the first time, which led to a new issue being filed.The second library, is jhump/protoreflect's desc/protoparse package by @jhump. Josh effectively attempted to codify the Protobuf specification himself, I am in awe of how close this comes to
protoc
.jhump/protoparse
does actually validate whether or not a Protobuf file is valid, and it also generatesFileDescriptorSets
- in fact, Josh wrote goprotoc as aprotoc
replacement in Golang, which comes tantalizingly close to matchingprotoc
-generatedFileDescriptorSets
on a byte-by-byte level. I've been hittingjhump/protoreflect
andjhump/goprotoc
with a tiny-sized battering ram (relative to what we could have, see below) in pedgeio/goprotoctest, and it's getting awfully close - it already produces byte-equivalentFileDescriptorSets
for every directory ingoogleapis
when using--include_imports
. However, through this testing, we still identified some issues that had to be fixed, and there are some outstanding issues that Josh is working on.I can't emphasize enough how impressed I've been with this library - I codified every issue found in emicklei/proto within goprotoctest and it passed with flying colors. However, even given Josh's gargantuan effort here, since there's no official Protobuf spec, we're left to doing trial-and-error against known Protobuf libraries, which doesn't instill a ton of confidence. While I think
goprotoc
is close to matchingprotoc
, it's tough when this is the best that can be practically done.Both of these serve to illustrate the issue with Protobuf being defined by an implementation and not a specification - if you want to do anything that doesn't use the implementation, you're out of luck. There is a proto2 spec and proto3 spec posted on developers.google.com, but the dirty secret is that neither of these are fully accurate, which is why developers like @emicklei and @jhump have to do such massive efforts to get this right. Of note, despite the fact that those specs are not accurate, there are a bunch of parsers relying on those specs (search "protobuf parser" on GitHub and look around if you are interested, or I can follow up with examples) that are being used right now.
This is holding the Protobuf community back - if you want to parse Protobuf files and do anything with them, your options are:
protoc
, which is not great for many reasons, and only producesFileDescriptorSets
which are not always the best representation for certain kinds of work.edge cases codified in
protoc
.There are so many things that could be built with Protobuf if this wasn't the case.
What I'm proposing:
protoc
uses this spec itself for parsing. The specification needs to be become the definition of the language, not theprotoc
implementation, and optimally this means thatprotoc
uses it like any other library would. This is not 100% necessary and could have drawbacks, but given that Protobuf parsing is so quick as it is, it would be great.protoc
already does this since that's the main function ofprotoc
, but if a spec were available, having this in i.e. google.golang.org/protobuf should be much easier. This would help eliminate the slight differences in generatedFileDescriptorSets
in other languages.This is a major task. Actually producing a spec is probably not the biggest issue - even with the options edge cases, this should not be a big blocker. However, moving
protoc
to this is a potentially internet-breaking change if there are bugs. Testing byte-equivalence ofFileDescriptorSets
for a couple repositores like goprotoctest does wouldn't be enough, you'd want to do this integration testing across all of Google's internal Protobuf definitions. There's also the problem of the existing code people may be importing from src/google/protobuf/compiler - unless you bump to 4.0 (which I do not think you want to do, I think this should be a backwards-compatible change), you need to keep that code around.What I'd roughly propose:
FileDescriptorSets
in i.e.src/google/protobuf/compiler2
, and put it's use behind a flag such as--experimental_parsing
as part of a release.FileDescriptorSets
for every definition at Google you are allowed to get your hands on that will runprotoc --descriptor_set_out
andprotoc --descriptor_set_out --experimental parsing
for the matrix of--include_imports
and--include_source_info
on and off. How to do this would be the subject of a lot of debate, some proposals: a massive integration test against the internal monorepo, which will take forever, or rewrite internal Protobuf build rules to call--descriper_set_out
and--descriptor_set_out --experimental_parsing
across theinclude_imports/include_source_info
matrix for any invocation ofprotoc
. If you detect a difference, internally log the path to the files, and the diff. Run this for a few months. This second proposal might not go down very well, as you're going to be logging a lot of internal information you don't want to be, and it will slow down people's builds in theory, although minimally sinceprotoc
is so quick.I know this is a big ask, but Protobuf has been getting a lot more adoption over the last few years, and for it to really go to the next level, we need a common agreed-upon spec to build tooling around. Look at Golang's static analysis tool growth as a great example of why this is so useful - a big part of why I like working in Golang is I have so much tooling I can rely on that was produced outside of Google. Beyond static analysis tools, this could eventually lead to a world where Protobuf code generators no longer need to be written as
protoc
plugins that takeFileDescriptorSets
as input, instead working directly from the Protobuf files themselves.Let me know any thoughts here, this is just a proposal, easier said than done. I appreciate the time you took to read this if you've gotten this far :-)
The text was updated successfully, but these errors were encountered: