Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/embedlite: a low level alternative to embed which can be imported by any std package #51463

Closed
mvdan opened this issue Mar 3, 2022 · 26 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mvdan
Copy link
Member

mvdan commented Mar 3, 2022

See https://go-review.googlesource.com/c/go/+/389834; this change uses go:embed in time/tzdata to remove thousands of lines of code, which is great, but it also runs into an import cycle in the form of time/tzdata -> embed -> io/fs -> context -> time -> time/tzdata.

I think it's clear that it would be a bad outcome to say "some std packages can't embed files". In this particular case it's a net benefit, and at the end of the day I just want to embed a file into a string var, so it's doable in practice. The old code did it by hand, anyway. The tzdata package doesn't need to import io/fs in practice.

I propose that we add a package like internal/embedlite which allows using //go:embed directives with []byte and string, and is hard-coded by the toolchain to be allowed for use in std, but it forbids the use of embedding into io/fs.FS - for those use cases, just use plain old embed.

The package itself would be practically a placeholder, with no exposed API. It would also have no dependencies other than the bare minimum, the runtime. The toolchain would learn about internal/reflectlite, just like it knows about embed, allowing the use of embedding into bytes of strings if either package is imported. So then, time/tzdata could do:

import _ "internal/embedlite"

//go:embed zoneinfo.zip
var zoneinfo string
@mvdan mvdan changed the title internal/embedlite: a low level alternative to embed which can be imported by any std packages internal/embedlite: a low level alternative to embed which can be imported by any std package Mar 3, 2022
@mvdan
Copy link
Member Author

mvdan commented Mar 3, 2022

I should note that, in theory, any runtime or syscall related package should be able to import this internal/embedlite package, but I don't think that is a requirement to begin with. If packages like time/tzdata can import it, that's all I need for now.

@mvdan
Copy link
Member Author

mvdan commented Mar 3, 2022

cc @rsc as the owner of time, @bradfitz since he collaborated on the original embed proposal, and @mdempsky @cherrymui given their involvement in the original proposal discussion and CLs :)

@mvdan
Copy link
Member Author

mvdan commented Mar 4, 2022

I forgot to say: I'm happy to do the implementation as a separate CL if this sounds like a good idea. I imagine it shouldn't be too difficult, in terms of teaching the compiler about the new special package.

@beoran
Copy link

beoran commented Mar 4, 2022

This would also be very useful outside the compiler. Sometimes a string or byte array is all we need...

@mvdan
Copy link
Member Author

mvdan commented Mar 4, 2022

@beoran not sure I understand what you mean; my use case is for time/tzdata, which is outside the compiler.

If you mean it would also be useful outside the standard library, to not pull in io/fs as a dependency, that's not really part of my proposal given that it would be an internal package. io/fs doesn't seem like a particularly heavy dependency to worry about; it's only worrysome for time/tzdata given the import cycles. Unless you have compelling examples you can bring up?

@mvdan
Copy link
Member Author

mvdan commented Mar 4, 2022

Also, to throw an alternative idea out there: we could relax the rule that the embed package must always be imported for //go:embed to work, then the indirect dependency on io/fs causing the import cycle would no longer be a problem.

I can't say whether this would be better or worse than internal/embedlite as a solution. One key difference is that we would change how embed works outside of the standard library, though, as we would relax the "must import" rule. So perhaps a middle ground is to only relax the rule for the standard library.

@beoran
Copy link

beoran commented Mar 4, 2022

Relaxing the requirement for the embed import, also outside of the standard library is a great idea. Many older tools similar to //go:embed would just generate strings or byte arrays and not require any special package imports. That is the use case I am thinking of.

@antichris
Copy link

#43217 was declined at the time due to the concerns expressed in #43217 (comment). But, might it be worth revisiting that decision soon-ish, when as of 1.18 all supported Go versions support //go:embed?

@mvdan
Copy link
Member Author

mvdan commented Mar 4, 2022

@antichris thanks for that link, I hadn't noticed that thread last year. I tend to agree with @bcmills's argument that requiring an import does solve multiple issues with old Go versions. I don't think we can call that edge case solved once enough Go versions have passed; there will always be some people trying to use ancient versions of Go, and in the case that there are no other build errors, we don't want go:embed to silently not work.

So, to reiterate, my proposal is to either:

  1. Add a practically empty internal/embedlite package, only available to std, and teach the toolchain that any package importing it is allowed to embed into string or []byte
  2. Remove the restriction to import embed to use go:embed for any package inside the standard library

Either way, this proposal is not about third party packages using go:embed. Any such proposal should be separate, in my opinion.

@mvdan
Copy link
Member Author

mvdan commented Mar 6, 2022

The more I think about it, the more I lean towards option 2 in my last comment: relax the import "embed" requirement for packages in the standard library. This is easier to implement and maintain (no need to add a dummy package), and will not require teaching src/go/build/deps_test.go about a new dummy package either.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 7, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/391455 mentions this issue: cmd/compile,go/build: remove embed import restriction for std

@martin-sucha
Copy link
Contributor

Isn't the check for importing embed also used to skip some work if no embed is used in the file? Does option 2 have any performance impact on compiling the standard library or is the effect negligible?

@mvdan
Copy link
Member Author

mvdan commented Mar 11, 2022

That's a good question, and one that I pondered as well. If you look at the CL above, you'll notice that the change to the compiler itself is just a boolean check, so it should have no impact.

The other change is go/build.Context.matchFile, which gets used by a number of tools including go build. Right now, what it does is only collect the //go:embed directives if it can first find the embed package import. I've changed it so that, if the package being imported is part of the standard library, it will collect the //go:embed directives anyway.

The worst case scenario for this second change is a Go file belonging to a standard library package which does not import embed nor have any //go:embed comments, as then we do extra work. However, the extra work consists of a single call to findEmbed, which will tokenize the entire file looking for the non-existent directives. I personally don't imagine this will be a noticeable slow-down, but I haven't measured it either. It's worth noting that go/build itself has no benchmarks, and its contribution to go build std will most likely be negligible. I'm not sure what would be a better or more realistic way to measure the impact.

If we are worried about the impact, the alternative is internal/embedlite. I would prefer to get some input from the owners of cmd/go and go/build before I spend more time on the CL, as right now it's more of a working prototype :)

@mvdan
Copy link
Member Author

mvdan commented Mar 17, 2022

I'm actually running into cmd/go test timeouts when testing with the data race detector, and with @bcmills we're thinking that the slow-down from scanning more Go files does indeed add up. See https://go-review.googlesource.com/c/go/+/389834/4#message-38911a2b286de7d061b1ba36f7ed2c77dfbb046b.

So I'm thinking that my reasoning above with regards to performance is wrong, and we should go back to the original idea of internal/embedlite. Thoughts? (please read the CL before commenting)

@mvdan
Copy link
Member Author

mvdan commented Apr 10, 2022

So I'm thinking that my reasoning above with regards to performance is wrong, and we should go back to the original idea of internal/embedlite. Thoughts? (please read the CL before commenting)

I've now implemented internal/embedlite; see https://go-review.googlesource.com/c/go/+/391455.

@rsc
Copy link
Contributor

rsc commented May 5, 2022

Is there some other way we can do this? It bothers me that any analyzer that wants to understand embed (or at least understand package time) now has to know that "internal/embedlite" is the same as "embed". That's not a very internal detail at that point.

@rsc
Copy link
Contributor

rsc commented May 5, 2022

Continuing my previous comment: For example, if staticcheck knew about embed, it would also need to know about internal/embedlite.

I think it's clear that it would be a bad outcome to say "some std packages can't embed files".

For what it's worth, it's not as clear to me. Today we say "some std packages can't import fmt", because we value having an acyclic dependency graph more than having fmt everywhere. It seems OK to say "some std packages can't import embed" for the same reason. The specific set of (non-internal) std packages is: errors, io, io/fs, math/bits, path, runtime, sort, sync, sync/atomic, syscall, time, unicode/utf8, unsafe. Of these, essentially all of these shouldn't have large embedded data. The only (partial) exception is time. Working around the problem in time seems better than creating a new name for embed that every analyzer needs to know.

The current tzdata generator makes short strings and +'s them together, resulting in a parsed AST of depth 7,000. It is probably better to just use one very long string literal. It's not like diffs will be meaningful with shorter lines. @bcmills is going to look into that. (If the long literal is problematic, another option is +'ed together shorter strings but using parentheses to make the AST a balanced binary tree).

Worst case we could special case the timetzdata build tag in some way in cmd/go. That would still be less overall impact to analyzers than internal/embedlite.

@seankhliao
Copy link
Member

If it's just for that small set, can it use //go:linkname?

@bcmills
Copy link
Contributor

bcmills commented May 5, 2022

Just for kicks I tried encoding the zipfile as a UTF-8 string with the bare minimum of escaping for a valid Go source file — specifically, encoding \\, \", \n, \x00, \uFEFF, and sequences that aren't valid UTF-8 — but that only brought the file down from 1.4M to 1.2M. (Apparently the zipfile contains a lot of \x00? 🤔)

So that's not much of an advantage over just using strconv.Quote to reduce it to a single line. 😞

@bcmills
Copy link
Contributor

bcmills commented May 5, 2022

@seankhliao, the problem is that the user code only imports time. If we use go:linkname to put the go:embed in anything imported by time we end up with the import cycle. If we use go:linkname to put it in anything not imported by time then we end up with an unresolved reference.

@ianlancetaylor
Copy link
Member

Note that although the string constant in zipdata.go is in the form of a zip file, it is not compressed. So, yes, it has a lot of zero bytes. We could only compress the data if we wanted to implement some sort of decompression in loadFromEmbeddedTZData.

@ianlancetaylor
Copy link
Member

And, to be clear, the string constant doesn't have to be in the form of a zip file at all. That was just convenient because we are starting with a zip file ($GOROOT/lib/time/zoneinfo.zip). But generate_zipdata.go could parse that zip file and produce any format we like. I think the main constraint is that the data has to take the form of a string constant, so that importing time/tzdata as a backup data source does not increase program startup time or memory usage.

@rsc
Copy link
Contributor

rsc commented May 5, 2022

The nice thing about the uncompressed zip file is time not depending on a compression algorithm. :-)

@rsc
Copy link
Contributor

rsc commented May 5, 2022

Another thing you could do is make cmd/dist build it as one of the generated z-files. Now that it's just 'printf %q' that seems entirely reasonable.

@mvdan
Copy link
Member Author

mvdan commented May 9, 2022

It bothers me that any analyzer that wants to understand embed (or at least understand package time) now has to know that "internal/embedlite" is the same as "embed".

You're right, I hadn't thought about that. I guess that also applies to my original idea to lift the restriction on importing embed, because then some tools might also have to learn that quirk, even if we teach go/build.Import about it.

The current tzdata generator makes short strings and +'s them together, resulting in a parsed AST of depth 7,000. It is probably better to just use one very long string literal. It's not like diffs will be meaningful with shorter lines. @bcmills is going to look into that. (If the long literal is problematic, another option is +'ed together shorter strings but using parentheses to make the AST a balanced binary tree).

Funnily enough, that is exactly the approach I took for a different package in https://go-review.googlesource.com/c/go/+/380474. The difference there is that we want a const string, so no embedding. And the contents aren't duplicated elsewhere in the git tree, anyway, so moving the bytes out of a Go file and into an embedded file wouldn't be a huge difference.

I think Bryan's solution in https://go-review.googlesource.com/c/go/+/404435 is similarly OK as a fix for the build/reformat/tooling slowness. But quoting Bryan from the CL, it still leaves us with the two other wrinkles:

  1. redundant bytes included in the go distribution,
  2. an extra go generate step when updating the tzdata,

Another thing you could do is make cmd/dist build it as one of the generated z-files. Now that it's just 'printf %q' that seems entirely reasonable.

That seems like the best solution overall, then. There is still a code generation step of sorts, but it's not done manually by the developer, and its result is not committed into git, meaning there isn't duplication in the source distribution (even if there is still duplication in binary archives).

@dmitshur dmitshur added this to the Backlog milestone May 30, 2022
@mvdan
Copy link
Member Author

mvdan commented Jun 22, 2022

Per my comment above, I'd be OK with either using a single long string and checking it into VCS (Bryan's suggestion) or teaching cmd/dist to generate that single long string (Russ's suggestion). In any case, I no longer think internal/embedlite is a good suggestion, so I'm going to retract this light proposal in favor of #43350, the more general and older thread on this topic.

@mvdan mvdan closed this as completed Jun 22, 2022
@golang golang locked and limited conversation to collaborators Jun 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests