x/build: darwin builders occasionally failing #12979
Comments
I just had the same thing on my macbook, so it may be unrelated to system time (I modified runtime/) |
If you modified the runtime package, that seems like the correct error message. This bug is specific to why this error occurs spuriously with the builders during test sharding. |
To repro on a Mac,
Next question: what's wrong with the 63e90c3.tar.gz file? |
Perhaps @shanemhansen wants to join in this bug hunt... :) In
func (tw *Writer) writeHeader(hdr *Header, allowPax bool) error {
	...
	// TODO(shanemhansen): we might want to use PAX headers for
	// subsecond time resolution, but for now let's just capture
	// too long fields or non ascii characters
	...
	tw.numeric(s.next(12), modTime, false, paxNone, nil) // 136:148 --- consider using pax for finer granularity
So I guess we're only storing second-granularity modtimes. |
The modtime for VERSION sticks out. Every other file in that archive has an mtime from yesterday.
|
If I dump the modtime details,
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
	"time"
)

func main() {
	f, err := os.Open("/Users/bradfitz/Downloads/63e90c3fca85dc06294ed12e282c7901a7d7986b.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	zr, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(zr)
	for {
		h, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%v %v %s\n", h.ModTime.Format(time.RFC3339), h.FileInfo().Mode(), h.Name)
	}
}
And print them sorted,
You see that We probably need to sanitize the modtimes somewhere when we generate or read tarballs. |
Maybe a crude workaround would be to touch the .a files before archiving. |
Yeah, I verified that works and that's probably what I'll do, but I'd like to understand where the discrepancy comes from first. It's probably some timezone issue between the various servers involved in the build system. |
Touching runtime.a will make every other package archive out of date
because everything depends on runtime.
This will have the side effect of rebuilding every non-runtime package
during testing.
|
Can't they all be touched to the same time? |
Yeah, that's a viable approach as long as their timestamps are
all after their source.
What if we force the order of creating the tarball? We add src/
first and then pkg/? (To catch the problem though, we probably
want to shell out to system tar to create a backup of the
environment so that we can later debug what has happened.)
|
There is no "system tar" on Windows and Plan 9. And I don't see how changing the order of creating the tarball matters if the timestamps are already wrong on the filesystem. |
To me this bug looks like another manifestation of I wonder if we're being caught by a regression. see the comment: "The mtime
|
I'm writing a fix for this now and am testing. On my local machine during tests of the modtime verification and rewrite function, I found it triggering after a fresh local nuke + make.bash, with:
... which is something I hadn't thought of before. Fix ongoing. |
CL https://golang.org/cl/16400 mentions this issue. |
Well, I deployed https://golang.org/cl/16400 and now every build fails on OS X. That's the opposite of the intended effect, but very interesting. /cc @adg @randall77 |
https://storage.googleapis.com/go-build-snap/go/darwin-amd64-10_10/86b0a658b27607d128471a7ac52fe0c11bfde97d.tar.gz is a snapshot of a build which just failed: https://storage.googleapis.com/go-build-log/86b0a658/darwin-amd64-10_10_716f5279.log But untarring it locally on my Mac, I no longer see the runtime as stale:
... which seems like progress, at least. And:
@crawshaw and I have lost access to some of the Mac minis racked in a Google datacenter. They're alive and restart when we push new binaries to GCS, but we can't ssh to them anymore. Because at least one of those machines was responsible for the failures recently, I wanted to reproduce there or check their timezones. This is frustrating. I think I'll just disable test sharding on OS X for now. |
Sent https://go-review.googlesource.com/16440 to disable OS X sharding for now. |
Workaround until the Mac failures are understood.
Updates golang/go#12979
Change-Id: I15b9ea8f4b708ebf9b7c6ad61e65d0f9eaaa6d73
Reviewed-on: https://go-review.googlesource.com/16440
Reviewed-by: David Crawshaw <[email protected]>
Despite disabling sharding, I just saw this fail again on a trybot run from @aclements: https://storage.googleapis.com/go-build-log/8e3ce754/darwin-amd64-10_10_26784189.log
Coordinator said: (confirming it only used one mac, "stadium2")
The mystery grows. |
Are we sure this is related to the builders and not generally to Darwin? For example, part of determining staleness is comparing mtimes, and HFS+ mtimes have only 1 second resolution. When you disabled sharding, the runtime didn't import any packages, so all that mattered was how the source mtimes compared with the runtime.a mtime and the sources were unpacked well before the runtime was built. Now, however, the runtime imports two internal packages that were probably built a fraction of a second before the runtime itself. They could easily have equal mtimes on HFS+, though that won't cause staleness. However, if mtimes went slightly backwards (NFS adjustment?), it could trigger this. |
Hopefully it's a problem like that. I'm glad to see this happen on a single machine rather than N. Perhaps we can add some strategically-placed time.Now() snapshots in the runtime tests to include in the error message, to catch if time is ever going backwards? |
How about just running a test loop on a trybot for a while? It may actually need to interact with the file system and see what mtimes it gets. |
CL https://golang.org/cl/18085 mentions this issue. |
Now that we have Mac capacity again, I'm going to re-enable OS X trybot sharding and see what happens here. /cc @josharian |
Ignore the old darwin-{amd64,386}-10_10 builders. Don't give them an error, but pretend they don't exist.
Also: switch trybots from OS X 10.10 to OS X 10.11, and re-enable sharding. Let's hope for the best. See golang/go#12979.
This also enables subrepo tests for all OS X versions.
darwin-386-* is currently offline, pending some golang/go#17009
Updates golang/go#9495 (OS X virtualization)
Change-Id: I4d53a79087404b5e8051d1aff0c668a92625f442
Reviewed-on: https://go-review.googlesource.com/28583
Reviewed-by: Brad Fitzpatrick <[email protected]>
2 weeks and seems fine. Closing. |
The mac builders are occasionally failing with:
This seems to be an issue related to the system time between the 4+ builders, and/or the time granularity in our .tar.gz snapshots when we shard tests between machines. (We build on one, snapshot it to a tar.gz, and then untar on N other machines....)
It appears that on a machine later running the tests from the snapshot, it thinks the runtime is stale when it actually isn't.
I haven't debugged yet.
/cc @adg @crawshaw