Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build: darwin builders occasionally failing #12979

Closed
bradfitz opened this issue Oct 18, 2015 · 25 comments
Closed

x/build: darwin builders occasionally failing #12979

bradfitz opened this issue Oct 18, 2015 · 25 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Milestone

Comments

@bradfitz
Copy link
Contributor

The mac builders are occasionally failing with:

--- FAIL: TestCgoCrashHandler (0.03s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoSignalDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoTraceback (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoCallbackGC (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoExternalThreadPanic (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoExternalThreadSIGPROF (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCgoExternalThreadSignal (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCrashHandler (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestSimpleDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestInitDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestLockedDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestLockedDeadlock2 (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGoexitDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestStackOverflow (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestThreadExhaustion (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestRecursivePanic (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGoexitCrash (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGoNil (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestMainGoroutineId (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestNoHelperGoroutines (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestBreakpoint (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGoexitInPanic (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestPanicAfterGoexit (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestRecoveredPanicAfterGoexit (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestNetpollDeadlock (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestCrashDumpsAllThreads (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGcSys (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestGCFairness (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestLargeStringConcat (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
FAIL

This seems to be an issue related to the system time between the 4+ builders, and/or the time granularity in our .tar.gz snapshots when we shard tests between machines. (We build on one, snapshot it to a tar.gz, and then untar on N other machines....)

It appears that on a machine later running the tests from the snapshot, it thinks the runtime is stale when it actually isn't.

I haven't debugged yet.

/cc @adg @crawshaw

@bradfitz bradfitz added the Builders x/build issues (builders, bots, dashboards) label Oct 18, 2015
@nodirt
Copy link
Contributor

nodirt commented Oct 18, 2015

I just had the same thing on my macbook, so it may be unrelated to system time (I modified runtime/)

@bradfitz
Copy link
Contributor Author

If you modified the runtime package, that seems like the correct error message.

This bug is specific to why this error occurs spuriously with the builders during test sharding.

@bradfitz
Copy link
Contributor Author

To repro on a Mac,

  1. download https://storage.googleapis.com/go-build-snap/go/darwin-amd64-10_10/63e90c3fca85dc06294ed12e282c7901a7d7986b.tar.gz
  2. mkdir /tmp/issue12979
  3. cd /tmp/issue12979 && tar -zxf 63e90c3fca85dc06294ed12e282c7901a7d7986b.tar.gz
  4. GOROOT=/tmp/issue12979 ./bin/go test -short runtime

Next question: what's wrong with the 63e90c3.tar.gz file?

@bradfitz
Copy link
Contributor Author

Perhaps @shanemhansen wants to join in this bug hunt... :)

In go/src/archive/tar/writer.go I see:

func (tw *Writer) writeHeader(hdr *Header, allowPax bool) error {
...
        // TODO(shanemhansen): we might want to use PAX headers for
        // subsecond time resolution, but for now let's just capture
        // too long fields or non ascii characters
...
        tw.numeric(s.next(12), modTime, false, paxNone, nil)  // 136:148 --- consider using pax for finer granularity

So I guess we're only storing second-granularity modtimes.

@shanemhansen
Copy link
Contributor

The modtime for VERSION sticks out. Every other file in that archive has an mtime from yesterday.

shane@sherlock ~/foo $ ls -haln VERSION 
-rw-r--r-- 1 502 20 46 Dec 31  1969 VERSION

@bradfitz
Copy link
Contributor Author

If I dump the modtime details,

package main

import (
        "archive/tar"
        "compress/gzip"
        "fmt"
        "io"
        "log"
        "os"
        "time"
)

func main() {
        f, err := os.Open("/Users/bradfitz/Downloads/63e90c3fca85dc06294ed12e282c7901a7d7986b.tar.gz")
        if err != nil {
                log.Fatal(err)
        }
        zr, err := gzip.NewReader(f)
        if err != nil {
                log.Fatal(err)
        }
        tr := tar.NewReader(zr)
        for {
                h, err := tr.Next()
                if err == io.EOF {
                        break
                }
                if err != nil {
                        log.Fatal(err)
                }
                fmt.Printf("%v %v %s\n", h.ModTime.Format(time.RFC3339), h.FileInfo().Mode(), h.Name)
        }
}

And print them sorted,

$ go run dumptar.go | grep runtime | sort | less 
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/cgo/
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/debug/
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/pprof/
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/race/
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/race/testdata/
2015-10-18T10:27:23-07:00 drwxr-xr-x src/runtime/trace/
2015-10-18T10:27:30-07:00 -rw-r--r-- src/runtime/zversion.go
2015-10-18T10:27:30-07:00 drwxr-xr-x src/runtime/
2015-10-18T10:27:46-07:00 -rw-r--r-- pkg/darwin_amd64/runtime.a
...
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/error.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/export_arm_test.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/export_futex_test.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/export_linux_test.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/export_test.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/export_windows_test.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/extern.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/fastlog2.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/fastlog2_test.go
...
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/vlrt.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/wbfat.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/wbfat_gen.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/wincallback.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zcallback_windows.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zcallback_windows.s
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_386.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_amd64.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_amd64p32.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_arm.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_arm64.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_ppc64.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoarch_ppc64le.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_android.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_darwin.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_dragonfly.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_freebsd.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_linux.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_nacl.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_netbsd.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_openbsd.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_plan9.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_solaris.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/runtime/zgoos_windows.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/sync/runtime.go
2015-10-18T10:28:01-07:00 -rw-r--r-- src/sync/runtime_sema_test.go
2015-10-18T10:28:02-07:00 -rw-r--r-- pkg/darwin_amd64/runtime/cgo.a
2015-10-18T10:28:02-07:00 -rw-r--r-- test/runtime.go
2015-10-18T10:28:02-07:00 drwxr-xr-x pkg/darwin_amd64/runtime/

You see that runtime.a is older than its source files in src/runtime/.

We probably need to sanitize the modtimes somewhere when we generate or read tarballs.

@shanemhansen
Copy link
Contributor

Maybe a ghetto solution can be to touch .a files before archiving.

@bradfitz
Copy link
Contributor Author

Yeah, I verified that works and that's probably what I'll do, but I'd like to understand where the discrepancy comes from first. It's probably some timezone issue between the various servers involved in the build system.

@minux
Copy link
Member

minux commented Oct 19, 2015 via email

@bradfitz
Copy link
Contributor Author

Can't they all be touched to the same time?

@minux
Copy link
Member

minux commented Oct 19, 2015 via email

@bradfitz
Copy link
Contributor Author

There is no "system tar" on Windows and Plan 9.

And I don't see how changing the order of creating the tarball matters if the timestamps are wrong on on the filesystem already.

@shanemhansen
Copy link
Contributor

To me this bug looks like another manifestation of
#2775

I wonder if we're being caught by a regression. see the comment: "The mtime
hack appears to have stopped working in go1.5rc1."

On Mon, Oct 19, 2015 at 12:00 PM, Brad Fitzpatrick <[email protected]

wrote:

There is no "system tar" on Windows and Plan 9.

And I don't see how changing the order of creating the tarball matters if
the timestamps are wrong on on the filesystem already.


Reply to this email directly or view it on GitHub
#12979 (comment).

@rsc rsc added this to the Unreleased milestone Oct 23, 2015
@bradfitz
Copy link
Contributor Author

I'm writing a fix for this now and am testing. On my local machine during tests of the modtime verification and rewrite function, I found it triggering after a fresh local nuke + make.bash, with:

=== RUN   TestRewriteModtimes
2015/10/28 02:48:38 Input "src/runtime/zversion.go" (2015-10-28 02:42:22.129932984 +0000 UTC) after minout of "2015-10-28 02:42:17.09807515 +0000 UTC"
2015/10/28 02:48:38 Input "src/cmd/go/zdefaultcc.go" (2015-10-28 02:42:30.873685944 +0000 UTC) after minout of "2015-10-28 02:42:17.09807515 +0000 UTC"
2015/10/28 02:48:38 maxIn=2015-10-28 02:42:30.873685944 +0000 UTC out=[2015-10-28 02:42:17.09807515 +0000 UTC 2015-10-28 02:43:26.004128326 +0000 UTC] now=%!v(MISSING)
2015/10/28 02:48:38 true true false true true
...

... which is something I hadn't thought of before.

Fix ongoing.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16400 mentions this issue.

@bradfitz
Copy link
Contributor Author

Well, I deployed https://golang.org/cl/16400 and now every build fails on OS X. That's the opposite of the intended effect, but very interesting.

/cc @adg @randall77

@bradfitz
Copy link
Contributor Author

https://storage.googleapis.com/go-build-snap/go/darwin-amd64-10_10/86b0a658b27607d128471a7ac52fe0c11bfde97d.tar.gz is a snapshot of a build which just failed:

https://storage.googleapis.com/go-build-log/86b0a658/darwin-amd64-10_10_716f5279.log

But untarring it locally on my Mac, I no longer see the runtime as stale:

brad5k:go $ pwd
/tmp/go
brad5k:go $ tar -zxf ~/Downloads/86b0a658b27607d128471a7ac52fe0c11bfde97d.tar.gz 
brad5k:go $ export GOROOT=/tmp/go
brad5k:go $ $GOROOT/bin/go list -f '{{.Stale}}' runtime
false

... which seems like progress, at least.

And:

brad5k:go $ $GOROOT/bin/go tool dist test --no-rebuild
...
...
...
ALL TESTS PASSED

@crawshaw and I have lost access to some of the Mac minis racked in a Google datacenter. They're alive and restart when we push new binaries to GCS, but we can't ssh to them anymore. Because at least one of those machines was responsible for the failures recently, I wanted to reproduce there or check their timezones.

This is frustrating.

I think I'll just disable test sharding on OS X for now.

@bradfitz
Copy link
Contributor Author

Sent https://go-review.googlesource.com/16440 to disable OS X sharding for now.

bradfitz added a commit to golang/build that referenced this issue Oct 28, 2015
Workaround until the Mac failures are understood.

Updates golang/go#12979

Change-Id: I15b9ea8f4b708ebf9b7c6ad61e65d0f9eaaa6d73
Reviewed-on: https://go-review.googlesource.com/16440
Reviewed-by: David Crawshaw <[email protected]>
@bradfitz
Copy link
Contributor Author

Despite disabling sharding, I just saw this fail again on a trybot run from @aclements:

https://storage.googleapis.com/go-build-log/8e3ce754/darwin-amd64-10_10_26784189.log

ok      regexp  0.132s
ok      regexp/syntax   0.369s
--- FAIL: TestCgoCrashHandler (0.03s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
... (many) ...
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
--- FAIL: TestLargeStringConcat (0.00s)
    crash_test.go:110: Stale runtime.a. Run 'go install runtime'.
FAIL
FAIL    runtime 2.196s

Coordinator said: (confirming it only used one mac, "stadium2")

         2015-12-13T20:05:45Z get_buildlet 
   +0.0s 2015-12-13T20:05:45Z got_machine http://stadium2 reverse peer stadium2/208.52.191.51:51470 for modes [darwin-amd64-10_10]
   +0.1s 2015-12-13T20:05:45Z cleaned_up stadium2
   +0.0s 2015-12-13T20:05:45Z got_buildlet stadium2
   +0.2s 2015-12-13T20:05:45Z start_write_snapshot_tar 
   +5.9s 2015-12-13T20:05:51Z end_write_snapshot_tar 
   +0.0s 2015-12-13T20:05:51Z pre_exec 
   +0.1s 2015-12-13T20:05:51Z discovering_tests 
   +0.4s 2015-12-13T20:05:52Z starting_tests 194 tests
   +0.0s 2015-12-13T20:05:52Z start_tests stadium2: [go_test:archive/tar go_test:archive/zip go_test:bufio go_test:bytes go_test:compress/bzip2 go_test:compress/flate go_test:compress/gzip go_test:compress/lzw go_test:compress/zlib go_test:container/heap go_test:container/list go_test:container/ring go_test:crypto/aes go_test:crypto/cipher go_test:crypto/des go_test:crypto/dsa go_test:crypto/ecdsa go_test:crypto/elliptic go_test:crypto/hmac go_test:crypto/md5]
   +8.9s 2015-12-13T20:06:01Z end_tests stadium2: [go_test:archive/tar go_test:archive/zip go_test:bufio go_test:bytes go_test:compress/bzip2 go_test:compress/flate go_test:compress/gzip go_test:compress/lzw go_test:compress/zlib go_test:container/heap go_test:container/list go_test:container/ring go_test:crypto/aes go_test:crypto/cipher go_test:crypto/des go_test:crypto/dsa go_test:crypto/ecdsa go_test:crypto/elliptic go_test:crypto/hmac go_test:crypto/md5]; ok (test exec = 8.931867886s)
   +0.0s 2015-12-13T20:06:01Z start_tests stadium2: [go_test:crypto/rand go_test:crypto/rc4 go_test:crypto/rsa go_test:crypto/sha1 go_test:crypto/sha256 go_test:crypto/sha512 go_test:crypto/subtle go_test:crypto/tls go_test:crypto/x509 go_test:database/sql go_test:database/sql/driver go_test:debug/dwarf go_test:debug/elf go_test:debug/gosym]
  +17.7s 2015-12-13T20:06:18Z end_tests stadium2: [go_test:crypto/rand go_test:crypto/rc4 go_test:crypto/rsa go_test:crypto/sha1 go_test:crypto/sha256 go_test:crypto/sha512 go_test:crypto/subtle go_test:crypto/tls go_test:crypto/x509 go_test:database/sql go_test:database/sql/driver go_test:debug/dwarf go_test:debug/elf go_test:debug/gosym]; ok (test exec = 17.680212834s)
   +0.0s 2015-12-13T20:06:18Z start_tests stadium2: [go_test:debug/macho go_test:debug/pe go_test:debug/plan9obj go_test:encoding/ascii85 go_test:encoding/asn1 go_test:encoding/base32 go_test:encoding/base64 go_test:encoding/binary go_test:encoding/csv go_test:encoding/gob go_test:encoding/hex go_test:encoding/json go_test:encoding/pem go_test:encoding/xml go_test:errors go_test:expvar go_test:flag]
  +13.7s 2015-12-13T20:06:32Z end_tests stadium2: [go_test:debug/macho go_test:debug/pe go_test:debug/plan9obj go_test:encoding/ascii85 go_test:encoding/asn1 go_test:encoding/base32 go_test:encoding/base64 go_test:encoding/binary go_test:encoding/csv go_test:encoding/gob go_test:encoding/hex go_test:encoding/json go_test:encoding/pem go_test:encoding/xml go_test:errors go_test:expvar go_test:flag]; ok (test exec = 13.72867714s)
   +0.0s 2015-12-13T20:06:32Z start_tests stadium2: [go_test:fmt go_test:go/ast go_test:go/build go_test:go/constant go_test:go/doc go_test:go/format go_test:go/internal/gccgoimporter go_test:go/internal/gcimporter go_test:go/parser go_test:go/printer go_test:go/scanner go_test:go/token]
   +9.4s 2015-12-13T20:06:41Z end_tests stadium2: [go_test:fmt go_test:go/ast go_test:go/build go_test:go/constant go_test:go/doc go_test:go/format go_test:go/internal/gccgoimporter go_test:go/internal/gcimporter go_test:go/parser go_test:go/printer go_test:go/scanner go_test:go/token]; ok (test exec = 9.396741208s)
   +0.0s 2015-12-13T20:06:41Z start_tests stadium2: [go_test:go/types go_test:hash/adler32 go_test:hash/crc32 go_test:hash/crc64 go_test:hash/fnv go_test:html go_test:html/template go_test:image go_test:image/color go_test:image/draw go_test:image/gif go_test:image/jpeg]
  +11.4s 2015-12-13T20:06:53Z end_tests stadium2: [go_test:go/types go_test:hash/adler32 go_test:hash/crc32 go_test:hash/crc64 go_test:hash/fnv go_test:html go_test:html/template go_test:image go_test:image/color go_test:image/draw go_test:image/gif go_test:image/jpeg]; ok (test exec = 11.438762512s)
   +0.0s 2015-12-13T20:06:53Z start_tests stadium2: [go_test:image/png go_test:index/suffixarray go_test:internal/singleflight go_test:internal/trace go_test:io go_test:io/ioutil go_test:log go_test:log/syslog go_test:math go_test:math/big go_test:math/cmplx go_test:math/rand]
  +10.1s 2015-12-13T20:07:03Z end_tests stadium2: [go_test:image/png go_test:index/suffixarray go_test:internal/singleflight go_test:internal/trace go_test:io go_test:io/ioutil go_test:log go_test:log/syslog go_test:math go_test:math/big go_test:math/cmplx go_test:math/rand]; ok (test exec = 10.132701592s)
   +0.0s 2015-12-13T20:07:03Z start_tests stadium2: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable go_test:net]
  +14.2s 2015-12-13T20:07:17Z end_tests stadium2: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable go_test:net]; ok (test exec = 14.193246024s)
   +0.0s 2015-12-13T20:07:17Z start_tests stadium2: [go_test:net/http]
  +22.2s 2015-12-13T20:07:39Z end_tests stadium2: [go_test:net/http]; ok (test exec = 22.187587764s)
   +0.0s 2015-12-13T20:07:39Z start_tests stadium2: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi go_test:net/http/httptest go_test:net/http/httputil go_test:net/http/internal go_test:net/internal/socktest go_test:net/mail go_test:net/rpc go_test:net/rpc/jsonrpc go_test:net/smtp go_test:net/textproto]
  +13.0s 2015-12-13T20:07:52Z end_tests stadium2: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi go_test:net/http/httptest go_test:net/http/httputil go_test:net/http/internal go_test:net/internal/socktest go_test:net/mail go_test:net/rpc go_test:net/rpc/jsonrpc go_test:net/smtp go_test:net/textproto]; ok (test exec = 12.980615017s)
   +0.0s 2015-12-13T20:07:52Z start_tests stadium2: [go_test:net/url go_test:os go_test:os/exec go_test:os/signal go_test:os/user go_test:path go_test:path/filepath]
  +11.2s 2015-12-13T20:08:04Z end_tests stadium2: [go_test:net/url go_test:os go_test:os/exec go_test:os/signal go_test:os/user go_test:path go_test:path/filepath]; ok (test exec = 11.159909722s)
   +0.0s 2015-12-13T20:08:04Z start_tests stadium2: [go_test:reflect go_test:regexp go_test:regexp/syntax]
   +6.4s 2015-12-13T20:08:10Z end_tests stadium2: [go_test:reflect go_test:regexp go_test:regexp/syntax]; ok (test exec = 6.383222013s)
   +0.0s 2015-12-13T20:08:10Z start_tests stadium2: [go_test:runtime]
   +9.4s 2015-12-13T20:08:19Z end_tests stadium2: [go_test:runtime]; test failed remotely (test exec = 9.361608336s)
   +0.0s 2015-12-13T20:08:19Z start_tests stadium2: [go_test:runtime/debug go_test:runtime/internal/atomic go_test:runtime/pprof go_test:runtime/trace go_test:sort]
   +0.0s 2015-12-13T20:08:19Z done with test failures
   +0.0s 2015-12-13T20:08:19Z end_tests stadium2: [go_test:runtime/debug go_test:runtime/internal/atomic go_test:runtime/pprof go_test:runtime/trace go_test:sort]; commErr=buildlet: Client closed (test exec = 30.790433ms)
   +0.0s 2015-12-13T20:08:19Z main_buildlet_broken stadium2

The mystery grows.

@aclements
Copy link
Member

Are we sure this is related to the builders and not generally to Darwin? For example, part of determining staleness is comparing mtimes, and HFS+ mtimes have only 1 second resolution. When you disabled sharding, the runtime didn't import any packages, so all that mattered was how the source mtimes compared with the runtime.a mtime and the sources were unpacked well before the runtime was built. Now, however, the runtime imports two internal packages that were probably built a fraction of a second before the runtime itself. They could easily have equal mtimes on HFS+, though that won't cause staleness. However, if mtimes went slightly backwards (NFS adjustment?), it could trigger this.

@bradfitz
Copy link
Contributor Author

Hopefully it's a problem like that. I'm glad to see this happen on a single machine rather than N.

Perhaps we can add some strategically-placed time.Now() snapshots in the runtime tests to include in the error message, to catch if time is ever going backwards?

@aclements
Copy link
Member

How about just running a test loop on a trybot for a while? It may actually need to interact with the file system and see what mtimes it gets.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/18085 mentions this issue.

@bradfitz
Copy link
Contributor Author

bradfitz commented Sep 7, 2016

Now that we have Mac capacity again, I'm going to re-enable OS X trybot sharding and see what happens here.

/cc @josharian

gopherbot pushed a commit to golang/build that referenced this issue Sep 7, 2016
Ignore the old darwin-{amd64,386}-10_10 builders. Don't give them an
error, but pretend they don't exist.

Also: switch trybots from OS X 10.10 to OS X 10.11, and re-enable
sharding. Let's hope for the best. See golang/go#12979.

This also enables subrepo tests for all OS X versions.

darwin-386-* is currently offline, pending some golang/go#17009

Updates golang/go#9495 (OS X virtualization)

Change-Id: I4d53a79087404b5e8051d1aff0c668a92625f442
Reviewed-on: https://go-review.googlesource.com/28583
Reviewed-by: Brad Fitzpatrick <[email protected]>
@bradfitz
Copy link
Contributor Author

2 weeks and seems fine. Closing.

@golang golang locked and limited conversation to collaborators Sep 21, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

7 participants