builders: run Plan 9 builders on GCE #9491
@0intro, the Plan 9 VM comes up on GCE and I see it say:
So it appears to be running, but the networking isn't working. The OpenBSD builder used this exact same config and worked fine, so I think this is a Plan 9 issue.
Yes, it seems that the Plan 9 networking is up, but only for the external interface, not the internal interface. Inspecting it from a VM on the same VLAN:
Yes, I haven't configured the second interface. You should try adding the following lines to the termrc:
I've updated the script in CL 2251.
Didn't work:
I'm quite surprised. It works fine with two network controllers in QEMU.
I'll try on GCE and investigate further. Brad, is there anything notable in the GCE console output?
Sorry, I don't have the output anymore after I deleted the instance. I'm now working on Windows. :( If you can test on GCE, that'd be great.
Don't worry, I'll investigate on my side.
Just to be sure: in your network configuration, are there two networks? In your example, are 23.236.55.181 and 10.240.80.201 both configured on the same machine? I'm not sure I understand the topology of your network.
The default GCE network, whatever that is. When I go to console.developers.google.com and then my project, then "Compute", then "Networks", in "All networks" I see only 1 network, named "default", with Address 10.240.0.0/16 and gateway 10.240.0.1. Despite that, Linux and OpenBSD and Windows all let me use the machine's internal (10.*) address, or its public internet address.
OK, so there is only one network controller and one address. It looks like a routing issue on my side. This is surprising.
I can reproduce the issue on my side. The machine is accessible externally. The route table looks like:
There is a default route via 10.240.0.1 and a local route. I'll investigate further tomorrow.
I had to remove the local route to make it work. Apparently, they expect us to pass through 10.240.0.1 to reach the other instances.
They use the same configuration on the Linux instances (no local route):
Brad, I've updated the make script in CL 2251, so it should be fine now.
Now it gets further (I could get the buildlet running and the coordinator doing builds!), but outgoing DNS doesn't appear to work (the http://godoc.org/google.golang.org/cloud/compute/metadata#OnGCE function reports false, but it should return true). Once I hard-coded that to return true on plan9, I then get:
And more:
@0intro, are you able to run Go's tests?
... because keep in mind that I'm getting the results above via the buildlet.
Yes, I figured out what the issue was. The buildlet was started before the machine was properly configured. I'll fix the make script once I'm back home.
It takes the build a couple minutes to hit its first error. Does the buildlet inherit the bad environment from the moment it starts? i.e. is the network configuration part of the environment, or is it global across the whole machine? Because if it's global (like Linux, etc), then by the time the buildlet is actually running tests, the network should already be fixed by your start-up scripts.
The loopback is set up in the startup scripts. I'm not sure about the DNS issue, however: I call the DNS and it works on my side. However, when connected on my Plan 9 instance on GCE, it behaves differently.
Please give this script a try and let me know: http://www.9legacy.org/9legacy/doc/gce/make.bash. I'll update the CL later.
I've updated the make script in CL 2251.
The networking issues all seem fixed now, but all.rc still fails with a hang in the runtime tests. Looks like it's blocking forever waiting on a Read system call, or waiting for a process to end: http://build.golang.org/log/917ac618c9c8698a4ce9cd2897ee62e44577ccbd Have you successfully run all.rc on GCE on the /tmp filesystem? Is that a different filesystem implementation than in other places?
I've experienced a similar issue when running all.rc in GCE. In this configuration, /tmp is running Fossil, like the rest of the file system. I wonder if something is really blocked or if the test is just slow. I'll investigate. By the way, is the OnGCE() function working now?
Yes, it seems like it is.
The runtime test passes in my QEMU, but not in my g1-small GCE instance. Brad, what kind of GCE instance are you running?
n1-highcpu-2 now. Previously (when the net tests were failing) I was using n1-highcpu-4.
Yes, bug369 is failing. It also fails on the old builders. This is issue #9428. I wouldn't say I got better performance. It's astoundingly slow on GCE.
The Plan 9 port on GCE is very new. I'm sure there is still room for improvement. Now that the Plan 9 trybot is working (even partially), I think it allows us to iterate more quickly.
8m was indeed a bit short to run the parallel runtime test.
Just for the record, I ran all.rc in an n1-highcpu-2 instance last night.
For comparison, on the 386 hardware builder (not using ramfs):
And, the same disk image as on GCE, but running on QEMU on my laptop (using ramfs):
Here is the full output on n1-highcpu-2:
Can we get an amd64 builder on GCE too now?
Yes, I will try to get two Plan 9 amd64 kernels working:
I'll start with 9k. However, there is an issue that currently prevents it from working. Then, I'll do pc64. This is the same kernel as the 386 one.
This resulted in our first "ok" on the dashboard for Plan 9 with the buildlet, in 19 minutes. It only runs the std tests, and nothing else after that. Update golang/go#9491 Change-Id: Iad77a594f83bfd3fa72596bcc3057645d9c9bb4c Reviewed-on: https://go-review.googlesource.com/2523 Reviewed-by: Andrew Gerrand <[email protected]>
... because it's not running all the tests. Updates golang/go#9491 Change-Id: I2f3e8d1c2cba1b014d59cd3adfe5e04bd5f74dae Reviewed-on: https://go-review.googlesource.com/2524 Reviewed-by: Andrew Gerrand <[email protected]>
Update golang/go#9491 Change-Id: I219e2e071c0f58bf8c2b69c57b96a9114773c7b7 Reviewed-on: https://go-review.googlesource.com/2251 Reviewed-by: Brad Fitzpatrick <[email protected]>
I can't fix the Plan 9 builders on GCE because I can't get build artifacts out of them due to Plan 9's networking and/or MTU problems. Until I can get a snapshot of the Go 1.4 tree out of the Plan 9 builder, I will be disabling the Plan 9 builder. Please let me know when the Plan 9 networking problems are resolved. That is, let me know when you can download a 2KB+ file from Plan 9 on GCE to another machine (on GCE or outside, but ideally both).
What really puzzles me is that I don't have any issue downloading files into the VM:

% hget http://plan9.bell-labs.com/plan9/download/plan9.iso.bz2 > plan9.iso.bz2

The issue I encountered is when I try to download a file located on the Plan 9 machine.
I think MTU path discovery is why you can download files into your GCE VM. The sending side (bell-labs.com) knows what size to send you. Maybe GCE replies with the ICMP responses to tell it the proper sizes. But when your VM tries to send packets, if it writes one too large, it disappears. The Linux VM gets the MTU of 1460 via a DHCP extension. Plan 9's DHCP doesn't do that. But you said you tweaked another host's MTU to make it work? That doesn't scale. :) We have to figure out how to tell Plan 9 to write the correct size IP packets. Also, that doesn't even sound like it works: I failed to download a file from Plan9->Linux when both were on GCE, with the Linux MTU set to 1460 and Plan 9 also thought it was 1460. I'm increasingly thinking the MTU isn't the problem by itself and Plan 9 has some other networking issues happening on GCE.
Update! We tweaked Plan 9's TCP stack's DEF_MSS down 40 bytes to 1420 and rebuilt the kernel and things work now. @0intro will now send a CL updating the kernel image. |
In this new image, we changed the kernel to decrease the TCP MSS from 1460 to 1420 to take into account GCE overhead and prevent IP fragmentation. Also change make.bash to set MTU to 1460 during startup. Thanks to Brad for the help debugging. Update golang/go#9491 Change-Id: I3261fc85e8324d56bc1aa0180e2a40a969137eef Reviewed-on: https://go-review.googlesource.com/3161 Reviewed-by: Brad Fitzpatrick <[email protected]>
We counted our chickens before they hatched and sold the bear's skin before killing it. It still seems flaky and hangs. I can't get the 30 MB Go 1.4 tar.gz out of it.
If you guys are still facing path MTU discovery issues, it might not be a simple path MTU issue at the IP layer but a packetization-layer (in this case TCP) path MTU discovery issue. Is Plan 9 able to clamp/tweak/adjust outgoing TCP MSS options in SYN segments? Does Plan 9 support TCP options such as MSS/WScale/SACK? Perhaps you need to count the size of TCP options. What happens if Plan 9's DEF_MSS takes a value between 1280 and 1400?
Update golang/go#9491 Update golang/go#8639 Change-Id: I1f43c751777f10a6d5870ca9bbb8c2a3430189bf Reviewed-on: https://go-review.googlesource.com/3170 Reviewed-by: Andrew Gerrand <[email protected]>
;) If that were true, we couldn't traverse IP islands using IPv6. Yup, encapsulation and/or translation gear does IP-layer packet fragmentation and reassembly even on the IPv6 packet forwarding plane.
This new image is based on the latest Plan 9 CD image (2015-01-10) and includes two fixes. The ethervirtio driver was fixed to prevent 10 trailing bytes in received frames. The TCP stack was fixed to calculate the MSS properly in incoming connections. Update golang/go#9491 Change-Id: Ic42378ffbd5fc43e664632b2c5923b84aac5f639 Reviewed-on: https://go-review.googlesource.com/3280 Reviewed-by: Brad Fitzpatrick <[email protected]>
I think we can close this now. Things don't run super quickly, but they work now. Thanks to @0intro for all the work!
Thanks @bradfitz for the help debugging.
This new image provides a new tool called aux/randfs. This is a pseudo-random file system which can be mounted on top of /dev/random. We noticed the Go binaries on GCE were running much slower than usual. This happened because the runtime was regularly waiting on /dev/random, which wasn't able to produce random bytes as fast as required. A simple workaround was to write a pseudo-random file system, acting as /dev/random and initialized from a true random source. It is able to produce many more bytes per second than the random generator provided by the kernel. Update golang/go#9491 Change-Id: I81158adef7332d10295c2245b4564ae18ebb3a14 Reviewed-on: https://go-review.googlesource.com/6170 Reviewed-by: Brad Fitzpatrick <[email protected]>
This bug is about running the Plan 9 builders on GCE with the buildlet.