storage: goroutine leak in newMultipartReader #1380
Comments
Thank you for reporting this issue @chessman, and welcome to the googleapis Go cloud project! Great details, and thank you for that backtrace. Many years ago I encountered similar problems with an unwieldy number of concurrent uploads that saturated the networks of users of https://github.com/odeke-em/drive. While I am still speculating, I suggest you might also want to set Writer.ChunkSize explicitly, as per https://godoc.org/cloud.google.com/go/storage#Writer.ChunkSize, to avoid sending all the bytes in a single request per upload, which will not be retried if it fails /cc @frankyn for advice.

Regarding the "memory" leak, could you please provide some more information? Also, do those goroutines finally disappear after some time? A period of 60 minutes suggests a max timeout for an upload was reached and a retry was started.

In https://gist.github.com/odeke-em/c51e2dd5979964cb19a6fddf2bd53976, or inlined below, I've composed a program that might help examine the concurrent uploads/sec and the bandwidth you might be experiencing.
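(For reference, a minimal sketch of what setting Writer.ChunkSize explicitly looks like; the bucket/object names, the data, and the 8 MiB value are placeholder assumptions, not from the original report.)

package main

import (
	"context"
	"io"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage.NewClient: %v", err)
	}
	defer client.Close()

	// Placeholder bucket/object names.
	w := client.Bucket("<your_bucket>").Object("example.txt").NewWriter(ctx)
	// Upload in 8 MiB chunks; with ChunkSize == 0 the object is sent in a
	// single request that is not retried on failure.
	w.ChunkSize = 8 << 20
	if _, err := io.Copy(w, strings.NewReader(strings.Repeat("A", 1<<20))); err != nil {
		log.Fatalf("upload failed: %v", err)
	}
	if err := w.Close(); err != nil {
		log.Fatalf("Writer.Close: %v", err)
	}
}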
Source code
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"math/rand"
	"strings"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("Failed to create storage client: %v", err)
	}
	defer client.Close()

	projectID := "<your_project_id>"         // TODO: Fill me in.
	bucket := client.Bucket("<your_bucket>") // TODO: Fill me in.
	// Ensure the bucket is created first; ignore the error if it already exists.
	_ = bucket.Create(ctx, projectID, nil)

	rng := rand.New(rand.NewSource(10))
	i := uint64(1)
	reporterCh := make(chan *report)
	go compileStats(reporterCh)

	// Limit the number of in-flight uploads to roughly n.
	n := 100
	sema := make(chan bool, n)
	for {
		size := int((0.5+rng.ExpFloat64())*500) * 1 << 10 // At least 250kB
		go func(id uint64, size int) {
			objectName := fmt.Sprintf("object-%d.txt", id)
			object := bucket.Object(objectName)
			w := object.NewWriter(ctx)
			defer func() {
				w.Close()
				<-sema // Release the semaphore slot.
			}()
			s := "A"
			if id%2 == 0 {
				s = "B"
			}
			n, _ := io.Copy(w, strings.NewReader(strings.Repeat(s, size)))
			reporterCh <- &report{size: n, id: id}
		}(i, size)
		sema <- true // Acquire a semaphore slot.
		i += 1
	}
}

type report struct {
	size int64
	id   uint64
}

func compileStats(reports <-chan *report) {
	i := uint64(0)
	startTime := time.Now()
	totalBytes := uint64(0)
	for report := range reports {
		i += 1
		secs := time.Since(startTime).Seconds()
		filesPerSec := float64(i) / secs
		totalBytes += uint64(report.size)
		mbPerSec := float64(totalBytes>>20) / secs
		fmt.Printf("%s: File#: %d Files/sec: %.2ffiles/s %.3fMbps\r", time.Since(startTime).Round(time.Millisecond), i, filesPerSec, mbPerSec)
	}
}
Thanks for the quick response @odeke-em. I've reproduced it with the test program. It was running for 15 hours on a freshly created instance within the same region. Speed:
Available memory decreased by 800 MB.
The test was running for a week. 1500 MB of memory is leaked. There are 132 media buffers:
Some goroutines have been hanging for too long:
You can find the goroutine and alloc dumps in the attachment:
My apologies for the late reply @chessman! There was an almost 2-week freeze in between, but I am back now. Let me take a look at your logs. However, goroutines getting stuck on a select is very odd/bizarre; perhaps this could be a scheduler problem.
cc @frankyn
I took some time to finish up my repro and my hypothesis: perhaps the HTTP/2 frontend is stalling reads, which would explain why there are goroutines "hanging forever". You can see it in https://gist.github.com/odeke-em/b62737b89b91e71ffbf0545581976cbc or inlined below:

package main

import (
	"bytes"
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
	"net/textproto"
	"net/url"
	"os"
	"syscall"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	cst := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Stalled server that takes forever to read or could be overloaded.
		// It already established a connection though.
		<-time.After(10 * time.Minute)
		slurp, _ := ioutil.ReadAll(r.Body)
		log.Printf("Request payload: %s\n", slurp)
		w.Write(bytes.Repeat([]byte("a"), 3000))
	}))
	if err := http2.ConfigureServer(cst.Config, nil); err != nil {
		log.Fatalf("http2.ConfigureServer: %v", err)
	}
	cst.StartTLS()
	defer cst.Close()

	tlsConfig := &tls.Config{InsecureSkipVerify: true}
	u, _ := url.Parse(cst.URL)
	tlsConn, err := tls.Dial("tcp", u.Host, tlsConfig)
	if err != nil {
		log.Fatalf("Failed to create a tls connection: %v", err)
	}

	prc, pwc := io.Pipe()
	go func() {
		h := make(textproto.MIMEHeader)
		mpw := multipart.NewWriter(pwc)
		w, err := mpw.CreatePart(h)
		if err != nil {
			mpw.Close()
			pwc.CloseWithError(fmt.Errorf("CreatePart failed: %v", err))
			return
		}
		n, _ := pwc.Write(bytes.Repeat([]byte("a"), 39<<20))
		println("read ", n)
		r := bytes.NewReader([]byte(`{"id": "1380", "type": "issue"}`))
		if _, err := io.Copy(w, r); err != nil {
			mpw.Close()
			pwc.CloseWithError(fmt.Errorf("Copy failed: %v", err))
			return
		}
		println("done read in goroutine")
		mpw.Close()
		pwc.Close()
	}()

	tr := &http2.Transport{TLSClientConfig: tlsConfig}
	cc, err := tr.NewClientConn(tlsConn)
	if err != nil {
		log.Fatalf("(*http2.Transport).NewClientConn: %v", err)
	}

	// Find our own process and in the background send ourselves SIGQUIT.
	selfProcess, err := os.FindProcess(os.Getpid())
	if err != nil {
		log.Fatalf("Failed to find own process: %v", err)
	}
	go func() {
		<-time.After(6 * time.Second)
		if err := selfProcess.Signal(syscall.SIGQUIT); err != nil {
			log.Fatalf("Failed to send self SIGQUIT: %v", err)
		}
	}()

	// Send that ping frame and ensure we have an established connection,
	// that the server is the one stalled, and that body reads are stalled.
	if err := cc.Ping(context.Background()); err != nil {
		log.Fatalf("(*http2.ClientConn).Ping: %v", err)
	}

	req, _ := http.NewRequest("GET", cst.URL, prc)
	res, err := cc.RoundTrip(req)
	if err != nil {
		log.Fatalf("http.Transport.Roundtrip error: %v", err)
	}
	defer res.Body.Close()
	blob, _ := ioutil.ReadAll(res.Body)
	log.Printf("%s\n", blob)
}

That produces a pretty much identical stack trace:

goroutine 51 [select]:
io.(*pipe).Write(0xc000166500, 0xc0001ae090, 0x42, 0x82, 0x0, 0x0, 0x0)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/io/pipe.go:87 +0x1dc
io.(*PipeWriter).Write(0xc000162020, 0xc0001ae090, 0x42, 0x82, 0x12f9620, 0x13476e0, 0x10cfc01)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/io/pipe.go:153 +0x4c
bytes.(*Buffer).WriteTo(0xc0000969c0, 0x13be9c0, 0xc000162020, 0x46350b0, 0xc0000969c0, 0x1)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/bytes/buffer.go:242 +0xb5
io.copyBuffer(0x13be9c0, 0xc000162020, 0x13be780, 0xc0000969c0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/io/io.go:384 +0x33f
io.Copy(...)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/io/io.go:364
mime/multipart.(*Writer).CreatePart(0xc000096990, 0xc000064e48, 0x0, 0x0, 0x0, 0x0)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/mime/multipart/writer.go:121 +0x3fa
main.main.func2(0xc000162020)
	/Users/emmanuelodeke/Desktop/openSrc/bugs/google-cloud-go/1380/main.go:51 +0x11f
created by main.main
	/Users/emmanuelodeke/Desktop/openSrc/bugs/google-cloud-go/1380/main.go:48 +0x3d3

which is pretty much identical to the trace in #1380 (comment), accounting for code changes between Go 1.11 and tip (the upcoming Go 1.13):

goroutine 15570455 [select, 794 minutes]:
io.(*pipe).Write(0xc0003e2000, 0xc015038aa0, 0x62, 0xa0, 0x0, 0x0, 0x0)
	/usr/lib/go-1.11/src/io/pipe.go:87 +0x1cc
io.(*PipeWriter).Write(0xc000310e78, 0xc015038aa0, 0x62, 0xa0, 0x915fa0, 0x99e520, 0xdf6a01)
	/usr/lib/go-1.11/src/io/pipe.go:153 +0x4c
bytes.(*Buffer).WriteTo(0xc000264770, 0xa3ce40, 0xc000310e78, 0x7f54ecae6f58, 0xc000264770, 0xc0418c1d01)
	/usr/lib/go-1.11/src/bytes/buffer.go:241 +0xb6
io.copyBuffer(0xa3ce40, 0xc000310e78, 0xa3c5a0, 0xc000264770, 0x0, 0x0, 0x0, 0x2, 0xc0418d4000, 0x0)
	/usr/lib/go-1.11/src/io/io.go:384 +0x352
io.Copy(0xa3ce40, 0xc000310e78, 0xa3c5a0, 0xc000264770, 0x0, 0x0, 0x0)
	/usr/lib/go-1.11/src/io/io.go:364 +0x5a
mime/multipart.(*Writer).CreatePart(0xc00496bf50, 0xc039af10e0, 0xc039af10e0, 0xa426e0, 0x9b7d79, 0x9a000)
	/usr/lib/go-1.11/src/mime/multipart/writer.go:115 +0x3e5

@frankyn and Google Cloud Storage team, let's also examine the server side. This could also be an old x/net/http2 bug, but the identical repro above hints that it might be a server issue.
Sorry for the late response.
I've tried it with Go 1.12.5 with the same result.
Project ID: datomia-2
I've just started the reproduction and will stop it tomorrow.
I'm using a VM instance on Compute Engine.
That's a huge amount of logs, and I don't know how to identify a failure in them.
cc @frankyn
Hi @odeke-em, IIUC, your current hypothesis is that the Google Frontend is stalling HTTP/2 communication due to flooding or similar issues, given the example provided in #issuecomment-493830375. Is your current ask to relay a question about limits to the Storage team for more input? I appreciate your patience, thank you!
@frankyn, yes indeed! In that example I simulated a stalled read, and when we get a core dump, we get an almost identical stack trace to the one seen when a read goroutine stalled in the original reporter's issue.
I wanted to pair up with you and try to figure out how/when the GFE could stall reads, and whether there are reported cases of this during flooding. Thank you for chiming in.
Following up through email.
@chessman, unfortunately this doesn't seem to be something this package can fix. A similar bug was reported in the net/http library that will hopefully be fixed for Go 1.14, and we are following up there: golang/go#29246 (comment). Please follow the Go net/http bug instead.
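(In the meantime, one possible client-side mitigation, offered here only as a sketch and not as an official recommendation from this package: give each upload its own context with a deadline so a stalled write is cancelled rather than hanging indefinitely. The helper name, bucket/object names, sizes, and timeout below are hypothetical.)

package main

import (
	"context"
	"io"
	"log"
	"strings"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithTimeout is a hypothetical helper: it bounds a single upload with a
// deadline; cancelling the Writer's context aborts the write.
func uploadWithTimeout(ctx context.Context, obj *storage.ObjectHandle, data io.Reader, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	w := obj.NewWriter(ctx)
	if _, err := io.Copy(w, data); err != nil {
		w.Close() // best effort; cancellation already aborts the upload
		return err
	}
	return w.Close()
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage.NewClient: %v", err)
	}
	defer client.Close()

	obj := client.Bucket("<your_bucket>").Object("object-1.txt") // placeholder names
	if err := uploadWithTimeout(ctx, obj, strings.NewReader(strings.Repeat("A", 500<<10)), 5*time.Minute); err != nil {
		log.Printf("upload failed or timed out: %v", err)
	}
}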
Client
We have 100 concurrent uploads; each object is about 500 KB.
Describe Your Environment
Go 1.11.4
CentOS 7
Expected Behavior
No leaked goroutines.
Actual Behavior
We observe that memory is leaking over time and there are stalled goroutines in pprof.
UPDATE: It seems that these goroutines appear every hour. For example, if the application has been running for 204 minutes, then there are stalled goroutines in pprof with durations of 24, 84, and 144 minutes.
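(For anyone reproducing this, a minimal sketch of how the goroutine and heap dumps above can be captured; the port and file names are arbitrary choices, not from the original report.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on localhost, then fetch a full goroutine dump with e.g.
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
	// and a heap profile with
	//   curl 'http://localhost:6060/debug/pprof/heap' > heap.pb.gz
	log.Println(http.ListenAndServe("localhost:6060", nil))
}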