-
I found that v1.0.0 MTProto.Stop() calls:

At the time, just before the restart, I had 127 dangling goroutines like the one below, all spawned within the last ~8h:
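For reference, this is roughly how such goroutine dumps can be collected with the standard runtime/pprof package. A minimal sketch, not the feed-master code:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// ... start the real application work here ...

	for range time.Tick(time.Minute) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
		// debug=1 aggregates identical stacks, so a leak shows up as one
		// stack with a large count (e.g. 127 copies of the same frame).
		if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 1); err != nil {
			log.Printf("goroutine profile: %v", err)
		}
	}
}
```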
-
@paskal Thanks a lot for the full explanation and this data, I really appreciate your work! It looks like there are more problems in the code than just messages getting stuck somewhere: we probably also have a channel leak, which looks catastrophic (that explains why our internal project required a forced restart after a few days of running). So I think most of our problems are in the TCP connection handling (if I understand correctly, literally all of them: panics, goroutine leaks, channel leaks, etc.), and that is a really important bug.

My idea for how to fix it: we need to split the core package (mtproto) into 3 separate submodules. A transport submodule will handle sending and receiving messages over different transports like TCP, WebSockets (as I started implementing for issue #78), HTTP, etc., because currently the TCP connection is the most unstable part of the whole package. A session submodule will handle all the session stuff, including the oldest fuckin bug (Durov brothers, I hate you for seqno, it is the worst way to guarantee that a message was received) about …

But, @paskal, please understand me: I currently don't have enough time for this project. Our team was split across other projects and, as you can see, only I am left to maintain mtproto, on top of a lot of my main work. So I need help with implementations and especially with docs, because the main reason no one except me has accepted maintainer status is that the implementation is too hard to understand. I would need to document it well, and, guess what, I don't have time to do that.
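To make the transport split more concrete, here is a rough sketch of what such a boundary could look like. The interface below is purely illustrative (none of these names exist in mtproto today); the idea is that TCP, WebSocket, and HTTP implementations would sit behind one small interface while the session layer stays transport-agnostic:

```go
// Hypothetical sketch of a transport boundary; none of these types exist in
// xelaj/mtproto today.
package transport

import "context"

// Transport hides the underlying connection (TCP, WebSocket, HTTP, ...).
type Transport interface {
	// Connect dials the server and must be safe to call again after Close,
	// so the session layer can reconnect without leaking goroutines.
	Connect(ctx context.Context, addr string) error
	// Send writes one framed MTProto message.
	Send(ctx context.Context, msg []byte) error
	// Recv blocks until the next framed message arrives or ctx is done.
	Recv(ctx context.Context) ([]byte, error)
	// Close tears the connection down and releases every reader goroutine.
	Close() error
}
```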
-
@paskal Even so, I will help you as much as I can, and I won't abandon this project for at least a year.
-
Hi there, first of all, thanks for making mtproto available in native Go!
I have an MR, umputun/feed-master#37, which is a real-life scenario that gets stuck after I run it. Here is the pprof output, and here is a stack trace of the running application.
Desired behaviour
When a new RSS item appears, the application should connect to the Telegram server (as there is no working reconnection in github.com/xelaj/mtproto v1.0.0, nor in v1.0.1 according to #79, and v1.0.1 also lost the ability to catch panics, so it crashes the whole application), upload the file, send it in a message with the media attached, disconnect from the server, and repeat the cycle for the next RSS item.

As an observer, I would see new Telegram messages appearing as new RSS items appear.
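Roughly, the per-item cycle I'm after looks like the sketch below. The tgClient type and its methods are placeholders for the feed-master side, not the real telegram.go or mtproto API:

```go
// Illustrative sketch of the intended per-item cycle; tgClient and its
// methods are placeholders, not the real feed-master or mtproto API.
package main

import "log"

type rssItem struct {
	AudioPath   string
	Description string
}

// tgClient stands in for whatever wraps the mtproto/telegram client.
type tgClient struct{}

func connectTelegram() (*tgClient, error)                   { return &tgClient{}, nil }
func (c *tgClient) Disconnect() error                       { return nil }
func (c *tgClient) UploadFile(path string) (string, error)  { return "file-ref", nil }
func (c *tgClient) SendMedia(fileRef, caption string) error { return nil }

// processItem does exactly one connect -> upload -> send -> disconnect cycle,
// so a failure on one item should never leave a half-open session behind.
func processItem(item rssItem) error {
	client, err := connectTelegram()
	if err != nil {
		return err
	}
	defer client.Disconnect()

	fileRef, err := client.UploadFile(item.AudioPath)
	if err != nil {
		return err
	}
	return client.SendMedia(fileRef, item.Description)
}

func main() {
	if err := processItem(rssItem{AudioPath: "episode.mp3", Description: "new episode"}); err != nil {
		log.Fatal(err)
	}
}
```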
Observed behaviour
When a new RSS item appears, with fairly high probability the application "chokes" on some item and stops responding. In the stack trace I see a lot of these:

As an observer, I see that a few messages pass through to Telegram, but afterwards all item-processing logic is stuck, likely within the Telegram message-posting logic (telegram.go), as that's the only place in the program where the logic was changed.
Probable cause
I don't know for sure yet (that is my question), but I suspect it might be the warning-channel handling code, which I had to write because without it the library throws panics I can't catch.
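For context, the pattern I mean is roughly the sketch below. The channel wiring is illustrative and simplified; the actual code is in the MR linked above:

```go
// Simplified sketch of the warning-drain pattern; the channel wiring is
// illustrative, the real code is in umputun/feed-master#37.
package main

import (
	"errors"
	"log"
	"time"
)

// drainWarnings keeps reading library warnings until the channel closes or
// the caller signals shutdown.
func drainWarnings(warnings <-chan error, done <-chan struct{}) {
	for {
		select {
		case err, ok := <-warnings:
			if !ok {
				return // channel closed by the library side
			}
			log.Printf("[WARN] mtproto: %v", err)
		case <-done:
			// My worry: once nothing reads this channel any more, the sender
			// may block forever, which would look exactly like the dangling
			// goroutines described above.
			return
		}
	}
}

func main() {
	warnings := make(chan error)
	done := make(chan struct{})
	go drainWarnings(warnings, done)

	warnings <- errors.New("simulated library warning")
	time.Sleep(100 * time.Millisecond)
	close(done)
}
```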
I would be extremely grateful if someone could help me with this issue and trace the origin of the stuck application, or of the goroutine leak if that's what I'm observing.