Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket internally errors "/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out" #140

Open
Mehgugs opened this issue Jan 31, 2019 · 33 comments

Comments

@Mehgugs
Copy link
Contributor

Mehgugs commented Jan 31, 2019

This happens when receiving payloads from the websocket.

Shard-0 has stopped receiving: ("/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out\
stack traceback:\
\9[C]: in function 'assert'\
\9/usr/local/share/lua/5.3/http/websocket.lua:282: in function 'http.websocket.read_frame'\
\9/usr/local/share/lua/5.3/http/websocket.lua:394: in function </usr/local/share/lua/5.3/http/websocket.lua:386>\
\9[C]: in function 'xpcall'\
\9...s/marcus/documents/github/novus/lua/novus/shard/init.lua:183: in function 'novus.shard.messages'\
\9[C]: in function 'xpcall'\
\9.../marcus/documents/github/novus/lua/novus/client/init.lua:42: in function <.../marcus/documents/github/novus/lua/novus/client/init.lua:41>"

This happens regardless of if a timeout is passed to the socket.

The weird thing is when I request deflated messages (per message inflate/deflate inside my handling loop), I dont get this issue at all. The only time this happens is when I request raw json payloads. I'm using scm-0 from master (8ab5c30) to fix the timeout issues on the release rock.

This might be due to the payload size, heuristically I know it's the biggest payload I'm likely to receive causing this error, however this is not an issue with other clients for the api I'm connecting to.

@Mehgugs Mehgugs changed the title Websocket internally errors "/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out"" Websocket internally errors "/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out" Jan 31, 2019
@daurnimator
Copy link
Owner

Could you run with the environment variable SOCKET_DEBUG=1 set?
I'm wondering if your server is actually sending the data.

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Feb 2, 2019

With SOCKET_DEBUG=1 it works correctly, and logs the payloads I expect to see, it only errors without the debug variable set.

@daurnimator
Copy link
Owner

daurnimator commented Feb 2, 2019

it only errors without the debug variable set.

That's extremely odd. The debug variable should only add logging, not change any behaviour.
Maybe there is a race condition somewhere, and the extra time spent logging is changing the winner?

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Feb 2, 2019

That seems likely, is there a way for me to collect the debug data into a file?

@daurnimator
Copy link
Owner

is there a way for me to collect the debug data into a file?

Just pipe/redirect it?

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Feb 2, 2019

https://ghostbin.com/paste/a6czk You can see it cut off reading here. It is crashing consistently it seems upon further inspection. I think I jumped the gun a little when it managed to reconnect previously.

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Feb 2, 2019

I added a xpcall handler to inspect the internals here is an example of a frame which errors:

{
    "RSV2": false,
    "RSV3": false,
    "RSV1": false,
    "MASK": false,
    "length": 12964,
    "FIN": true,
    "opcode": 1
}

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Feb 21, 2019

Any idea what could cause this? When this happens my receive loop breaks down and I dont know if i can handle it without reconnecting.

@daurnimator
Copy link
Owner

Your pastebin link has now expired. Could you repost it? (consider using a gist?)

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Mar 8, 2019

I'll try get some more data (I've lost the old log). What seems to happen is that the websocket cannot read the frame in 0 timeout https://github.com/daurnimator/lua-http/blob/master/http/websocket.lua#L282 I've made this line retry after a cqueues.sleep() and it can take up to 5 attempts to read.

I should also note that I've not found any further clues on how to reproduce it aside from connecting without requesting compression. I think it's possibly some kind of race condition?

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Mar 10, 2019

Why should these reads be 0 timeout exactly? In practice this isn't readable without waiting.

if frame.MASK then
	local key = assert(sock:xread(4, "b", 0))
	frame.key = { key:byte(1, 4) }
end

do
	local data = assert(sock:xread(frame.length, "b", 0))
	if frame.MASK then
		frame.data = apply_mask(data, frame.key)
	else
		frame.data = data
	end
end

EDIT:

This is a very naive solution I have been using to try and get a handle on what happens:

local function should_retry(timer) if timer then return timer:status() == "pending" else return true end end

local function read_again(deadline, socket, ...)
    local tries = 0
    local data, msg, code = socket:xread(...)
    local timeout = deadline and deadline - monotime()
    local timer = timeout and promise.new(sleep, timeout)
    while should_retry(timer) and (not data or code == ce.ETIMEDOUT) do
        tries = tries + 1
        sleep()
        data, msg, code = socket:xread(...)
    end
    used_tries = max(used_tries, tries)
    return data, msg, code
end

Calling this function on the socket in place of the xreads and logging the used_tries :

websocket.read_again used 5324 retries.

This makes me think the read_frame is not consuming the deadline somewhere?

@daurnimator
Copy link
Owner

daurnimator commented Mar 10, 2019

Why should these reads be 0 timeout exactly? In practice this isn't readable without waiting.

There is a preceding :fill that will ensure that there are enough bytes waiting in the socket buffer.

When the :fill fails, we poll for readability, and then tail call ourselves to try again:

return read_frame(sock, deadline)

@daurnimator
Copy link
Owner

Are you still having problems?
If I'm unable to replicate your issue I'll have to close it.

@Mehgugs
Copy link
Contributor Author

Mehgugs commented May 8, 2019

Hello, im so sorry i've not been more proactive with this. I'll try and get a reproducible example to work, and as far as im aware yes there's still something strange happening (it is mostly mitiaged right now by requesting compressed messages from the websocket but it by no means fixed).

@Mehgugs
Copy link
Contributor Author

Mehgugs commented May 13, 2019

I think being sent a large enough message over websocket is enough to trigger this.
EDIT:

Mon May 13 14:32:22 2019 INF  uncompressed payload size 6675863 bytes

It is usually this payload which causes this error when it is received by my client without compression.

@daurnimator
Copy link
Owner

Could you please come up with a test case that I can run?

@deltanedas
Copy link

i also have this issue
With litcord run the sample
It will say * BOT NAME ready. then /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

@daurnimator
Copy link
Owner

With litcord run the sample
It will say * BOT NAME ready. then /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

Could you write a reduced test case?
Trying to run that example I just get "Authentication failed".

@deltanedas
Copy link

You need to get a bot token by making a discord application.

@deltanedas
Copy link

Very odd.
I tried it many times today and it usually times out but sometimes works.
Seems to be discord having a very low timeout.

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Aug 5, 2019

Someone more proficient at lua-http websockets might try setting up a simple server which sends a payload as big as what the server, discord in both cases here, sends.

@12Me21
Copy link

12Me21 commented Dec 19, 2019

I'm getting the same error
Seems to happen when there's a relatively large amount of data being received (in this case, frame.length was around 17KB)
changing the timeout from 0 to 1 on line 282 (local data = assert(sock:xread(frame.length, "b", 0)) -> local data = assert(sock:xread(frame.length, "b", 1)))
seems to fix it (but I'm not sure if that works all the time, and maybe it causes other problems)

@Kagamihara
Copy link

Got the same error
frame => { FIN = true, MASK = true, RSV1 = false, RSV2 = false, RSV3 = false, length = 1265, opcode = 1 }
assert (sock: xread (frame.length, "b", 0))
I got an error here.

The client version is RFC7692, is it irrelevant?

@cleoold
Copy link

cleoold commented May 24, 2020

I have the same problem on websocket.lua:282: read: Connection timed out. i run the receive loop inside a cqueue container.

@jprjr
Copy link

jprjr commented Dec 19, 2020

Hi there, I ran into this issue, here's a small test case to reproduce. I just use a public echo server to send a large-ish string of data (about 12kb) and check that I receive it.

local http_websocket = require'http.websocket'
local large_str = 'some text! '

for i=1,11 do
  large_str = large_str .. large_str
end

local client = http_websocket.new_from_uri('wss://echo.websocket.org:443')
client:connect()
client:send(large_str,'text')
local data, typ = client:receive()
assert(string.len(data) == string.len(large_str))

@sharpobject
Copy link

I have gained an interest in fixing this.

@S-S-X
Copy link

S-S-X commented May 10, 2021

Also been getting that exact same error, did not check that well yet but average response sizes are varying around 70kb but sometimes getting larger.
Not sure if it is just size here as I've not tested at all yet if I can consistently reproduce error with some specific response but thought I'll leave this here just for stats.

@luainin
Copy link

luainin commented Sep 21, 2021

Any updates on this? I get the same issue. The server is in fact sending the packets, but the websocket interface simply times out. However, if I step through this slowly while debugging, I actually get the expected, correct result which leads me to think that this is some sort of race condition.

This issue, like the others above, only happens with large-ish payloads.

Thanks

@Mehgugs
Copy link
Contributor Author

Mehgugs commented Sep 22, 2021

Would you be able to write some code that launches a websocket server and reproduces this issue using a sample packet? It's really unreliable for me -- sometimes it happens, sometimes it doesn't.

Any updates on this? I get the same issue. The server is in fact sending the packets, but the websocket interface simply times out. However, if I step through this slowly while debugging, I actually get the expected, correct result which leads me to think that this is some sort of race condition.

This issue, like the others above, only happens with large-ish payloads.

Thanks

@luainin
Copy link

luainin commented Sep 27, 2021

Sorry, just got back to this -- been pretty busy.
Yeah, I can try -- I think I can make something pretty simple ( a C# console application) to demonstrate this

@LBPHacker
Copy link

LBPHacker commented Jan 17, 2022

Isn't there a + frame.length missing from here?

fill_length = fill_length - 2

The 8-byte length case below does have it, and indeed, "big messages" over 64k are received properly.

Edit: fwiw, adding the suggested bit of code makes the issue go away for me.

@luainin
Copy link

luainin commented Jan 19, 2022

@LBPHacker
I think that fixed it for me.
Thank you so much!

@luainin
Copy link

luainin commented Jan 19, 2022

Can we make a new build/luarocks package with this change?
@daurnimator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests