Afanasy Crashing #579

Open
eberrippe opened this issue May 11, 2023 · 3 comments

Comments

@eberrippe

Hi @timurhai
About every second day we encounter a small Afanasy crash. We have tried to figure out the cause but have not found anything yet. Maybe you have an idea of what we can do to solve it. These are the logs of the most common errors we see.

Fri 05 May 18:13.07: Job registered: yyyy"[3554]: xxxx@xxxxx[20] - 11402 bytes.
Fri 05 May 18:13.55: ERROR   EPOLLERR: aereaaaaSFD:20 S:Processing REQ: TJSON[2182]: 123412
SIG INT
Fri 05 May 19:56.02: ERROR   reconnectTask: numtask >= numTasks ( 20 >= 1 )
Fri 05 May 19:56.02: Render:  yyyx@yyyy[37] unix linux 1234 ON 
Fri 05 May 19:56.03: WARNING Client has NOT closed socket first:1234 SFD:19 S:SWaiting REQ: TJSON[215]: 2324424 ANS: TJSON[185308]: Empty address
Fri 05 May 19:56.07: Render: "yyyyyy" - ZOMBIETIME
Fri 05 May 19:56.07: Render Offline: yyyyy@yyyyyy[913] unix linux 123412412 off
Fri 05 May 19:56.09: WARNING Client has NOT closed socket first: 1234124123 SFD:729 S:SWaiting REQ: TJSON[37]: 1234 ANS: TJSON[23792778]: Empty address
Fri 05 May 19:56.09: WARNING Client has NOT closed socket first: 1234123 SFD:13 S:SWaiting REQ: TJSON[37]: 12314123 ANS: TJSON[23792778]: Empty address
corrupted size vs. prev_size
Tue 09 May 16:04.28: Deleting a job: "xxxxxxa"[2120]: yyyy@yyyyy[0] - 68317 bytes.
Tue 09 May 16:04.28: Deleting a job: "xxxxx"[4658]: yyyy@yyyyyyy[1] - 39174 bytes.
malloc(): invalid size (unsorted)
Tue 09 May 16:28.57: Job registered: yxxxyxyxy"[4097]: xxxxx@yyyyyyyy[1148] - 8472 bytes.
Tue 09 May 16:28.57: Job registered: "xxxxxxx"[4098]: yyyy@yyyyyy.[2] - 9550 bytes.
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
Wed 10 May 19:20.11: Deleting a job: xxxx: xxx@yyyy[992] - 8596 bytes.
Wed 10 May 19:20.11: Deleting a job: "xxxx"[5362]: xyyy@yyyy[1101] - 8489 bytes.
ERROR Wed 10 May 19:20.23: Online render with the same name exists:
New render:
 yyyyy@yyyyy[0] unix linux 12341 ON 
Existing render:
 yyyyyx@yyyyy[425] unix linux 1234 ON  P
Wed 10 May 19:20.32: WARNING Client has NOT closed socket first: 123 SFD:6 S:SWaiting REQ: TJSON[117]: 123 ANS: TJSON[143406]: Empty address
Wed 10 May 19:20.48: Render Offline:  yyyyy.@yyyyy[344] unix linux 23456 off
Wed 10 May 19:20.55: Job registered: "xxxxx"[569]: yyyyr@yyyyyy[8] - 5384 bytes.
malloc(): invalid size (unsorted)
Thu 11 May 12:14.08: WARNING Client has NOT closed socket first: 123 SFD:930 S:SWaiting REQ: TJSON[37]: 123:123ANS: TJSON[20969412]: Empty address
Thu 11 May 12:14.09: WARNING Client has NOT closed socket first: 123 SFD:20 S:SWaiting REQ: TJSON[37]: 123 ANS: TJSON[20969412]: Empty address
Thu 11 May 12:14.12: ERROR   reconnectTask: numblock >= blocksnum ( 1 >= 1 )
Thu 11 May 12:14.12: Render:  yyyyyx@yyyyy[27] unix linux 1234 ON 
ERROR Thu 11 May 12:14.26: Online render with the same name exists:
New render:
 tttttt@yyyyyy[0] unix linux 2345 ON 
Existing render:
 tttttttt@yyyyy[425] unix linux 2345 ON  P
corrupted size vs. prev_size

Thanks a lot and best

Jan

@timurhai
Member

Hello!
A very strange log. I have not seen such errors before.
So, you can't reproduce the bug?
What is the version and OS, and how many clients do you have?

Do all clients fail to close the socket, or just some? Try to find the "bad" clients.
Maybe some web browser is not closing the socket.
Try not to use the WebGUI at all.

@eberrippe
Author

Sadly, I cannot reproduce the bug intentionally.
We run afserver on:

NAME="AlmaLinux"
VERSION="9.1 (Lime Lynx)"
ID="almalinux"

We have a total of 951 hosts.

How can I find out the socket state? We only use the WebGUI very occasionally. What about the WebGUI causes problems? We developed some connections to the afserver ourselves. Is there anything you can advise us to keep in mind when doing so?

Thanks
Jan

@timurhai
Member

If you connect to afserver, your client should be the first to close the socket, once it has received the server's answer.

https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait
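
For a self-written client, "close the socket first" could look roughly like the sketch below. It is a generic TCP example, not Afanasy's actual API or wire format: the function name `request_and_close_first` and the `is_complete` callback are hypothetical, and the caller has to supply whatever rule tells it that the full answer has arrived (for example a length taken from the message header). The point is only that the client closes as soon as it has the whole answer, so the TIME-WAIT state lands on the client side instead of piling up on the server.

```python
import socket


def request_and_close_first(host, port, payload, is_complete, timeout=10.0):
    """Send one request, read the full answer, then close our side first.

    is_complete(buffer) is a caller-supplied (hypothetical) predicate that
    returns True once the buffer holds a complete answer.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(payload)
        answer = bytearray()
        while not is_complete(bytes(answer)):
            data = sock.recv(65536)
            if not data:
                # The server closed before we saw a complete answer.
                break
            answer.extend(data)
        return bytes(answer)
    finally:
        # Closing here, without waiting for the server to hang up,
        # makes this client the side that closes first.
        sock.close()
```

A client that instead keeps reading until `recv()` returns an empty result is waiting for the server to close first, which matches the "Client has NOT closed socket first" warnings in the log.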

Web browsers sometimes do not close sockets.
With such a large number of clients, try not to use the WebGUI at all.
(Somebody can open it just to check something, then forget to close it, and it will produce TIME-WAIT sockets periodically. But maybe this is not your case at all.)
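
On the question of how to inspect socket state: on a Linux server, one option is to count TCP states per local port from `/proc/net/tcp`. The sketch below is only an illustration. The helper name `count_states` is made up, port 51000 is assumed to be the afserver port (check your own configuration), and IPv6 connections are listed separately in `/proc/net/tcp6`.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_port):
    """Count sockets per TCP state for one local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is "HEXIP:HEXPORT" of the local end.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        # 51000 is an assumed server port, not taken from this thread.
        print(count_states(f.read(), 51000))
```

A steadily growing `TIME_WAIT` count on the server port would suggest that clients are letting the server close connections first.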
