Panic in binary replication #2978

Closed
carpawell opened this issue Oct 23, 2024 · 4 comments · Fixed by #3027
Labels: bug (Something isn't working), I4 (No visible changes), S4 (Routine), U2 (Seriously planned)

Comments

@carpawell
Member

Oct 23 14:47:46 metis3 neofs-node[2911]: 2024/10/23 14:47:46.517936 [ants]: worker exits from panic: runtime error: slice bounds out of range [109:0]
Oct 23 14:47:46 metis3 neofs-node[2911]: goroutine 340953791 [running]:
Oct 23 14:47:46 metis3 neofs-node[2911]: runtime/debug.Stack()
Oct 23 14:47:46 metis3 neofs-node[2911]:         runtime/debug/stack.go:24 +0x5e
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/panjf2000/ants/[email protected]/worker.go:56 +0x85
Oct 23 14:47:46 metis3 neofs-node[2911]: panic({0x1102a00?, 0xc02d60d9c8?})
Oct 23 14:47:46 metis3 neofs-node[2911]:         runtime/panic.go:770 +0x132
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/nspcc-dev/neofs-node/pkg/services/object/put.putObjectLocally({0x13fc108, 0xc00033e930}, 0xc072bb18c0, {0x12827d10?, {0x0?, 0xc00d581d30?, 0xc>
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/nspcc-dev/neofs-node/pkg/services/object/put/local.go:80 +0x34d
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/nspcc-dev/neofs-node/pkg/services/object/put.(*localTarget).Close(0xc0331a59d0)
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/nspcc-dev/neofs-node/pkg/services/object/put/local.go:51 +0x4f
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/nspcc-dev/neofs-node/pkg/services/object/put.(*distributedTarget).sendObject(0xc03357a800, {0x1, {{0xc032dfb7b0, 0x1, 0x1}, {0x0, 0x0, 0x0}, {>
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/nspcc-dev/neofs-node/pkg/services/object/put/distributed.go:208 +0x162
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/nspcc-dev/neofs-node/pkg/services/object/put.(*distributedTarget).iteratePlacement.func1()
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/nspcc-dev/neofs-node/pkg/services/object/put/distributed.go:250 +0x11c
Oct 23 14:47:46 metis3 neofs-node[2911]: github.com/panjf2000/ants/v2.(*goWorker).run.func1()
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/panjf2000/ants/[email protected]/worker.go:67 +0x8d
Oct 23 14:47:46 metis3 neofs-node[2911]: created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 340954422
Oct 23 14:47:46 metis3 neofs-node[2911]:         github.com/panjf2000/ants/[email protected]/worker.go:48 +0x5c

Expected Behavior

No panic

Current Behavior

Panic

Possible Solution

Do not do out of range?

Steps to Reproduce (for bugs)

Not sure. It shows up under the kind of high load with blocks that @AnnaShaleva and @AliceInHunterland usually generate.

Context

Loading neo-go blocks to NeoFS.

Your Environment

NeoFS Storage node
Version: 0.43.0
GoVersion: go1.22.6

@carpawell
Member Author

The panic was always preceded by this log line: util/log.go:19 could not push task to worker pool {"request": "PUT", "error": "too many goroutines blocked on submit or Nonblocking is set"}

@roman-khimov roman-khimov added bug Something isn't working U2 Seriously planned S4 Routine I4 No visible changes labels Oct 23, 2024
@roman-khimov roman-khimov added this to the v0.44.0 milestone Oct 23, 2024
@roman-khimov
Member

  1. We have a lot (thousands) of these on all machines.
  2. Happens during load.
  3. Not related to network map changes.
  4. Offset is always 109:0.
  5. Tends to come in series of 3-5 events at sub-second level on a single machine.

@carpawell
Member Author

@roman-khimov, I think we found the root cause of it? We also know that it will not be possible in 0.44.

@roman-khimov
Member

It will be possible if you don't fix it. master can still have this problem.

carpawell added a commit that referenced this issue Nov 23, 2024
If ants pool is busy and cannot take task, early `return` without `wg.Wait()`
leads to `iterateNodesForObject`'s `return` and all the buffers for binary
replication from now may be reused while are still in use by the other routines
inside the pool. Wait for WG and try other nodes more instead, it also can
increase the rate of successful PUTs at high loads. Closes #2978.

Signed-off-by: Pavel Karpy <[email protected]>
carpawell added a commit that referenced this issue Nov 25, 2024
If ants pool is busy and cannot take task, early `return` without `wg.Wait()`
leads to `iterateNodesForObject`'s `return` and all the buffers for binary
replication from now may be reused while are still in use by the other routines
inside the pool. Wait for WG before any `return` is called. Closes #2978, #2988,
#2975, #2971.

Signed-off-by: Pavel Karpy <[email protected]>
carpawell added a commit that referenced this issue Nov 25, 2024
If ants pool is busy and cannot take task, early `return` without `wg.Wait()`
leads to `iterateNodesForObject`'s `return` and all the buffers for binary
replication from now may be reused while are still in use by the other routines
inside the pool. Wait for WG before any `return` is called. Closes #2978,
closes #2988, closes #2975, closes #2971.

Signed-off-by: Pavel Karpy <[email protected]>
2 participants