Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object/put: fix concurrent PUT data corruption #3027

Merged
merged 3 commits into from
Nov 25, 2024

Conversation

carpawell
Copy link
Member

@carpawell carpawell commented Nov 23, 2024

If ants pool is busy and cannot take task, early return without wg.Wait()
leads to iterateNodesForObject's return and all the buffers for binary
replication from now may be reused while are still in use by the other routines
inside the pool. Wait for WG before any return is called. Closes #2978, #2988,
#2975, #2971.

@carpawell carpawell self-assigned this Nov 23, 2024
@carpawell carpawell force-pushed the fix/data-corruption-on-hight-load branch from 943eaaf to a63db8a Compare November 23, 2024 12:58
@carpawell
Copy link
Member Author

@roman-khimov, this is almost the same as in the 43-fix branch. However, this code is so different now. I am still not sure what should be done here in general. Suggestions are highly welcome.

@carpawell
Copy link
Member Author

Also we can try node one more time (j--).

Copy link
Member

@roman-khimov roman-khimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now let's just fix the bug without any behavioral changes. This code can and will be improved, but that's a bit different story.

pkg/services/object/put/distributed.go Outdated Show resolved Hide resolved
@roman-khimov
Copy link
Member

Also, be more aggressive with "close XXX", we have a number of issues here.

@carpawell carpawell force-pushed the fix/data-corruption-on-hight-load branch from a63db8a to 004fcf0 Compare November 25, 2024 15:11
@carpawell
Copy link
Member Author

@roman-khimov, added issues for my taste.

@carpawell carpawell force-pushed the fix/data-corruption-on-hight-load branch from 004fcf0 to ea3e03b Compare November 25, 2024 15:13
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 22.63%. Comparing base (339b4cb) to head (872baaf).
Report is 25 commits behind head on master.

Files with missing lines Patch % Lines
pkg/services/object/put/distributed.go 71.42% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3027      +/-   ##
==========================================
- Coverage   22.85%   22.63%   -0.23%     
==========================================
  Files         791      791              
  Lines       58734    58493     -241     
==========================================
- Hits        13425    13238     -187     
+ Misses      44412    44358      -54     
  Partials      897      897              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roman-khimov
Copy link
Member

Test needs to be fixed.

@carpawell carpawell marked this pull request as draft November 25, 2024 16:42
If ants pool is busy and cannot take task, early `return` without `wg.Wait()`
leads to `iterateNodesForObject`'s `return` and all the buffers for binary
replication from now may be reused while are still in use by the other routines
inside the pool. Wait for WG before any `return` is called. Closes #2978,
closes #2988, closes #2975, closes #2971.

Signed-off-by: Pavel Karpy <[email protected]>
If an object cannot be PUT due to local overload (i-th routine for (i-1)-length
worker pool), log the error and continue over other nodes, and even other
placement vectors. `errNotEnoughNodes` will be also returned as the natural
replication number handling in the outer `for`.

Signed-off-by: Pavel Karpy <[email protected]>
@carpawell carpawell force-pushed the fix/data-corruption-on-hight-load branch from 06efc8e to 872baaf Compare November 25, 2024 18:01
@carpawell
Copy link
Member Author

Also, added one more test.

@carpawell carpawell marked this pull request as ready for review November 25, 2024 18:21
@roman-khimov roman-khimov merged commit a6effcf into master Nov 25, 2024
21 of 22 checks passed
@roman-khimov roman-khimov deleted the fix/data-corruption-on-hight-load branch November 25, 2024 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic in binary replication
3 participants