Skip to content
This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

bugfix: the dfget will always receive fail result when change supernode #752

Merged
merged 1 commit into from
Aug 1, 2019

Conversation

wybzju
Copy link
Contributor

@wybzju wybzju commented Jul 31, 2019

Signed-off-by: wybzju [email protected]

Ⅰ. Describe what this PR did

1 there have two supernode A and B
2 dfget registers to A and download some piece from A
3 suddenly, the A down
4 dfget will get error when pull piece task from A, then dfget register to node B
5 it will pull piece task again, but the item has dstCID. However, B doesn't have the task, and can't get dfgettask by dstCID and taskID. Then B will return err. Finally, dfget will receive error again

Ⅱ. Does this pull request fix one issue?

Ⅲ. Why don't you add test cases (unit test/integration test)? (你真的觉得不需要加测试吗?)

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

@pouchrobot pouchrobot added kind/bug This is bug report for project size/XS labels Jul 31, 2019
@pouchrobot
Copy link
Collaborator

We found this is your first time to contribute to Dragonfly, @wybzju
👏 We really appreciate it.
Just remind that you have read the contribution guide: https://github.com/dragonflyoss/Dragonfly/blob/master/CONTRIBUTING.md
If you didn't, you should do that first. If done, welcome again and please enjoy hacking! 🍻

@codecov-io
Copy link

codecov-io commented Jul 31, 2019

Codecov Report

Merging #752 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #752      +/-   ##
==========================================
- Coverage   46.55%   46.53%   -0.02%     
==========================================
  Files         113      113              
  Lines        6352     6354       +2     
==========================================
  Hits         2957     2957              
- Misses       3156     3158       +2     
  Partials      239      239
Impacted Files Coverage Δ
supernode/server/0.3_bridge.go 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c86ccf5...fed30ad. Read the comment docs.

@yunfeiyanggzq
Copy link
Member

Signed-off-by: wybzju [email protected]

Ⅰ. Describe what this PR did

1 there have two supernode A and B
2 dfget registers to A and download some piece from A
3 suddenly, the A down
4 dfget will get error when pull piece task from A, then dfget register to node B
5 it will pull piece task again, but the item has dstCID. However, B doesn't have the task, and can't get dfgettask by dstCID and taskID. Then B will return err. Finally, dfget will receive error again

Ⅱ. Does this pull request fix one issue?

Ⅲ. Why don't you add test cases (unit test/integration test)? (你真的觉得不需要加测试吗?)

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

could you put the error in github

@yunfeiyanggzq
Copy link
Member

#682 does the error is the same as the second bug in this pr

@@ -121,9 +121,11 @@ func (s *Server) pullPieceTask(ctx context.Context, rw http.ResponseWriter, req
if !cutil.IsEmptyStr(dstCID) {
dstDfgetTask, err := s.DfgetTaskMgr.Get(ctx, dstCID, taskID)
if err != nil {
return err
logrus.Warnf("srcCID:%s, failed to get dfget task by dstCID(%s) and taskID(%s), err: %v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please output the log as the following format: logrus.Warnf("failed to ........")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be better

if err!=nil{
     logrus.Warnf("failed to ......")
     return  err
}
request.DstPID=.....

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the end,I want you to show me the detail error log.I recently work on this part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@wybzju
Copy link
Contributor Author

wybzju commented Jul 31, 2019

@yunfeiyangbuaa It is not the same with #682.
In pr #682, when supernode return err, dfget will panic. because res is not nil, but the BaseResponse is nil. res == nil || (res.Code != constants.CodePeerContinue so, here will panic.
And I fix it by using err != nil || res == nil || (res.Code != constants.CodePeerContinue

In this pr, if the pr #682 is merged. when supernode down, dfget will register to another supernode. However, anther supernode don't have the dst dfgettask, it will return date not found err. The dfget will receive error again

@yunfeiyanggzq
Copy link
Member

@yunfeiyangbuaa It is not the same with #682.
In pr #682, when supernode return err, dfget will panic. because res is not nil, but the BaseResponse is nil. res == nil || (res.Code != constants.CodePeerContinue so, here will panic.
And I fix it by using err != nil || res == nil || (res.Code != constants.CodePeerContinue

In this pr, if the pr #682 is merged. when supernode down, dfget will register to another supernode. However, anther supernode don't have the dst dfgettask, it will return date not found err. The dfget will receive error again

I get what do you mean,so it means the multiple supernodes pattern can't work in fact?

@wybzju
Copy link
Contributor Author

wybzju commented Jul 31, 2019

@yunfeiyangbuaa yes, you are right, but it can be fixed.
here is the log

2019-07-31 13:46:07.439 ERRO sign:2028-1564551966.183 : pull piece task error: dial tcp4 supernodeIP:8002: connect: connection refused
2019-07-31 13:46:07.439 ERRO sign:2028-1564551966.183 : pull piece task fail:{} and will migrate
2019-07-31 13:46:07.439 INFO sign:2028-1564551966.183 : do register to one of [supernodeIP:8002 supernodeIP:8002 supernodeIP:8002 supernodeIP:8002 supernodeIP:8002]
2019-07-31 13:46:07.441 INFO sign:2028-1564551966.183 : do register to supernodeIP:8002, res:{"code":200,"msg":"success","data":{"taskId":"taskID","fileLength":length,"pieceSize":4194304}} error:<nil>
2019-07-31 13:46:07.441 INFO sign:2028-1564551966.183 : do register result:{"code":200,"msg":"success","data":{"taskId":"taskID","fileLength":length,"pieceSize":4194304}} and cost:0.001s
2019-07-31 13:46:07.441 ERRO sign:2028-1564551966.183 : pull piece task error: 500:{"message":"{\"Code\":502,\"Msg\":\"key (dstCID@taskID): {\"Code\":0,\"Msg\":\"data not found\"}\"}"}
2019-07-31 13:46:07.441 ERRO sign:2028-1564551966.183 : pull piece task fail:{} and will migrate

then the dfget will always receive the data not found error until try all supernode

1 some dfclients start to pull images
2 stop the corresponding supernode
3 you can see above log

@yunfeiyanggzq
Copy link
Member

Thanks!

@wybzju wybzju force-pushed the supernode_bugfix branch from 26ef590 to 4847085 Compare July 31, 2019 06:11
@wybzju
Copy link
Contributor Author

wybzju commented Jul 31, 2019

@yunfeiyangbuaa what's more, you need to fix the panic problem first. Or you won't see above error, because the dfget will panic and not register to another supernode

@yunfeiyanggzq
Copy link
Member

yunfeiyanggzq commented Jul 31, 2019

@yunfeiyangbuaa what's more, you need to fix the panic problem first. Or you won't see above error, because the dfget will panic and not register to another supernode

I have make the res be nil.but I think the team should do more with the multiple supernodes pattern.because dfget can register to another supernode but can't download file in fact.I and you just do with the err.WDYT? @lowzj

@@ -121,9 +121,11 @@ func (s *Server) pullPieceTask(ctx context.Context, rw http.ResponseWriter, req
if !cutil.IsEmptyStr(dstCID) {
dstDfgetTask, err := s.DfgetTaskMgr.Get(ctx, dstCID, taskID)
if err != nil {
return err
logrus.Warnf("failed to get dfget task by dstCID(%s) and taskID(%s), and the srcCID is %s, err: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not a good idea to ignore the error here. Instead, maybe we can delete the dstCID after migrate for dfget.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so. Here, dstCID is used to obtain the corresponding peerID, and then the state of the corresponding peerID is updated. If there is no corresponding dfgetTask at suopernode, then there is no need to update the state of the peerID. So you can continue to execute.
Of course, dfget can delete the dstCID and what's more, it need to delete dstCID for all items which are stored in the queue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I agree with you. And I think we should speed ​​up the progress of HA. @yunfeiyangbuaa 😄

@starnop
Copy link
Contributor

starnop commented Jul 31, 2019

LGTM.

@wybzju wybzju force-pushed the supernode_bugfix branch from 4847085 to fed30ad Compare July 31, 2019 14:23
@allencloud allencloud merged commit fd9988a into dragonflyoss:master Aug 1, 2019
starnop pushed a commit to starnop/Dragonfly that referenced this pull request Nov 27, 2019
bugfix: the dfget will always receive fail result when change supernode
inoc603 pushed a commit to inoc603/Dragonfly that referenced this pull request Dec 23, 2019
bugfix: the dfget will always receive fail result when change supernode
sungjunyoung pushed a commit to sungjunyoung/Dragonfly that referenced this pull request May 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug This is bug report for project size/XS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants