
Regarding the dswork framework. #1

Open
praveenkul opened this issue Aug 5, 2014 · 4 comments

@praveenkul

I was running the code provided with the Discriminative Mode Seeking paper and encountered an error in one of the log files. I was running with 6 workers. Most of the workers completed their jobs, but a few did not finish the jobs assigned to them and reported an error in their output log files. The last ten lines of one such log are given below; the error occurred while the function sampleRandomPatchesbb() was running. I have not been able to resolve it.

ans =

.ds.sample.initInds

ans =

1x1 struct array with no fields.

ismapreducer:1
Reference to non-existent field 'sample'.
file: [1x91 char]
name: 'dsmapredrun'
line: 5

file: [1x92 char]
name: 'dsmapreducer'
line: 316

file: [1x99 char]
name: 'dsmapreducerbarrier'
line: 10

file: [1x96 char]
name: 'dsmapreducerwrap'
line: 6

MATLAB:nonExistentField

checkpassed =

 1

Could you please help me resolve this issue?

Thanks
Praveen

@cdoersch
Owner

cdoersch commented Aug 5, 2014

Your workers are behaving as if they don't believe the file for ds.sample.initInds has actually been saved to disk. This information is supposed to travel from the master to the workers via the file [ds.sys.outdir 'ds/sys/distproc/savestate.mat']. If some of them got the message that the file is there and some of them didn't, there must be a synchronization issue; i.e. the failed workers are reading an old savestate.mat file.

Can you post the file [ds.sys.outdir 'ds/sys/distproc/savestate.mat']?

Also, can you describe the filesystem you're using?
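
In case it helps, here is one way to inspect that file from a MATLAB prompt. This is just a sketch: it assumes ds.sys.outdir is set in your session the same way dswork configured it, and it uses whos to list whatever variables the file contains rather than guessing their names.

% Inspection sketch; ds.sys.outdir is assumed to be the output root you configured.
statefile = [ds.sys.outdir 'ds/sys/distproc/savestate.mat'];
d = dir(statefile);                        % file metadata
fprintf('last modified: %s\n', d.date);    % how stale is the copy each worker sees?
whos('-file', statefile)                   % list the variables stored inside the .mat
state = load(statefile);                   % load everything for a closer look

A large gap between that timestamp and the time of the failed job would point to the stale-read problem described above.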

@praveenkul
Author

Sorry for the delay in replying, and thanks for the reply; I was on leave for a few days. I had kept the code on an nfs filesystem and the output on an xfs filesystem. As soon as you mentioned the filesystem, I moved everything onto the xfs filesystem, and now the code is running normally.

Thanks

@cdoersch
Owner

cdoersch commented Aug 8, 2014

That's odd. Moving the code to the xfs filesystem shouldn't have made any difference, since dswork doesn't write anything to the code directory. I have run dswork without any issues on systems where the code is on nfs and the output directory is on a lustre filesystem.

At any rate, a few failing workers shouldn't affect the integrity of the program, since the failed jobs will just get re-run on other workers. The only exception is the reduce phase of the dsmapreduce, since in that case the jobs are assigned to specific workers, and so one consistently failing worker can make the program get stuck.
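
To make that difference concrete, here is a rough sketch of the scheduling behavior just described (illustration only, not dswork's actual code; the job ids and worker assignments are made up):

% Rough sketch: map jobs vs. pinned reduce jobs (hypothetical ids and workers).
pendingMap = 1:6;                 % map jobs waiting to run
while ~isempty(pendingMap)
    ok = rand() > 0.2;            % pretend some attempts fail
    if ok
        pendingMap(1) = [];       % a finished job leaves the queue
    end                           % a failed map job stays queued, and any idle worker can retry it
end
% Reduce jobs, by contrast, are assigned to specific workers:
assignedWorker = [1 2 3];         % reduce job k must run on worker assignedWorker(k)
% If one of those workers fails every time, its reduce job never finishes,
% so the barrier at the end of the reduce phase never clears and the run stalls.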

Let me know if the issue reoccurs.

@praveenkul
Author

Currently I am not able to reproduce the same error. If it reoccurs I will contact you and also send you the savestate.mat file as you suggested. If the filesystem was not the actual cause, then I do not know how it got solved.
But I did see earlier that if one worker stopped, for example the one whose log I sent above, the code would make no further progress: if I ran 'top' in a Linux terminal, CPU utilization was at 0 percent.

But the code is currently working fine.

Thanks
Praveen
