Intermittent workflow failures (>=0.21?) #622
Comments
Actually, looking at that error message, L310 should never have worked. |
Ah, it was me! |
Need a proper test of that error message, sorry. However, it looks like that error is covering an underlying error retrieving the runnable tasks... |
See #623 for a simple fix of this issue. The traceback on the underlying error (with error message) is …, which suggests the hashes of the upstream nodes are being altered somehow (probably deep in a nested object). The fact that all the tasks in the simple workflow are being altered is interesting. Not sure why this error would be happening intermittently; it feels like it is going to be pretty hard to track down... |
Is there a set involved? Could be an item ordering issue. |
Not sure. There is quite a bit involved. The fact that it is every node in the workflow is puzzling me though, as they wouldn't all share a common input... In general, seeing as I have hit this problem twice now, I'm wondering whether relying on the hash not changing is a bit brittle, and whether it wouldn't be a better idea to cache the hash when the workflow graph is generated. You could also use this cache to check when the hash is changing and raise an error |
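A minimal sketch of what that caching-and-checking could look like (the `graph.nodes`, `node.name` and `node.checksum` attributes are assumptions for illustration, not a description of Pydra's actual internals):

```python
# Hypothetical sketch only: cache node checksums when the graph is built and
# verify them before execution. The graph/node attributes used here are
# assumptions, not Pydra's real API.


class UnstableHashError(Exception):
    """Raised when a node hashes differently than it did at graph construction."""


def cache_node_hashes(graph) -> dict:
    """Record each node's checksum at graph-construction time."""
    return {node.name: node.checksum for node in graph.nodes}


def check_node_hashes(graph, cached: dict) -> None:
    """Recompute checksums and raise a descriptive error if any have drifted."""
    for node in graph.nodes:
        if node.checksum != cached[node.name]:
            raise UnstableHashError(
                f"Hash of node {node.name!r} changed between graph construction "
                "and execution; one of its input types probably has an unstable "
                "hash (e.g. it contains a set that is iterated in hash order)."
            )
```

`check_node_hashes()` could be called just before tasks are submitted, turning a silent mismatch into an actionable error.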
Come to think of it, the problem reoccurred after I switched to using fileformats |
I don't see anything obvious between 0.20 and 0.21 |
I don't know its underlying implementation, but if a frozenset is iterated over in an order that depends on its elements' hashes, that order can differ between processes. Insertion order is probably not desirable, either, as I would hope sets built from the same elements in a different order would still hash the same. |
I could be wrong about that. I started noticing the issue around the time I switched from 0.20 to 0.21 but it was around the time I switched to fileformats so that makes more sense |
So if that is the issue I shouldn't see this problem with the serial plugin, just the cf, which kind of matches with what I'm experiencing. I usually use serial in my unittests, but not in the "app execution" routine, which is what is failing at the moment |
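For illustration, a standalone demo (not Pydra code) of why the hash-seed/set-ordering explanation fits the serial-vs-`cf` pattern: the serial plugin stays in one interpreter with one hash seed, whereas each spawned worker gets its own seed unless `PYTHONHASHSEED` is pinned, and the iteration order of a set of strings depends on that seed:

```python
# Standalone demo (not Pydra code): the iteration order of a set of strings
# depends on PYTHONHASHSEED, so spawned workers with different seeds can see
# the same set in different orders.
import os
import subprocess
import sys

SNIPPET = "print(list(frozenset(['alpha', 'beta', 'gamma', 'delta'])))"

for seed in ("0", "1", "2"):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # The printed element order will typically differ between seeds, so any
    # byte representation built by iterating over the set differs too.
    print(f"PYTHONHASHSEED={seed}: {result.stdout.strip()}")
```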
I will explicitly implement the hash for FileSet. |
That would still leave a bit of a vulnerability in the engine as a whole if you can't use inputs containing sets. |
Yeah. I don't remember how we decide what we're hashing in types, but it would be good to make sure that we return a consistent view. If we're hashing frozensets, something like:

```python
class stableset(frozenset):
    def __reduce__(self):
        return (self.__class__, tuple(sorted(self, key=hash)))
```

Edit: Well, no. We can't sort based on objects' hashes. |
Wouldn't

```python
def __hash__(self):
    return hash(tuple(hash(i) for i in sorted(self)))
```

work? |
Well, it would at least for FileSets, which have a defined sort order. |
That assumes that the objects in the set are sortable. |
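One way around the sortability assumption would be to sort the elements' serialized digests rather than the elements themselves. A hypothetical sketch (using `pickle` purely for illustration; it is not Pydra's byte-representation machinery, and pickling arbitrary nested objects can itself be order-dependent):

```python
# Hypothetical sketch: make a set's contribution to a hash independent of its
# iteration order by sorting the elements' serialized digests, not the elements
# themselves, so no __lt__ is required on the elements.
import hashlib
import pickle
from typing import Iterable


def stable_set_digest(items: Iterable) -> str:
    """Digest a collection of items independently of the order they arrive in."""
    member_digests = sorted(
        hashlib.blake2b(pickle.dumps(item)).digest() for item in items
    )
    h = hashlib.blake2b()
    for digest in member_digests:
        h.update(digest)
    return h.hexdigest()


# Same elements, different arrival order, same digest:
assert stable_set_digest(["a", "b", "c"]) == stable_set_digest(["c", "b", "a"])
```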
Taking a different approach, couldn't we just set the hashseed globally across all worker nodes? |
https://gerrychain.readthedocs.io/en/latest/topics/reproducibility.html just suggests setting it to 0 |
Yes, but then we still get a different hash in the coordinator thread, since we can't control what hash seed the parent interpreter was started with. |
Can we set them to whatever the hash seed for the base interpreter is when creating the processes? |
This thread seems to suggest that multiprocessing processes are passed the same hash seed https://stackoverflow.com/questions/52044045/os-environment-variable-reading-in-a-spawned-process |
For more distributed execution plugins, could we write the hash seed to a file somewhere in the cache directory and make sure it is set in the environment when the worker is loaded? |
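A rough sketch of that file-based idea (the `hash_seed` file name and helper functions are made up, and it assumes the coordinator's seed is known or pinned, which is exactly the catch raised above):

```python
# Hypothetical sketch: persist the hash seed next to the cache so that workers
# (local or on a cluster) can be started with the same PYTHONHASHSEED. The
# file name and helpers are made up for illustration.
import os
from pathlib import Path


def record_hash_seed(cache_dir: Path) -> str:
    """Write the seed the coordinator is using (here assumed pinned, default '0')."""
    seed = os.environ.get("PYTHONHASHSEED", "0")
    (cache_dir / "hash_seed").write_text(seed)
    return seed


def worker_env(cache_dir: Path) -> dict:
    """Environment for a worker process, re-using the recorded seed."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = (cache_dir / "hash_seed").read_text().strip()
    return env
```

`worker_env()` could then be passed as the environment when launching worker processes or writing submission scripts for a scheduler.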
That would probably kill any hope for Windows support, since I don't think … Another approach could be to detect this case and raise an error saying that there must be an unstable type somewhere in the inputs.

Also, no. Because … |
That would work for sets passed directly to inputs, but if they are internal to the object being passed, then we would have to iterate through all the nested members |
That's annoying |
I just mean we detect the hash change and infer that there must be an unstable type somewhere in the spec. If people don't know what is in the types they're using, that only goes so far, but it's better than nothing. |
Explicitly hashing FileSet resolved the error I was having. Will have to wait and see if it reoccurs |
I could just amend the error message in my PR to say something along those lines |
Actually, the error did reoccur, but I think I didn't succeed in setting up the manual hash. #626 should fix the hash. Maybe we can look at it during the hack week |
What version of Pydra are you using? 0.21
What were you trying to do/What did you expect will happen? run a workflow successfully
What actually happened? workflow intermittently fails with error
Can you replicate the behavior? If yes, how? Not easily. See https://github.com/ArcanaFramework/arcana-xnat/actions/runs/4278435805/jobs/7448157567 for example, where the workflow fails in one iteration of the matrix but not the other. Note that exactly the same code base passed successfully just moments later, https://github.com/ArcanaFramework/arcana-xnat/actions/runs/4278436386, so it is likely due to the workflow graph being executed in different orders between runs.
I have pinned my local dev version to 0.20, which seems to avoid the issue. So I was wondering whether anything might have changed between 0.20 and 0.21 that could lead to these sorts of errors.
I was having a similarly intermittent problem with 0.21 where the result of executing a Workflow was being returned as `None` instead of a `pydra.engine.specs.Result`, which I'm suspecting is related.