Type checking and coercion #662
Conversation
@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of the "gathered" syntax. a key goal of pydra was to keep it simple, so users don't have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface, simply to parallelize elements of a script.
Yeah, I tried to avoid the need to add the "gathered" syntax as it seems a bit superfluous, but having to check whether a value is actually a list of that value, or a list of list of that value, etc... just makes the type-checking impractical (unless anyone has any good ideas). I personally think it makes the code a bit easier to understand if values that are to be split are explicitly marked as such, but definitely something it would be good to discuss. Note that the "type-checking" process is actually as much about coercing strings/paths into File objects so we know how to hash them properly as it is about validation. So I think it will be hard to get away from some form of type-parsing (the existing code this PR replaces already had quite a bit of special-case handling to deal with File objects).
I would perhaps argue that the direction of travel with Python is towards a stricter, type-checked language. Certainly editors like VSCode are pushing you to write code that way, with type-checked linting switched on by default. On the topic of linting, I have been thinking about what it would take to facilitate static checking and hinting of pydra workflows. I will try to put together an issue summarising what would be required so we can also debate its merits.
Perhaps if it had a better name than "gathered"?
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #662 +/- ##
==========================================
+ Coverage 81.77% 82.84% +1.07%
==========================================
Files 20 22 +2
Lines 4400 4845 +445
Branches 1264 0 -1264
==========================================
+ Hits 3598 4014 +416
- Misses 798 831 +33
+ Partials 4 0 -4
Flags with carried forward coverage won't be shown.
on the one hand I don't like introducing something like "gathered"
Thinking a bit about this, I believe the issue here is in:

a = TaskA(x=[1, 2, 3, 4]).split("x")

What if we made the syntax instead:

a = TaskA().split("x", x=[1, 2, 3, 4]).combine("x")

In this formulation, lazy inputs/outputs are not going to be reasonably type-checkable, but we could still build-time check them, and should be able to tell whether inputs are mapped or not.
Thanks 🙏 I'm really chuffed you think so 😊

On the subject of keeping the basic type-free functionality intact, one alteration I have been considering, which would be good to discuss in the meeting, is relaxing the type-checking at construction time (i.e. for lazy fields) so that base classes can be passed to sub-classed type fields, avoiding the need to add cast statements between loosely-typed tasks and tightly-typed tasks. Stricter type-checking/coercing can still be performed at runtime.

For example, if B is a subclass of A, then currently you can connect a B-typed output into an A-typed input, but if you wanted to connect an A-typed output to a B-typed input you would need to cast it to B first. While this aligns with how traditional type-checking is done, we might want to be a bit more flexible so you could connect generic types to more specific ones at construction time. It would somewhat reduce the effectiveness of the type-checking, but would probably avoid any (false-positive) cases where the type-checking could be annoying.
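A small illustrative sketch of the current behaviour (hypothetical types and a toy check, not pydra's actual connection logic):

class A:          # generic/base type (think of a generic file class)
    pass

class B(A):       # more specific subclass (think of a particular file format)
    pass

def check_connection(source_type, declared_type):
    # current rule: the source type must be the declared type or a subclass of it
    if not issubclass(source_type, declared_type):
        raise TypeError(
            f"{source_type.__name__} output cannot be connected to a "
            f"{declared_type.__name__}-typed input without an explicit cast"
        )

check_connection(B, A)      # fine: B is a subclass of A
try:
    check_connection(A, B)  # rejected at construction time under the strict rule
except TypeError as err:
    print(err)              # the proposal would defer this check to runtime instead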
… to use serial plugin
I went through this review "diagonally" and provided minor comments along the way.
pydra/engine/core.py (outdated)
if self._lzout:
    raise Exception(
        f"Cannot split {self} as its output interface has already been accessed"
    )
How can this use case be triggered?
If you try to split a node after you have accessed its lzout, it will now raise an error, e.g.
wf.add(A(x=1, name="a"))
wf.add(B(w=wf.a.lzout.z))
wf.a.split(y=[1, 2, 3]) # <--- Will raise an error, because it will change z from a single value to a list/state-array
return SpecInfo(
-    name="Inputs",
+    name=spec_name,
spec_name used to be set to "Inputs". Now that it has become a parameter, what happens if spec_name is set to None or the empty string? Options include coalescing with spec_name = spec_name or "Inputs", defining a default parameter value, or raising an error if spec_name is unset.
Shouldn't spec_name always be a string?
It's Python at the end of the day, so spec_name can be anything, including None.
I think we can probably just raise the error in this case
…t more involved. Will add in separate PR
Types of changes

This PR contains a number of breaking changes:

- values to be split over are now passed to the TaskBase.split() method (i.e. instead of in Task.__init__ or by setting the inputs attribute)
- a node can no longer be split after its .lzout interface has been accessed (so that the type of the LazyField can be set properly)
- fields with an output_file_template must also be defined in the output spec with a fileformats.generic.File (or subclass) type, in addition to an input field of str, Path, ty.Union[str, bool] or ty.Union[Path, bool] type (@ghisvail I forgot about this when you asked whether there were any implications for task interface design; see the sketch after this list)
- set_input_validators() has been removed
- File and Directory in pydra.engine.specs now refer to the classes in fileformats.generic. As such, existence (and format) checking of non-lazy inputs is performed in FileSet.__init__, not at runtime as was the case previously
- unset values are now attrs.NOTHING, not None (not sure if this actually changes the overall behaviour though, as I believe they are always overwritten)
- in some cases file formats need to be specified more precisely (e.g. File is not sufficient and you would need to use fileformats.medimage.NiftiX). However, on the plus side
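A hedged sketch of what the output_file_template change could look like in an interface definition (the field names and metadata here are illustrative, following pydra's existing shell-task spec conventions rather than exact requirements):

from pydra.engine.specs import File, ShellSpec, ShellOutSpec, SpecInfo

input_spec = SpecInfo(
    name="Input",
    fields=[
        ("in_file", File, {"help_string": "input file", "argstr": "", "mandatory": True}),
        # the templated path is still declared as a str (or Path/Union[..., bool]) input field
        ("out_file", str, {"help_string": "output file name", "argstr": "",
                           "output_file_template": "{in_file}_out"}),
    ],
    bases=(ShellSpec,),
)

output_spec = SpecInfo(
    name="Output",
    fields=[
        # breaking change: the same field must now also appear in the output spec
        # with a File (or subclass) type
        ("out_file", File, {"help_string": "output file"}),
    ],
    bases=(ShellOutSpec,),
)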
Summary

Type checking

Adds detailed type checking at both run and construction time (i.e. lazy fields carry the type of the field they point to). This is done by pydra.utils.TypeParser, which works as an attrs converter and is set as the converter for every field in the generated input/output specs by default. The type parser unwraps arbitrarily nested generic types (e.g. List, Tuple and Dict from the typing package) and tests that they either match or are a subclass of the declared types.
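As a rough illustration of the attrs-converter mechanism (a hand-rolled toy converter for a single field, not the actual pydra.utils.TypeParser, whose interface may differ):

import attrs
from pathlib import Path

def coerce_to_path(value):
    # toy converter: check the declared type and coerce str to Path on assignment
    if isinstance(value, Path):
        return value
    if isinstance(value, str):
        return Path(value)
    raise TypeError(f"cannot coerce {type(value).__name__} to Path")

@attrs.define
class Inputs:
    in_file: Path = attrs.field(converter=coerce_to_path)

print(Inputs(in_file="data/brain.nii"))   # in_file is coerced to Path('data/brain.nii')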
Type coercion

In addition to the type-checking, selected types will be automatically coerced to their declared type. Like the type-checking, the coercion is performed between types within arbitrarily nested containers, so it can handle coercions such as dict[str, list[str]] -> dict[str, tuple[File]].
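A simplified sketch of that nested coercion, with pathlib.Path standing in for File and typing.get_origin/get_args used to walk the declared type (the real TypeParser covers many more cases):

import typing as ty
from pathlib import Path

def coerce(value, target):
    """Recursively coerce value into the shape of the target type annotation."""
    origin = ty.get_origin(target)
    if origin is None:                      # a plain type such as str or Path
        return value if isinstance(value, target) else target(value)
    args = ty.get_args(target)
    if origin is dict:
        key_type, val_type = args
        return {coerce(k, key_type): coerce(v, val_type) for k, v in value.items()}
    if origin in (list, tuple):
        return origin(coerce(item, args[0]) for item in value)
    raise TypeError(f"unsupported container type: {target}")

# e.g. dict[str, list[str]] -> dict[str, tuple[Path]]
print(coerce({"out": ["a.nii", "b.nii"]}, ty.Dict[str, ty.Tuple[Path]]))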
Hashing
Hashing is implemented (by @effigies) via a dispatch method, which handles hashes for unordered objects such as sets in a process-independent way.
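A toy sketch of the dispatch idea for order-independent hashing of sets (not the actual pydra implementation):

import hashlib
from functools import singledispatch

@singledispatch
def hash_obj(obj) -> bytes:
    # fallback rule: hash the repr (the real implementation is far more careful)
    return hashlib.blake2b(repr(obj).encode()).digest()

@hash_obj.register
def _(obj: set) -> bytes:
    # sort the members' digests so the result is independent of iteration order
    h = hashlib.blake2b()
    for digest in sorted(hash_obj(item) for item in obj):
        h.update(digest)
    return h.digest()

assert hash_obj({1, 2, 3}) == hash_obj({3, 1, 2})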
Hashing of files is now handled outside of Pydra, within the fileformats package. This enables the hashes to be aware of changes in associated files.

Copying files
Copying files (e.g. into working directories) is also handled by fileformats, and includes graceful fallback between leaving the files in place, symlinking, hardlinking and copying (in order of preference), taking into account whether the files need to be in the same folder or have the same file-stem.
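A rough sketch of that fallback chain (illustrative only; the real logic lives in fileformats and also decides when files can simply be left in place):

import os
import shutil
from pathlib import Path

def materialise(src: Path, dest_dir: Path) -> Path:
    """Prefer cheap links over copies: try symlink, then hardlink, then a real copy."""
    dest = dest_dir / src.name
    for attempt in (os.symlink, os.link, shutil.copy2):
        try:
            attempt(src, dest)
            return dest
        except OSError:
            continue  # e.g. no symlink permission or a cross-device link; try the next option
    raise OSError(f"could not materialise {src} in {dest_dir}")

Checklist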