ADR0007: How we are going to validate the new API codebase #1301
Conversation
Add a decision record about the different options which we can use to validate the new code under tuf.api. TODO: Make decision. Signed-off-by: Martin Vrachev <[email protected]>
I realized there is a fundamental difference between the two approaches: with the `ValidationMixin` we run our checks explicitly wherever we need them, while `pydantic` validates at object initialization. I am a little worried about the second approach. Personally, I will feel safer if we do pre-modification checks and block execution early in our functions.
By default pydantic only checks at object initialization time as well, right? You need to explicitly opt your setters into validation on assignment.

Generally speaking, I'm probably going to be grumpy about any runtime dependencies for this: pydantic, as an example, is 7700 lines of code. Someone will have to explain to pip developers how vendoring that is going to make their life better...

I know that the current checks may not be the best things to measure against, but I wonder if it would help to try to quantify the existing schema checks and the validation they do. A quick grep/sed/sort/uniq (that probably missed some things) says there aren't that many schemas that are really re-used extensively in tuf source code.

Reviewing the most used ones carefully might be a good idea: Is the check useful? What exactly gets checked when check_match() is called in this case? Can our proposal provide a good solution for this particular check -- is it clearly better than the status quo? These are some questions I've had, just thinking out loud.
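(For context, a minimal sketch of that opt-in, assuming the pydantic v1 API: by default a model only validates when it is created, and `validate_assignment` has to be enabled to re-validate on attribute assignment.)

```python
from pydantic import BaseModel, StrictStr

class User(BaseModel):
    name: StrictStr

    class Config:
        # Without this, assignments after __init__ are not validated.
        validate_assignment = True

user = User(name="Martin")  # validated at initialization
user.name = 42              # re-validated on assignment: raises ValidationError
```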
Let's get rid of the schemas. This is Python, not Protobuf.
I realized I made my code a little confusing. I will get back to your other thoughts later when I do a little more research.
Strict types are available in pydantic: https://pydantic-docs.helpmanual.io/usage/types/#strict-types. It's just sad there is no class-wide strict mode implemented yet; see pydantic/pydantic#1098. Signed-off-by: Martin Vrachev <[email protected]>
Some good news: strict types are available in pydantic (https://pydantic-docs.helpmanual.io/usage/types/#strict-types). It's just sad there is no class-wide strict mode implemented yet which would automatically treat standard argument types as their strict counterparts. PS: I pushed a commit in the ADR clarifying this.
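A small illustration of the difference (pydantic v1 API; the model names are made up): plain annotations coerce, strict types reject.

```python
from pydantic import BaseModel, StrictStr

class Lax(BaseModel):
    name: str        # coercing: compatible non-str values are converted

class Strict(BaseModel):
    name: StrictStr  # strict: anything that is not already a str is rejected

Lax(name=123)        # OK, name becomes "123"
Strict(name=123)     # raises pydantic.ValidationError
```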
Many thanks for the detailed research, @MVrachev! Here are some of my thoughts...
multiple fields.

* Good, because it allows reuse of the validation code through `securesystemslib.schemas` or another schema of our choice.
I think you are focusing too much on `securesystemslib.schemas`. There is an unspoken consensus that we want to get rid of it (see secure-systems-lab/securesystemslib#183), and I don't think that its use is a defining aspect of the `ValidationMixin`. IIRC we only used them in in-toto because it was already there and we lacked a comprehensive class model for all complex data types.

Take for instance the `_validate_keys` method on the in-toto `Layout` class:
```python
def _validate_keys(self):
    securesystemslib.formats.ANY_PUBKEY_DICT_SCHEMA.check_match(self.keys)
```
Now if we had a `PublicKey` class with its own validators -- something we intend to add as per ADR4, particularly for the purpose of validation -- then the validator would probably look like this:
```python
def _validate_keys(self):
    if not isinstance(self.keys, dict):
        raise ...
    for keyid, key in self.keys.items():
        if not isinstance(key, PublicKey):
            raise ...
        key.validate()
        # NOTE: Even though ADR4 only talks about classes for complex attributes
        # it probably makes sense to add a `KeyID` class as well just so that we
        # have a reusable way of checking hex strings of a certain length.
        if not isinstance(keyid, KeyID):
            raise ...
        keyid.validate()
```
Or we could add a `PublicKeyDict` class, and do:
```python
def _validate_keys(self):
    if not isinstance(self.keys, PublicKeyDict):
        raise ...
    self.keys.validate()
```
Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us, so that we always know that an in-toto layout's `pubkeys` variable contains a valid `Dict[KeyID, PublicKey]` value. Then we can use the `_validate_*` methods for more complex checks, such as:
```python
def _validate_keys(self):
    # We already know that 'self.pubkeys' is a valid Dict[KeyID, PublicKey]
    # value so no need to check this here...
    # ... but we also want to make sure that we don't have duplicate keys.
    # (Dict keyids are unique by construction, so compare the key values.)
    if len(self.pubkeys) != len(set(self.pubkeys.values())):
        raise ...
```
I'd say the `ValidationMixin` really just provides a shortcut -- i.e. `validate()` -- to all `_validate_*` methods on a class, which seems very similar to pydantic's `@validator` decorator (please correct me if I'm wrong!!).

So the main question to me is: is that decorator better than our custom `_validate_*` convention, and does `pydantic` have other features (that we can't just easily add to our `ValidationMixin`) that justify a +7K LOC dependency. :)
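For readers comparing the two conventions, here is roughly what the pydantic (v1) `@validator` decorator looks like; the field and the check are made up for illustration:

```python
from pydantic import BaseModel, validator

class Root(BaseModel):
    version: int

    @validator("version")
    def version_must_be_non_negative(cls, value):
        # Runs automatically at initialization; the ValidationMixin's
        # _validate_* methods instead run only when validate() is called.
        if value < 0:
            raise ValueError("version must be >= 0")
        return value
```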
Agreed: ideally we should use no 3rd-party dependencies.
I tried to update the ADR to not sound like the idea with the `ValidationMixin` fully depends on the schemas, but it seems I didn't quite manage :D.

I understand the idea with the `ValidationMixin` and it has good points. If we decide to go down that path, we should probably fix the existing schemas instead of removing them outright.

> Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us,

@lukpueh which type annotation checker do you use in `in-toto`? How many dependencies will it add to `tuf` and (if you can easily check that) how many lines of code?

As I said in my detailed comment here #1301 (comment), there are more features in `pydantic` which could be useful for us.
Thanks for the quick grep/sed/sort/uniq-ing, @jku! This is good data to back my first concern in secure-systems-lab/securesystemslib#183, i.e. schemas sound more specific than they are. Four of the five most popular schemas just check if a value is a string (and only one checks that it's not empty). I think we are better off with type hints there.
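To make the comparison concrete, here is the status quo pattern next to the type-hint alternative; the schema and function names are just representative examples:

```python
# Status quo: a runtime check that mostly just asserts "this is a string".
import securesystemslib.formats

def add_target(filepath):
    securesystemslib.formats.PATH_SCHEMA.check_match(filepath)
    ...

# Alternative: a type hint that a static checker (e.g. mypy) verifies
# ahead of time, with no runtime dependency or check at all.
def add_target_typed(filepath: str) -> None:
    ...
```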
Add summaries of how our options compare against our requirements. Signed-off-by: Martin Vrachev <[email protected]>
I added two additional commits. In the first one, I address some of Lukas's comments and, more importantly, add two additional requirements. In the second commit, I documented my observations on our third option - `marshmallow`. Before I give my opinion, I will document one final notable option - `typical`.
After a good look at `typical`, I realized that it's probably not worth considering as an option. The main problems I have with it are that:

1. It's a relatively unknown library with only 111 stars on GitHub.
2. It doesn't allow for custom validators. It's focused on type checking and allows some constraints on function arguments/class attributes, but nothing fancier than that. Adding the possibility for custom validators is planned for future versions.
3. It's a one-man show with only one maintainer actively working on the project.
4. It adds 4 additional dependencies: pytzdata, pendulum, inflection, and itself - typical.
5. It doesn't support Python 3.6.
I just understood that there is a way to invoke all validators with pydantic whenever we want with a helper function. Correct the ADR with this new information and fix some typos. Signed-off-by: Martin Vrachev <[email protected]>
There are multiple limitations in the "typical" library to consider it as an option for our new API, which is going to be used by a variety of projects, some of which are big, like PyPI. The main problems I have with it are:
1. It's a relatively unknown library with only 111 stars on GitHub.
2. It doesn't allow for custom validators. It's focused on type checking and it allows for some constraints on the function arguments/class attributes, but nothing fancier than that. Adding the possibility for custom validators is planned for future versions.
3. It's a one-man show with only one maintainer actively working on the project, which is a big red sign for PyPI.
4. It adds 4 additional dependencies: pytzdata, pendulum, inflection, and itself - typical.
5. It doesn't support Python 3.6.
Signed-off-by: Martin Vrachev <[email protected]>
If nobody has any other suggestions for libraries/options worth exploring, I conclude my research. I updated the ADR accordingly.
From the third-party solutions, I strongly prefer `pydantic` over `marshmallow`. The big question is: do we want to pay the price of adding two additional dependencies? The useful features that are implemented in `pydantic` -- strict types, validation on assignment, a helper to invoke all validators on demand, and custom validators -- could save us a lot of work.

Finally, I would add that when adding a new dependency there will be a price.
Thanks for your investigations, Martin! I hate to sound cynical, but I'm wary of any project (such as pydantic) that depends on one developer...
@trishankatdatadog I don't agree with the statement that `pydantic` depends on one developer. So, if one day @samuelcolvin doesn't have time to work on `pydantic`, other contributors could continue maintaining it.
Thank you for digging in and researching these options, and for the attention to detail in the presentation, @MVrachev! The rich diff with the tables makes the information easier to process. There are a couple of things that were noted in the initial Issue which I think are missing from this discussion.
Thanks for considering pydantic (I'm the main maintainer). I just wanted to chime in here and add a little to what @MVrachev says above, in no particular order.

The decision is entirely yours; I was just interested enough in the discussion to add my own perspective. Good luck.
Thanks for your comments, Samuel. While I (of course) understand that no complicated software is w/o 3rd-party deps, let me try to clarify where I'm coming from.

So, great to hear that you are planning to continue actively working on pydantic. This was not a personal attack on you or your project, but I hope you understand where I'm coming from.
After Joshua's feedback, I made a couple of additional (hopefully final) changes.

I know that this PR grew a lot and I changed many things in it, but please have a final look. PS: I checked how many lines...
Hi @trishankatdatadog, no problem, I get entirely where you're coming from. I actually don't think a...
After this research and all the available options, I think that:

* `pydantic` is the most feature-complete option, but it will add (using my rough estimates) around 9000 lines of code (with the additional dependency),
* the `ValidationMixin` keeps us free of new dependencies, but we have to be conscious that we will maintain the validation code ourselves.

All other options have a non-negligible list of disadvantages compared to those 2.
```python
# Be careful if you rely on _validate_id() to verify self.id!
# This won't be called if new_name is "".
self.validate()
```
I don't think the problem is related to different code paths, but rather that `User.validate()` is simply not suited for non-user-input validation. In your example you are not validating inputs; instead you are validating a user object after having performed some unvetted modifications.

If you want to use the function for input validation you have to do it when the input is a user object, e.g.:

```python
def print_user(user: User):
    user.validate()
    print(user)
```

In either case you have to make sure to actually call the function.
Yes, I don't do argument validation. Your example showcases what would happen if you pass an object implementing `validate()`, but it's not always like that, and for simple types like `int`, `str`, etc. I don't expect us to create custom classes.

With this example, I wanted to stress my point that because you have no function argument validation and no validation on assignment, you can change your class attributes and forget to call `validate()` at the end.
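A self-contained sketch of that failure mode (the `User` class here is hypothetical, following the `_validate_*` convention):

```python
class User:
    def __init__(self, name: str):
        self.name = name
        self.validate()

    def validate(self):
        self._validate_name()

    def _validate_name(self):
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")

user = User("Martin")  # validated at construction
user.name = ""         # invalid, but accepted silently: nothing runs
                       # unless someone remembers to call user.validate()
```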
With the `in-toto` implementation of the `ValidationMixin`, we can only validate class attributes inside class methods. If we want to validate functions outside classes or function arguments we would have to enhance this solution.
I am not sure what you mean by "validate functions", but as you can see in my `print_user` example above, you can validate function arguments outside of classes. But yes, you can only validate objects that implement a `validate` method.
Martin, thanks a ton for the detailed investigation and your assessment. Given all the things I learned from the discussions here and in #1130, I took the liberty to take a step back and try to re-think what we actually need and how we can achieve it with the tools at hand, without large dependencies (pydantic, marshmallow) or magic power features (descriptors). I am also aware that this might only structure the discussion in my mental model and not in others'. So if you have the feeling that my comment rather derails the discussion, please disregard it.

Validation Use Cases

Validation Tools

Suitable Tools per Use Case
Demonstration

```python
# Exemplary simple validator (validators.py)
def version(val: int):
    # TODO: check type first (?)
    if val < 0:
        raise ValueError("version must be >= 0")
```

```python
# Exemplary TUF class that implements ValidationMixin (simplified for emphasis)
class Root(ValidationMixin):
    # ...
    def __init__(self, version: int, keys: Keys):
        # TODO: check types first (?)
        tuf.validators.version(version)
        keys.validate()
        # ... assign validated attributes

    def _validate_keys(self):
        # TODO: check type first (?)
        self.keys.validate()  # recursively use ValidationMixin for class model tree

    def _validate_version(self):
        validators.version(self.version)  # use simple validators for "leaves" in the class model tree
```

```python
# Example for use case 1
# Manually compose metadata and validate conformance,
# e.g. in an explorative context (tutorial, initial repo setup, etc.)
>>> my_root = Root()
# ...
>>> my_root.version = 1
>>> my_root.validate()
```

```python
# Example for use cases 2.i.*
def assign_new_version_and_print_message__yes_it_is_silly(root: Root, version: int, message: str):
    # TODO: check all types first (?)
    root.validate()
    tuf.validators.version(version)
    root.version = version
    print(message)
```

```python
# Example for use case 2.ii. (simplified for emphasis)
def deserialize_json(data):
    # Delegate validation to json.loads and Root.from_dict --> Root.__init__
    Root.from_dict(json.loads(data))
```

Some more thoughts
Such a decorator could also fail with a better error than...
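A hypothetical sketch of such an argument-validating decorator (all names and the error format are made up; typing generics would need extra work):

```python
import functools
import inspect

def validate_arguments(func):
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = sig.parameters[name].annotation
            if annotation is inspect.Parameter.empty:
                continue  # unannotated arguments are not checked
            # Only plain classes are checked here.
            if isinstance(annotation, type) and not isinstance(value, annotation):
                raise TypeError(
                    f"{func.__name__}: expected {name!r} to be "
                    f"{annotation.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper

@validate_arguments
def bump_version(version: int) -> int:
    return version + 1

bump_version("1")  # TypeError with a descriptive message
```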
Honestly, I don't see descriptors as so magical.

Seems to me that those are our use cases, yes.

The duck typing principle is well suited for objects, but for simple types I am not so convinced.

As I said before, and I will (at least try to) repeat myself for the last time: automatic validation on each assignment is a lot better.
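To illustrate the point, a minimal descriptor sketch (hypothetical names) where the check runs on every assignment instead of only when someone remembers to call `validate()`:

```python
class Version:
    """Data descriptor that validates on every assignment."""

    def __set_name__(self, owner, name):
        self.private_name = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        if not isinstance(value, int) or value < 0:
            raise ValueError("version must be a non-negative int")
        setattr(obj, self.private_name, value)

class Root:
    version = Version()

    def __init__(self, version: int):
        self.version = version  # goes through Version.__set__

root = Root(1)     # validated at construction...
root.version = -1  # ...and on every later assignment: raises ValueError
```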
This is a tricky one to review... My input below is unlikely to lead to a specific 'decision outcome', but I think it's the best I can do.

I think the text concentrates a lot on the "tools" (modules and dependencies): they are obviously important, but they are just ways to get to the actual goal (figuring out what the important things to validate are, how to efficiently validate them, and making the process easy to replicate). Very briefly on the considered modules:

I think I pretty much agree with Lukas' long post, but some specific opinions:
Thanks again for all the research here @MVrachev, and for the detailed reviews @lukpueh and @jku. Here's my attempt to summarise the discussion so far.

What decisions are made today:

What decisions are not yet made:

For ADR0007 and this PR we could either a) update the ADR to capture only the decisions where we have agreement and merge, OR b) convert the PR to draft while we try to resolve the additional items.

Based on the discussion here, I think good next steps would be:
I think it will be better if we keep this PR as a draft and open until we have a final decision.

I will create validation functions and will experiment with descriptors that utilize them.
I'm going to close this: it's valuable work, but not for merging. Maybe we are at a point where we can evaluate our validation needs again (like: do we need serialization-time validation, or do we have validation code that could be separated out of the object construction/deserialization), but that doesn't require this PR to stay open for months.
Related to #1130
Description of the changes being introduced by the pull request:
Add a decision record about the different options which we can use
to validate the new code under tuf.api.
I made small prototypes to showcase the different options.
You can have a look at them in my branches:
TODO: Make a decision.
Signed-off-by: Martin Vrachev [email protected]
Please verify and check that the pull request fulfills the following
requirements: