
ADR0007: How we are going to validate the new API codebase #1301

Conversation

MVrachev
Collaborator

@MVrachev MVrachev commented Mar 10, 2021

Related to #1130

Description of the changes being introduced by the pull request:
Add a decision record about the different options which we can use
to validate the new code under tuf.api.

I made small prototypes to showcase the different options.
You can have a look at them in my branches:

TODO: Make a decision.

Signed-off-by: Martin Vrachev [email protected]

Please verify and check that the pull request fulfills the following requirements:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

@MVrachev MVrachev marked this pull request as draft March 10, 2021 15:21
@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch 2 times, most recently from 5c29e1e to 5b682ec Compare March 10, 2021 15:51
Add a decision record about the different options which we can use
to validate the new code under tuf.api.

TODO: Make decision.

Signed-off-by: Martin Vrachev <[email protected]>
@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch from 5b682ec to 4b2f28c Compare March 10, 2021 21:05
@MVrachev
Collaborator Author

MVrachev commented Mar 11, 2021

I realized there is a fundamental difference between the approaches of the ValidationMixin and pydantic, together with the way we are currently using securesystemslib.schema.
The pydantic approach is to validate all inputs before attribute modification, or before the function is started.

On the other hand, the ValidationMixin approach does validation after attribute modification,
or, as I understand it, validate() is called once all class attributes have been modified.

I am a little worried about the second approach.
I ask myself questions like:
What if we make changes to a file/database during the function execution and then throw an exception when validating?
What would happen to those external resources?

Personally, I would feel safer if we do pre-modification checks and block execution early in our functions.
We can think of an alternative way to achieve this if we decide we want to implement it on our own.
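To illustrate the difference I am worried about, here is a minimal sketch (the class and method names are invented for this example only, they are not from any of the branches):

class PreValidated:
    def set_version(self, version: int) -> None:
        # pre-modification check: bad input is rejected before any state changes
        if not isinstance(version, int) or version <= 0:
            raise ValueError("version must be a positive int")
        self.version = version

class PostValidated:
    def bump_version(self, version, path) -> None:
        self.version = version
        with open(path, "w") as f:       # external side effect happens first...
            f.write(str(version))
        self.validate()                  # ...and only then do we notice bad input

    def validate(self) -> None:
        if not isinstance(self.version, int) or self.version <= 0:
            raise ValueError("version must be a positive int")

If validate() raises in the second case, the file on disk has already been written with the bad version.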

@jku
Member

jku commented Mar 11, 2021

The pydantic approach is to validate all inputs before attribute modification, or before the function is started.

On the other hand, the ValidationMixin approach does validation after attribute modification,
or, as I understand it, validate() is called once all class attributes have been modified.

by default pydantic only checks at object initialization time as well, right? You need to explicitly mark up your setters with @pydantic.validate_arguments() to get function argument validation (which actually seems to be a "beta" feature). I'm a bit confused about how it actually does it... does your change_spec_version() properly trigger the spec_version validator? EDIT: I see you test that it does get called and that changing the property also triggers it...

Generally speaking I'm probably going to be grumpy about any runtime dependencies for this: pydantic as an example is 7700 lines of code. Someone will have to explain to pip developers how vendoring that is going to make their life better...

I know that the current checks may not be the best things to measure against but ... I wonder if it would help to try to quantify the existing schema checks and the validation they do: Quick grep/sed/sort/uniq (that probably missed some things) says there aren't that many schemas that are really re-used extensively in tuf source code:

  (...long tail skipped...)
  5 ANYKEY_SCHEMA
  5 ISO8601_DATETIME_SCHEMA
  5 SIGNABLE_SCHEMA
  6 METADATAVERSION_SCHEMA
  9 RELPATH_SCHEMA
 14 ROLENAME_SCHEMA
 23 BOOLEAN_SCHEMA
 33 PATH_SCHEMA
 45 NAME_SCHEMA

Reviewing the most used ones carefully might be a good idea: Is the check useful? what exactly gets checked when check_match() is called in this case? Can our proposal provide a good solution for this particular check -- is our proposal clearly better than status quo?

Some questions I've had, just thinking out loud:

  • Is the objective to get rid of schemas or not? The problem statement seems to revolve around schemas being a non-optimal solution for the problem, but at least the ValidationMixin example uses schemas extensively... I understand the short- and long-term answers to the question may be different, but the goal should be stated I think
  • Does the use of type hints allow us to essentially just drop a number of schema checks? Do we have to design our code in a specific way to get these benefits?
  • Is the optimal design that we have a full validation at Metadata object initialization, and then on property setters we decide case-by-case whether to validate the whole object or just the input or something in between?
  • Are the choices made here going to be usable for the rest of the code? I mean code that is not part of the API but might still need to do checks... Teodora's client refactor is definitely going to make decisions on this already

@trishankatdatadog
Member

Let's get rid of the schemas. This is Python, not Protobuf.

@MVrachev
Collaborator Author

MVrachev commented Mar 11, 2021

by default pydantic only checks at object initialization time as well, right? You need to explicitly mark up your setters with @pydantic.validate_arguments() to get function argument validation (which actually seems to be a "beta" feature). I'm a bit confused about how it actually does it... does your change_spec_version() properly trigger the spec_version validator? EDIT: I see you test that it does get called and that changing the property also triggers it...

I realized I made my code a little confusing.
I don't need the @pydantic.validate_arguments() decorator for def change_spec_version(self, new_spec_ver: str).
The validation happens because I am assigning a new value to self.spec_version.
I will add a couple more tests to showcase the limits of pydantic.
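Roughly, this is what happens in my branch (a simplified sketch, not the exact code; the MAJOR.MINOR.PATCH check just stands in for the real spec_version validation):

from pydantic import BaseModel, validator

class Signed(BaseModel):
    spec_version: str

    class Config:
        validate_assignment = True  # re-run field validators on attribute assignment

    @validator("spec_version")
    def check_spec_version(cls, value):
        if len(value.split(".")) != 3:
            raise ValueError("spec_version must look like MAJOR.MINOR.PATCH")
        return value

signed = Signed(spec_version="1.0.0")
signed.spec_version = "2.0"  # raises pydantic.ValidationError, no extra decorator needed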

Will get back to your other thoughts later when I do a little more research.

Strict types are available in pydantic:
https://pydantic-docs.helpmanual.io/usage/types/#strict-types
it's just sad there is no class-wide strict mode implemented yet:
See: pydantic/pydantic#1098

Signed-off-by: Martin Vrachev <[email protected]>
@MVrachev
Collaborator Author

MVrachev commented Mar 15, 2021

Some good news:
I had been mistaken about pydantic strict mode: they do support strict types.
They have StrictStr, StrictBytes, StrictInt, StrictFloat, and StrictBool as strict types.
I added a new commit to my pydantic branch to show how it works for two of those strict types:
MVrachev@d1bb21f
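In short, this is the behaviour the strict types give (a simplified sketch, not the exact code from the commit):

from pydantic import BaseModel, StrictInt, StrictStr

class Example(BaseModel):
    version: StrictInt
    name: StrictStr

Example(version=1, name="root")    # ok
Example(version="1", name="root")  # raises pydantic.ValidationError: a plain `int`
                                   # annotation would silently coerce "1" to 1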

It's just sad there is no class-wide strict mode implemented yet which would automatically treat standard arguments
as strict arguments:
See: pydantic/pydantic#1098

PS: I pushed a commit clarifying this in the ADR.

Member

@lukpueh lukpueh left a comment


Many thanks for the detailed research, @MVrachev! Here are some of my thoughts...

multiple fields.

* Good, because it allows reuse of the validation code through
`securesystemslib.schemas` or another schema of our choice.
Member


I think you are focusing too much on securesystemslib.schemas. There is an unspoken consensus that we want to get rid of it (see secure-systems-lab/securesystemslib#183). And I don't think that its use is a defining aspect of the ValidationMixin. IIRC we only used them in in-toto because it was already there and we lacked a comprehensive class model for all complex data types.

Take for instance the _validate_keys method on the in-toto Layout class:

  def _validate_keys(self):
    securesystemslib.formats.ANY_PUBKEY_DICT_SCHEMA.check_match(self.keys)

Now if we had a PublicKey class with its own validators -- something we intend to add as per ADR4 particularly for the purpose of validation -- then the validator would probably look like this:

  def _validate_keys(self):
    if not isinstance(self.keys, dict):
      raise ...

    for keyid, key in self.keys.items():
      if not isinstance(key, PublicKey):
        raise ...
      key.validate()

      # NOTE: Even though ADR4 only talks about classes for complex attributes
      # it probably makes sense to add a `KeyID` class as well just so that we
      # have a reusable way of checking hex strings of a certain length.
      if not isinstance(keyid, KeyID):
        raise ...
      keyid.validate()

Or we could add a PublicKeyDict class, and do:

  def _validate_keys(self):
    if not isinstance(self.keys, PublicKeyDict):
      raise ...
    self.keys.validate()

Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us, so that we always know that an in-toto layout's pubkeys variable contains a valid Dict[KeyID, PublicKey] value. Then we can use the _validate_* for more complex checks, such as:

  def _validate_keys(self):
    # We already know that 'self.pubkeys' is a valid Dict[KeyID, PublicKey]
    # value so no need to check this here...

    # ... but we also want to make sure that we don't have duplicate keys.
    if len(self.pubkeys.keys()) != len(set(self.pubkeys.keys())):
      raise ...

I'd say the ValidationMixin really just provides a shortcut -- i.e. validate() -- to all _validate_* methods on a class, which seems very similar to pydantic's @validator decorator (please correct me if I'm wrong!!)

So the main question to me is, is that decorator better than our custom _validate_* convention, and does pydantic have other features (that we can't just easily add to our ValidationMixin) that justify a +7K LOC dependency? :)

Member


Agreed: ideally we should use no 3rd-party dependency.

Collaborator Author

@MVrachev MVrachev Mar 19, 2021


I tried to update the ADR so that the ValidationMixin idea doesn't sound like it fully depends on the schemas, but
it seems I didn't succeed :D.

I understand the idea behind the ValidationMixin and it has good points.
If we decide to go down that path, it probably means fixing the existing schemas rather than removing them outright.

Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us,

@lukpueh which type annotation checker do you use in in-toto?
How many dependencies would it add to tuf and (if you can easily check that) how many lines of code?

As I said in my detailed comment here #1301 (comment), there are more features in pydantic which could be useful for us.

@lukpueh
Member

lukpueh commented Mar 16, 2021

  (...long tail skipped...)
  5 ANYKEY_SCHEMA
  5 ISO8601_DATETIME_SCHEMA
  5 SIGNABLE_SCHEMA
  6 METADATAVERSION_SCHEMA
  9 RELPATH_SCHEMA
 14 ROLENAME_SCHEMA
 23 BOOLEAN_SCHEMA
 33 PATH_SCHEMA
 45 NAME_SCHEMA

Thanks for the quick grep/sed/sort/uniq-ing, @jku! This is good data to back up my first concern in secure-systems-lab/securesystemslib#183, i.e. schemas sound more specific than they are. Four of the five most popular schemas just check if a value is a string (and only one checks that it's not empty). I think we are better off with type hints there.

Add summaries of how our options compare against our requirements.

Signed-off-by: Martin Vrachev <[email protected]>
@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch from da588e5 to c6c706c Compare March 17, 2021 16:26
@MVrachev
Collaborator Author

MVrachev commented Mar 17, 2021

I added two additional commits.

In the first one, I address some of Lukas's comments and, more importantly, add two additional requirements.
Also, I decided to create a small table summarizing how each of the options performs against our requirements.
Finally, in this commit, I added two pros to the pydantic section.

In the second commit, I documented my observations on our third option - marshmallow. I have created a small prototype
for it the way I did with the other two.
You can find the link to my branch in the updated PR description.

Before I give my opinion, I will document one final notable option - typical.

@MVrachev
Collaborator Author

MVrachev commented Mar 17, 2021

After a good look at typical, I realized that it's probably not worth considering as an option.

The main problems I have with it are that:

  1. It's a relatively unknown library with only 111 stars on GitHub.
  2. It doesn't allow for custom validators. It's focused on type checking and it allows for some constraints
    on the function arguments/class attributes, but nothing fancier than that. Adding the possibility for custom validators
    is planned for future versions.
  3. It's a one-man show with only one maintainer actively working on the project, which is a big red flag for PyPI.
  4. It adds 4 additional dependencies: pytzdata, pendulum, inflection, and typical itself.
  5. It doesn't support Python 3.6.

I just understood, that there is a way to invoke all validators with
pydantic whenever we want with a helper function.

Correct the ADR with this new information and fix some typos.

Signed-off-by: Martin Vrachev <[email protected]>
There are multiple limitations in the "typical" library that make it unsuitable
as an option for our new API, which is going to be used from a
variety of projects, some of which are big, like PyPI.

The main problems I have with it are:
1. It's a relatively unknown library with only 111 stars on GitHub.
2. It doesn't allow for custom validators. It's focused on type checking
and it allows for some constraints on the function arguments/class
attributes, but nothing fancier than that. Adding the possibility for
custom validators is planned for future versions.
3. It's a one-man show with only one maintainer actively working on the
project which is a big red sign for PyPI.
4. It adds 4 additional dependencies pytzdata, pendulum, inflection,
and itself - typical.
5. It doesn't support python 3.6.

Signed-off-by: Martin Vrachev <[email protected]>
@MVrachev
Collaborator Author

MVrachev commented Mar 18, 2021

If nobody has any other suggestions for libraries/options worth exploring, I will conclude my research.
Three options were documented: ValidationMixin, pydantic and marshmallow, and a couple more were researched which didn't meet our requirements.

I updated the pydantic branch with more examples and added a new branch for marshmallow (both linked above).
So for those of you who have reviewed it already, please have a fresh look at the ADR.
I changed many things in it from the initial version.

@MVrachev MVrachev marked this pull request as ready for review March 18, 2021 14:45
Signed-off-by: Martin Vrachev <[email protected]>
@MVrachev
Collaborator Author

MVrachev commented Mar 18, 2021

Of the third-party solutions, I strongly prefer pydantic over marshmallow.
pydantic is more intuitive, provides more validation features,
and is more focused on validation than marshmallow is.

The big question is: do we want to pay the price of adding two additional dependencies
instead of fixing or rewriting the workaround schemas and
using the ValidationMixin?

The useful features that are implemented in pydantic, but are missing from
the ValidationMixin, are the following (see the sketch after this list):

  1. Flexibility to validate function arguments everywhere - for class methods
    or regular functions.
  2. Custom pydantic types which give us free validation with minimal code:
  • strict types forbidding conversion: StrictInt (for versions), StrictStr, StrictBool (for flags),
    StrictBytes, etc.
  • types with built-in validation: PositiveInt, FilePath/DirectoryPath
    (automatically checks if a file/directory exists), etc.
  • conint, constr, etc. types with inline custom constraints.
    For example, if you want a strict positive int you would write
    conint(gt=0, strict=True); if you want to validate a 256-character hex string:
    constr(min_length=256, max_length=256, strict=True);
    and finally, if you want to validate the _type field:
    constr(regex=r'(root|timestamp|snapshot|targets)')
  3. Enforce constraints on function arguments beyond type checks:
    You can put constraints on your function arguments with a Field class.
    For example, a function with a strict int argument between 0 and 10 would be:
    def change_version(version: conint(gt=0, le=10, strict=True)).
  4. We can easily enforce all of our objects to be fully populated and valid
    during their lifespan by marking our fields (class attributes) as required
    and enforcing assignment validation.
    Additionally, each time we call validate_model() we would receive all validation
    errors plus errors for required class attributes that are missing.
  5. Custom class configuration options like:
  • validate_assignment: enforces validation on assignment
  • extra: defines the strategy when new unknown attributes appear in model initialization.
    Could be "ignore", "allow", or "forbid".
  • allow_mutation: if set to False, doesn't allow setting attributes
    with __setattr__
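To make the above concrete, here is a simplified sketch of these types in practice (the class and field names are only illustrative, and the key length is copied from the example above):

from pydantic import BaseModel, conint, constr, validate_arguments

class Example(BaseModel):
    # strict positive int: no bool/str/float coercion is accepted
    version: conint(gt=0, strict=True)
    # fixed-length hex-like string constraint
    keyid: constr(min_length=256, max_length=256, strict=True)

@validate_arguments
def change_version(version: conint(gt=0, le=10, strict=True)) -> None:
    print("new version is", version)

change_version(3)    # ok
change_version("3")  # raises pydantic.ValidationError (strict=True forbids coercion)
change_version(42)   # raises pydantic.ValidationError (violates gt=0, le=10)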

Finally, I would add that when adding a new dependency there will be a price
to be paid, but maintaining good validation functionality isn't cheap either.

pydantic seems like a popular option for a validation library with a steady
stream of contributions, and if bugs or security vulnerabilities emerge, they will
most likely be fixed fast.
Twice I asked questions and they were answered in under 24 hours: 1 and 2.

@trishankatdatadog
Member

Thanks for your investigations, Martin!

I hate to sound cynical, but I'm wary of any project (such as pydantic) that depends on one developer...

@MVrachev
Collaborator Author

MVrachev commented Mar 19, 2021

Thanks for your investigations, Martin!

I hate to sound cynical, but I'm wary of any project (such as pydantic) that depends on one developer...

@trishankatdatadog I don't agree with the statement that pydantic fully depends on one developer.
There are a couple of other developers actively working on the project and participating in issue discussions, and lately one of them, PrettyWood, contributes and participates in discussions even more than pydantic's creator, samuelcolvin:
[screenshot: GitHub contributor activity for pydantic]

So, if one day @samuelcolvin doesn't have time to work on pydantic he can pass it to @PrettyWood.
Additionally, there are more than 20 GitHub sponsors for @samuelcolvin, so there is an incentive to continue the project's development.
The only risk left is if @samuelcolvin wakes up angry at the world (or gets his account compromised) and decides to delete
the GitHub project, but even then there would be people who know enough about the project to continue its development in one shape or another.

@joshuagl
Member

joshuagl commented Mar 22, 2021

Thank you for digging in and researching these options, and the attention to detail in the presentation @MVrachev! The rich diff with the tables makes the information easier to process.

There are a couple of things that were noted in the initial Issue which I think are missing from this discussion:

  1. what options do we have if we want to do runtime type checking? (at our public API boundary – per the various discussions on schema, I do not think we need to type check all of the data we are generating and passing around internally in tuf)
    i. are there third-party runtime type checkers which are small and well maintained that we can incorporate?
    ii. are we best off doing runtime type checking ourselves? Perhaps a decorator (which has the added advantage of explicitly marking the public API) that will iterate a method's annotations and check that each argument is of the appropriate type (we might implement this using typing.get_type_hints(), or (less likely) inspect, or perhaps with something lower level like manually iterating the function's __annotations__). A minimal sketch of such a decorator follows after this list.
  2. If we opt to implement our own input validation code, what standard library features can we leverage to implement the code in a clean, concise, and pythonic way? There was a brief suggestion of using descriptors for attribute validation in Add validation guidelines #1130.
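As an illustration of point 1.ii, a minimal sketch of such a decorator (set_expires is a made-up example function; typing constructs such as Optional[...] or Dict[...] would need extra handling):

import functools
import inspect
import typing

def check_types(func):
    """Check that call arguments match the function's type annotations."""
    hints = typing.get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # only plain classes are checked here; typing constructs need more work
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: '{name}' must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper

@check_types
def set_expires(days: int) -> None:
    print(f"expires in {days} days")

set_expires(7)    # ok
set_expires("7")  # TypeError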

@samuelcolvin

samuelcolvin commented Mar 22, 2021

Thanks for considering pydantic (I'm the main maintainer).

Just wanted to chime in here and add a little to what @MVrachev says above, in no particular order:

  • as mentioned above, I'm not the only active developer of pydantic - there are others who know it well enough to take over if I stopped
  • I have no intention whatsoever of stopping my work on pydantic, if anything I'm hoping to spend more time on it in the future
  • Multiple large organisations use pydantic, including: microsoft (as part of office and in azure at least), amazon, facebook, netflix, uber, IBM, datadog, JP Morgan (that's just the ones who use it publicly, or forgot to add noreferrer to their internal issue tracker, so I can see referrals to the docs 😉) - there's quite a few people who rely on pydantic, it seems extremely likely that it would continue to be maintained if I got run over by a bus
  • If you want to build a system that has no dependency (or part of a dependency) which is primarily developed by one person, you had better start from silicon and work your way up - virtually all software has components which only one person has ever fully understood. That doesn't mean someone couldn't understand it if they had to.
  • Even projects associated with large corporations often have a very low "bus factor" as they are primarily the fiefdom of one developer; if that developer leaves, management may not see the point in putting the same resources behind the project going forward

The decision is entirely yours, I was just interested enough in the discussion to add my own perspective. Good luck.

@trishankatdatadog
Member

Just wanted to chime in here and add a little to what @MVrachev says above, in no particular order:

Thanks for your comments, Samuel. While I (of course) understand that no complicated software is w/o 3rd-party deps, let me try to clarify where I'm coming from:

  • pydantic is still under your individual account AFAICT. A dedicated org might help assuage others against recent incidents (not naming names, and not suggesting you will do the same, but some open-source libraries/extensions have been suddenly transferred to others).
  • I still think we can make do w/o using a 3rd-party library that may (or may not) be too complicated for our use case. Cryptography is one thing, validating input another. I have shown how a handwritten parser can do type-checking at the same time w/o too much work.

So, great to hear that you are planning to continue actively working on pydantic, and this was not a personal attack on you or your project, but I hope you understand where I'm coming from.

@MVrachev
Collaborator Author

MVrachev commented Mar 24, 2021

Thank you for digging in and researching these options, and the attention to detail in the presentation @MVrachev! The rich diff with the tables makes the information easier to process.

There are a couple of things that were noted in the initial Issue which I think are missing from this discussion:

  1. what options do we have if we want to do runtime type checking? (at our public API boundary – per the various discussions on schema, I do not think we need to type check all of the data we are generating and passing around internally in tuf)
    i. are there third-party runtime type checkers which are small and well maintained that we can incorporate?
    ii. are we best off doing runtime type checking ourselves? Perhaps a decorator (which has the added advantage of explicitly marking the public API) that will iterate a method's annotations and check that each argument is of the appropriate type (we might implement this using typing.get_type_hints(), or (less likely) inspect, or perhaps with something lower level like manually iterating the function's __annotations__).
  2. If we opt to implement our own input validation code, what standard library features can we leverage to implement the code in a clean, concise, and pythonic way? There was a brief suggestion of using descriptors for attribute validation in Add validation guidelines #1130.

After Joshua's feedback, I made a couple of additional (hopefully final) changes:

  1. Decided to compare all of our options against all of our requirements in a single table.
    I believe this is a lot better and we should include it in future ADRs.
    Thanks to @avelichka for the idea!
  2. I have done research on a runtime type check library called typeguard and on
    python decorators. I combined both to formulate a new balanced option between using a third-party tool and our custom validators.
    You can find my branch with examples linked in the pr description.
  3. Finally, I added one additional hybrid option using again typeguard and the ValidationMixin.
    I decided that there is no sense in creating a prototype for that option, given that I have documented
    both of those components in other options.

I know that this PR grew a lot and I changed many things in it, but please have a final look.
I really hope no other options will appear and we can continue with the discussion.

PS: I checked how many lines of code typeguard is using sloccount, and it seems it is
a little above 2400.

@samuelcolvin

Hi @trishankatdatadog, no problem, I get entirely where you're coming from.

I actually don't think a @pydantic-code organisation with a couple of admins is really any better than it being under my own account, but I understand why it looks more serious. I do intend to move pydantic one day, but that's another story and not likely in the near future.

@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch from 743b4da to b486d24 Compare March 30, 2021 10:50
@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch from b486d24 to b37ebf7 Compare March 30, 2021 11:11
@MVrachev MVrachev force-pushed the adr-7-validation-guideliness branch from b37ebf7 to 0559bbc Compare March 30, 2021 11:12
@MVrachev
Collaborator Author

MVrachev commented Mar 30, 2021

After this research and a review of all available options, I think that:

  1. pydantic is the best third-party solution and a great choice, because it:
  • is the easiest to use of all
  • adds a ton of customization (see my comment and their documentation)
  • enforces assignment validation on class attributes
  • does all the things we want
  • is maintained by more than 1 developer

but it will add (using my rough estimates) around 9000 lines of code (with the additional dependency typing_extensions) and 26 files to be vendored into pip.

  2. The best balance between adding as few additional dependencies as possible and ease of use is to use typeguard + descriptors, because:
  • we get strict type checking for all kinds of functions and can still decide not to validate when we want
  • with Python descriptors we can mimic the custom restrictions we place on class attributes, as in my code examples (the predicate argument for String; see the sketch below)
  • it enforces assignment validation on class attributes
  • it adds one really small dependency, adding (using my rough estimates) around 900 lines of code and 3 files to be vendored into pip

but we have to be conscious that typeguard is maintained by one contributor and we have work to do to implement our own
descriptor classes.
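For reference, the descriptor approach looks roughly like this (a sketch along the lines of my branch, not the exact code):

class String:
    """Data descriptor enforcing a str value plus an optional custom predicate."""

    def __init__(self, predicate=None):
        self.predicate = predicate

    def __set_name__(self, owner, name):
        self.name = name
        self.private_name = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        # every assignment goes through here, so validation cannot be forgotten
        if not isinstance(value, str):
            raise TypeError(f"{self.name} must be a str")
        if self.predicate is not None and not self.predicate(value):
            raise ValueError(f"invalid value for {self.name}: {value!r}")
        setattr(obj, self.private_name, value)

class Signed:
    spec_version = String(predicate=lambda v: len(v.split(".")) == 3)

    def __init__(self, spec_version: str):
        self.spec_version = spec_version  # triggers String.__set__

Signed("1.0.0")  # ok
Signed("1.0")    # ValueError raised on assignment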

All other options have a non-negligible list of disadvantages compared to those two.

# Be careful if you rely on _validate_id() to verify self.id!
# This won't be called if new_name is "".
self.validate()
```
Member


I don't think the problem is related to different code paths, but rather that User.validate() is simply not suited for non-user input validation. In your example you are not validating inputs, instead you are validating a user object after having performed some unvetted modifications.

If you want to use the function for input validation you have to do it when the input is a user object, e.g.:

def print_user(user: User):
  user.validate()
  print(user)

In either case you have to make sure to actually call the function.

Collaborator Author


Yes, I don't do argument validation.
Your example showcases what would happen if you pass an object implementing validate(), but it's not always like that, and for simple types like int, str, etc. I don't expect us to create custom classes.

With this example, I wanted to stress my point that, because you have no function argument validation
and no validation on assignment, you can change your class attributes and forget to call validate() at the end.

With the `in-toto` implementation of the `ValidationMixin`, we can only validate
class attributes inside class methods.
If we want to validate functions outside classes or function arguments we would
have to enhance this solution.
Member


I am not sure what you mean by "validate functions", but as you can see in my print_user example above, you can validate function arguments outside of classes. But yes, you can only validate objects that implement a validate method.

Signed-off-by: Martin Vrachev <[email protected]>
@lukpueh
Member

lukpueh commented Mar 31, 2021

Martin, thanks a ton for the detailed investigation and your assessment. Given all the things I learned from the discussions here and in #1130, I took the liberty to take a step back and try to re-think what we actually need and how we can achieve it with the tools at hand, and without large dependencies (pydantic, marshmallow) or magic power features (descriptors).

I am also aware that this might only structure the discussion in my mental model, and not in others'. So if you have the feeling that my comment rather derails the discussion, please disregard it.


Validation Use Cases

  1. validation of TUF metadata object conformance / compliance (a safety requirement)
  2. validation of untrusted inputs (a security requirement)
    1. function arguments
      1. complex data types for which TUF metadata model classes exist
      2. simple data types with special semantics
      3. simple data types
    2. TUF metadata files

Validation Tools

  • simple validators: a module with reusable validation functions for simple data types that are not represented by a class but need extra semantic validation in addition to data type validation, e.g. keyids are hex strings of a certain length, version numbers are integers greater than 0, etc.
  • validation mixin: see in-toto, may use simple validators for "leaves" in the class model tree
  • type checks

Suitable Tools per Use Case

  1. conformance / compliance --> validation mixin
  2. untrusted inputs
    1. function arguments
      1. complex data types --> validation mixin
      2. simple data types with special semantics --> simple validators
      3. simple data types --> type check
    2. TUF metadata files --> delegates validation to class model functions/constructors (i.e. use cases 2.i.)

Demonstration

# Exemplary simple validator (validators.py)
def version(val: int):
  # TODO: check type first (?)
  if val <= 0:
    raise ValueError("version must be > 0")



# Exemplary TUF class that implements ValidationMixin (simplified for emphasis)
class Root(ValidationMixin):
  # ...
  def __init__(self, version: int, keys: Keys):
    # TODO: check types first (?)
    tuf.validators.version(version)
    keys.validate()
    # ... assign validated attributes

  def _validate_keys(self):
    # TODO: check type first (?)
    self.keys.validate() # recursively use ValidationMixin for class model tree

  def _validate_version(self):
    validators.version(self.version) # use simple validators for "leaves" in the class model tree



# Example for use case 1
# Manually compose metadata and validate conformance, 
# e.g. in an explorative context (tutorial, initial repo setup, etc.)
>>> my_root = Root()
# ...
>>> my_root.version = 1
>>> my_root.validate()


# Example for use cases 2.i.*
def assign_new_version_and_print_message__yes_it_is_silly(root: Root, version: int, message: str):
  # TODO: check all types first (?)
  root.validate()
  tuf.validators.version(version)
  root.version = version
  print(message)


# Example for use case 2.ii. (simplified for emphasis)
def deserialize_json(data):
   # Delegate validation to json.loads and Root.from_dict --> Root.__init__
  Root.from_dict(json.loads(data))

Some more thoughts

  • How much type checking is really needed? Isn't it okay to just fail with TypeError if above validator.version(val: int) is called with a value that does not support '<', or to fail with an AttributeError if we call validate() on a value that does not implement such a method? Do I really need to check every message to be printed is a string? AFAIK print eats pretty much anything.
    --> see duck typing and EAFP Python principles!!

  • If we check types we should use a decorator that iterates over arguments and their type annotation.

  • If we use such a decorator, typeguard seems like a good choice, but is it really worth adding the dependency, or should we just borrow the @check_type decorator?

  • We could also create a simple custom decorator, where we register:

    • arguments for which validate from the mixin should be called,
    • arguments for which a validation function should be called
    • (and optionally arguments whose type should be checked).

Such a decorator could also fail with a better error than AttributeError, if an argument is expected to implement validate but does not.
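A rough sketch of such a registration decorator (the names are made up; validators.version refers to the simple validator module from the demonstration above):

import functools
import inspect

def validated(**rules):
    """Per-argument validation: "mixin" calls arg.validate(), a type triggers an
    isinstance() check, and any other callable is used as a validation function."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, rule in rules.items():
                value = bound.arguments[name]
                if rule == "mixin":
                    if not hasattr(value, "validate"):
                        raise TypeError(f"'{name}' must implement validate()")
                    value.validate()
                elif isinstance(rule, type):
                    if not isinstance(value, rule):
                        raise TypeError(f"'{name}' must be {rule.__name__}")
                else:
                    rule(value)  # e.g. validators.version
            return func(*args, **kwargs)

        return wrapper
    return decorator

@validated(root="mixin", version=validators.version, message=str)
def assign_new_version_and_print_message(root, version, message):
    root.version = version
    print(message)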

@MVrachev
Collaborator Author

MVrachev commented Apr 1, 2021

Martin, thanks a ton for the detailed investigation and your assessment. Given all the things I learned from the discussions here and in #1130, I took the liberty to take a step back and try to re-think what we actually need and how we can achieve it with the tools at hand, and without large dependencies (pydantic, marshmallow) or magic power features (descriptors).

Honestly, I don't see descriptors as so magical.
"Descriptors" is a fancy word for overriding __set_name__, __get__ and __set__.
Overriding __get__ and __set__ is not a new idea, given that setters and getters are used in probably all object-oriented programming languages; Python just calls them for you automatically.
Additionally, the @property and @<variable_name>.setter decorators are other well-known ways of overriding
the get and set functionality.
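For comparison, a property-based equivalent of such a per-attribute check would be (a sketch, not code from any of the branches):

class Signed:
    @property
    def version(self) -> int:
        return self._version

    @version.setter
    def version(self, value: int) -> None:
        # like a descriptor, this runs on every assignment to `version`
        if not isinstance(value, int) or value <= 0:
            raise ValueError("version must be a positive int")
        self._version = value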

I am also aware that this might only structure the discussion in my mental model, and not in others' . So if you have the feeling that my comment rather derails the discussion, please disregard it.

Validation Use Cases

  1. validation of TUF metadata object conformance / compliance (a safety requirement)

  2. validation of untrusted inputs (a security requirement)

    1. function arguments

      1. complex data types for which TUF metadata model classes exist
      2. simple data types with special semantics
      3. simple data types
    2. TUF metadata files

Seems to me that those are our use cases, yes.


Some more thoughts

  • How much type checking is really needed? Isn't it okay to just fail with TypeError if above validator.version(val: int) is called with a value that does not support '<', or to fail with an AttributeError if we call validate() on a value that does not implement such a method? Do I really need to check every message to be printed is a string? AFAIK print eats pretty much anything.
    --> see duck typing and EAFP Python principles!!

The duck typing principle is well suited for objects, but for simple types I am not so convinced.
My concern is about conversions like float -> int, str -> int, etc.
We are working on a security project and I believe that requiring strict types where it makes sense is a logical validation step.

  • If we check types we should use a decorator that iterates over arguments and their type annotation.

  • If we use such a decorator, typeguard seems like a good choice, but is it really worth adding the dependency, or should we just borrow the @check_type decorator?

  • We could also create a simple custom decorator, where we register:

    • arguments for which validate from the mixin should be called,
    • arguments for which a validation function should be called
    • (and optionally arguments whose type should be checked).

Such a decorator could also fail with a better error than AttributeError, if an argument is expected to implement validate but does not.

As I said before, and I will (at least try to) repeat myself for the last time: automatic validation on each assignment is a lot better
than relying on the developer to remember to place validate() at the beginning/end of the function.
There are so many PRs/commits we pushed because we forgot to add something small or made a small mistake: #1302, #1226, cf49021, bf35723, etc.

@jku
Member

jku commented Apr 9, 2021

This is a tricky one to review... My input below is unlikely to lead to a specific 'decision outcome' but I think it's the best I can do.

I think the text concentrates a lot on the "tools" (modules and dependencies): they are obviously important but they are just ways to get to the actual goal (figuring out what are the important things to validate, how to efficiently validate them, and making the process easy to replicate).

Very briefly on the considered modules:

  • Thanks for the comparison data, very useful
  • I think with pip integration in mind the bar for new dependencies should be really, really high (for those not aware, the plan is to include TUF and all its dependencies in pip source code). TUF is already a huge amount of code, I don't want it to be larger. This applies to all the tools but mostly of course pydantic
  • typeguard is the one I looked at more closely and it looks a bit magical to me: the frame inspection (and frame modification!) feels like a bit much when we mostly want to validate API input. I understand that's what's needed to get that sort of integration but... If we go with this I'd like to see more testing on performance and supported features (just as an example, it seems to choke on string-defined class names, e.g. def func(arg: 'MyClass'):)

I think I pretty much agree with Lukas' long post, but some specific opinions:

  • Let's not add new dependencies based on current info
  • Let's start adding case-by-case validation that works for that case (Lukas talked about some cases, won't repeat here). Most important is to handle the API input validation (so typically the metadata constructors)
  • Should document best practices as they're found
  • I'm fine with experimenting with decorators to make validation more reliable: I do not believe we can make a good decision either way here without having very specific and real examples
  • I'm fine with experimenting with descriptor based validators -- but let's decide when we have a real case of descriptor validator vs normal setter functions that validate. Descriptors are definitely more magical than setters in my mind (as the code that gets run on attribute set is not in the class where the attribute is: it's in the class that implements the validator. This is pretty much hidden from someone reading the top level class code). Still they may be worth trying
  • I'm not sold on the idea that we need to add strict type checking e.g. for function arguments in general: It's explicitly against the language design, it's not what python developers expect and I don't think we'll manage to do it with reasonable amount of effort. For API input, so metadata constructor arguments (and Signature, etc), some instance type checking could be a good idea... but I don't know if a blanket rule is appropriate?

@joshuagl
Member

Thanks again for all the research here @MVrachev, and for the detailed review @lukpueh and @jku.

Here's my attempt to summarise the discussion so far:

What decisions are made today:

  • Agreement that we need to be able to validate metadata object conformance
  • Agreement that we need to be able to validate untrusted inputs: function arguments for user API, untrusted metadata files (ones we do not create ourselves)
  • Strong desire to avoid any additional dependencies

What decisions are not yet made:

  • How to integrate validation: MixIn's, descriptors, decorators
  • Whether there is any value in/desire for runtime type checking

For ADR0007 and this PR we could either a) update the ADR to capture only the decisions where we have agreement, and merge OR b) convert the PR to draft while we try to resolve the additional items.

Based on the discussion here, I think good next steps would be:

  1. start to create some reusable validation functions for simple types with special semantics (i.e. a tuf/api/validators.py module): spec_version, version, keyid (a minimal sketch follows after this list),
  2. Experiment with using those validation functions in tuf.api, particularly in metadata.py – I would be particularly interested in:
    i. seeing whether the different approaches (descriptors/decorators/mixins) would force any changes to how the metadata API is currently being used in warehouse and the experimental-client.
    ii. gauging whether the different approaches (descriptors/decorators/mixins) support both the documented requirement to "A way to invoke all validation functions responsible to validate the class
    attributes in the middle of function execution" and the implicit requirement in the current ADR to avoid explicit calls to a validation function which may be missed due to programmer error.
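A minimal sketch of what such a tuf/api/validators.py module could contain (the exact checks, e.g. the assumed 64-character keyid length, are only illustrative):

def version(value: int) -> None:
    if not isinstance(value, int) or value <= 0:
        raise ValueError("version must be an int greater than 0")

def spec_version(value: str) -> None:
    parts = value.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError("spec_version must look like MAJOR.MINOR.PATCH")

def keyid(value: str) -> None:
    # 64 characters (sha256 hex digest) assumed here only for illustration
    if len(value) != 64 or any(c not in "0123456789abcdef" for c in value):
        raise ValueError("keyid must be a 64-character lowercase hex string")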

@MVrachev
Collaborator Author

For ADR0007 and this PR we could either a) update the ADR to capture only the decisions where we have agreement, and merge OR b) convert the PR to draft while we try to resolve the additional items.

I think it will be better if we keep this PR as a draft and open until we have a final decision.
This will prompt us to decide and not postpone this any longer than necessary.

Based on the discussion here, I think good next steps would be:

  1. start to create some reusable validation functions for simple types with special semantics (i.e. a tuf/api/validators.py module): spec_version, version, keyid,
  2. Experiment with using those validation functions in tuf.api, particularly in metadata.py – I would be particularly interested in:
    i. seeing whether the different approaches (descriptors/decorators/mixins) would force any changes to how the metadata API is currently being used in warehouse and the experimental-client.
    ii. gauging whether the different approaches (descriptors/decorators/mixins) support both the documented requirement to "A way to invoke all validation functions responsible to validate the class
    attributes in the middle of function execution" and the implicit requirement in the current ADR to avoid explicit calls to a validation function which may be missed due to programmer error.

I will create validation functions and will experiment with descriptors that utilize them.

@jku
Member

jku commented Sep 8, 2021

I'm going to close this: it's valuable work but not for merging. Maybe we are at a point where we can evaluate our validation and needs again (like do we need serialization-time validation or do we have validation code that could be separated out of the object construction/deserialization) but that doesn't require this PR to stay open for months
