
ADR0007: How we are going to validate the new API codebase #1301

**docs/adr/0007-validation-guideliness.md** (145 additions)
# How we are going to validate the new API codebase

* Date: 2021-03-10

Technical Story:
- [securesystemslib schema checker issues](https://github.com/secure-systems-lab/securesystemslib/issues/183)
- [new TUF validation guidelines](https://github.com/theupdateframework/tuf/issues/1130)

## Context and Problem Statement

1. Some schemas sound more specific than they are.
2. Some schemas are an odd replacement for constants.
3. Schema validation is generally **overused**: besides user input, we also
validate data generated programmatically by our own private functions.
4. There are instances where some attributes are validated multiple times
when executing one API call.
5. Schema checking sometimes makes execution branches unreachable.
6. The error messages from checking schemas are often not helpful.

## Decision Drivers and Requirements
Some of the requirements we want to meet are:
1. The ability to decide which functions to validate and which not to.
2. Support for custom, deeper validation beyond type checks.
3. As little performance overhead as possible.
4. As few new dependencies as possible.
5. Support for all Python versions we are using.

## Considered Options
1. Usage of a `ValidationMixin`.
2. Usage of a third-party library called `pydantic`.

## Pros, Cons, and Considerations of the Options

### Option 1: Usage of a ValidationMixin

**Note:** All pros, cons, and considerations are documented with the assumption
that we would implement the `ValidationMixin` the same way it is implemented in
[in-toto](https://github.com/in-toto) up to version 1.0.1 (the latest
version at the time of writing).
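For reference, a minimal sketch of such a mixin, modeled loosely on the in-toto
implementation (the `User` class and its fields are purely illustrative):

```python
class ValidationMixin:
    """Sketch of an in-toto-style mixin: validate() runs every _validate_* method."""

    def validate(self):
        for name in dir(self):
            if name.startswith("_validate_"):
                getattr(self, name)()


class User(ValidationMixin):
    def __init__(self, id: int) -> None:
        self.id = id
        # One call validates all fields that have a _validate_* method.
        self.validate()

    def _validate_id(self):
        if not isinstance(self.id, int) or self.id < 0:
            raise ValueError("id must be a non-negative int")
```

With this sketch, `User(1)` constructs successfully, while `User(-1)` raises a
`ValueError` from `_validate_id`.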

* Good, because it is concise: a single `validate()` call validates
multiple fields.

* Good, because it allows reuse of the validation code through
`securesystemslib.schemas` or another schema of our choice.
**Review comment (Member):**

I think you are focusing too much on securesystemslib.schemas. There is an unspoken consensus that we want to get rid of it (see secure-systems-lab/securesystemslib#183). And I don't think that its use is a defining aspect of the ValidationMixin. IIRC we only used them in in-toto because it was already there and we lacked a comprehensive class model for all complex data types.

Take for instance the _validate_keys method on the in-toto Layout class:

```python
def _validate_keys(self):
    securesystemslib.formats.ANY_PUBKEY_DICT_SCHEMA.check_match(self.keys)
```

Now if we had a PublicKey class with its own validators -- something we intend to add as per ADR4 particularly for the purpose of validation -- then the validator would probably look like this:

```python
def _validate_keys(self):
    if not isinstance(self.keys, dict):
        raise ...

    for keyid, key in self.keys.items():
        if not isinstance(key, PublicKey):
            raise ...
        key.validate()

        # NOTE: Even though ADR4 only talks about classes for complex attributes
        # it probably makes sense to add a `KeyID` class as well just so that we
        # have a reusable way of checking hex strings of a certain length.
        if not isinstance(keyid, KeyID):
            raise ...
        keyid.validate()
```

Or we could add a PublicKeyDict class, and do:

```python
def _validate_keys(self):
    if not isinstance(self.keys, PublicKeyDict):
        raise ...
    self.keys.validate()
```

Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us, so that we always know that an in-toto layout's pubkeys variable contains a valid Dict[KeyID, PublicKey] value. Then we can use the _validate_* for more complex checks, such as:

```python
def _validate_keys(self):
    # We already know that 'self.pubkeys' is a valid Dict[KeyID, PublicKey]
    # value so no need to check this here...

    # ... but we also want to make sure that we don't have duplicate keys.
    if len(self.pubkeys.keys()) != len(set(self.pubkeys.keys())):
        raise ...
```

I'd say the ValidationMixin really just provides a shortcut -- i.e. validate() -- to all _validate_* methods on a class, which seems very similar to pydantic's @validator decorator (please correct me if I'm wrong!!)

So the main question to me is, is that decorator better than our custom _validate_* convention, and does pydantic have other features (that we can't just easily add to our ValidationMixin) that justify a +7K LOC dependency. :)

**Review comment (Member):**


Agreed: ideally we should use no 3rd-party dependencies.

**MVrachev (Collaborator, Author), Mar 19, 2021:**


I tried to word the ADR so that the `ValidationMixin` idea doesn't sound fully
dependent on the schemas, but it seems it still reads that way :D.

I understand the idea with the ValidationMixin and it has good points.
If we decide to go down that path, we would probably fix the existing schemas
rather than remove them outright.

> Or, and that's my preferred way, we let a type annotation checker do the basic type checking for us,

@lukpueh which type of annotation checker do you use in in-toto?
How many dependencies will it add into tuf and (if you can easily check that) how many lines of code?

As I said in my detailed comment here #1301 (comment) there are more useful features in pydantic which could be useful for us.


* Bad, because there could be different code paths and return statements, and as
a consequence there could be a code path which doesn't call `validate()`.

Example:
```python
class User(ValidationMixin):

    def __init__(self, id: int, nickname: str) -> None:
        self.id = id
        self.nickname = nickname
        self.pro_user = False

        self.validate()

    def _validate_id(self):
        if not isinstance(self.id, int):
            raise FormatError('id should be of type int')

        if self.id < 0:
            raise ValueError('id is expected to be a positive number')

    def update_profile(self, new_id: int, new_nickname: str):
        self.id = new_id

        if not self.pro_user:
            print('Standard users can only change their id! '
                  'If you want to change your nickname become a pro user.')
            return

        self.nickname = new_nickname
        # Be careful if you rely on _validate_id() to verify self.id!
        # validate() is never reached for standard users because of the
        # early return above.
        self.validate()
```
**Review comment (Member):**

I don't think the problem is related to different code paths, but rather that User.validate() is simply not suited for non-user input validation. In your example you are not validating inputs, instead you are validating a user object after having performed some unvetted modifications.

If you want to use the function for input validation you have to do it when the input is a user object, e.g.:

```python
def print_user(user: User):
    user.validate()
    print(user)
```

In either case you have to make sure to actually call the function.

**MVrachev (Collaborator, Author):**


Yes, I don't do argument validation.
Your example showcases what happens when you pass an object implementing
`validate()`, but it's not always like that, and for simple types like `int`,
`str`, etc. I don't expect us to create custom classes.

With this example I wanted to stress that, because there is no function argument
validation and no validation on assignment, you can change your class attributes
and forget to call `validate()` at the end.


* *Personal opinion*: bad, because it's not a clean solution from an OOP
perspective to inherit from `ValidationMixin` in classes without an "IS A"
relationship with it.

* Consideration: if we use this option, we are limited in what can be validated.
With the `in-toto` implementation of the `ValidationMixin`, we can only validate
class attributes inside class methods.
If we want to validate functions outside classes or function arguments, we would
have to enhance this solution.
**Review comment (Member):**

I am not sure what you mean by "validate functions", but as you can see in my print_user example above, you can validate function arguments outside of classes. But yes, you can only validate objects that implement a validate method.


* Consideration: if we use this option, we would be responsible for the code,
and all identified issues related to `securesystemslib.schemas` would have to
be resolved by us, or the schema implementation replaced with something else.

* Consideration: if we want to enforce assignment validation, this solution
should be combined with custom "setter" properties.
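A hypothetical sketch of what such assignment validation with setter properties
could look like (the `User` class and its field are illustrative):

```python
class User:
    def __init__(self, id: int) -> None:
        self.id = id  # routed through the setter below, so it is validated

    @property
    def id(self) -> int:
        return self._id

    @id.setter
    def id(self, value: int) -> None:
        # Validation runs on every assignment, not only in __init__.
        if not isinstance(value, int) or value < 0:
            raise ValueError("id must be a non-negative int")
        self._id = value
```

Here `u.id = -1` raises a `ValueError` even after construction, which closes
the "mutate and forget to call validate()" gap discussed above.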

### Option 2: Usage of a third-party library called "pydantic"

* Good, because it's flexible:
1. There is a `@validate_arguments` decorator which lets us decide which
functions to validate, including functions outside classes.
2. There is a `@validator` decorator which allows deeper validation
beyond type checking for our class attributes.
3. We can use an embedded `Config` class inside our classes, which allows for
even more customization (for example, enforcing assignment validation).
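A sketch of these three features, assuming the pydantic v1 APIs (the model and
field names are made up for illustration):

```python
from pydantic import BaseModel, validator, validate_arguments


class User(BaseModel):
    id: int
    nickname: str

    class Config:
        # Re-run validation on every attribute assignment,
        # not only when the object is created.
        validate_assignment = True

    @validator("id")
    def id_must_be_non_negative(cls, value):
        # Deeper validation beyond the type check.
        if value < 0:
            raise ValueError("id must be non-negative")
        return value


@validate_arguments
def greet(nickname: str) -> str:
    # A plain function, validated outside any class.
    return "hello " + nickname
```

With this model, `User(id=1, nickname="alice")` succeeds, while both
`User(id=-1, ...)` and a later assignment `u.id = -1` raise a validation error.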

* Good, because (according to their documentation) `pydantic` is the fastest
validation library compared to others (including our other third-party library
option `marshmallow`).
See: https://pydantic-docs.helpmanual.io/benchmarks/

* Good, because it builds on Python's built-in type hints, available from
`python 3.6` onwards.

* Good, because it allows for strict type checks through the `StrictInt`, `StrictStr`,
`StrictFloat`, `StrictBool`, and `StrictBytes` types defined in `pydantic`.
A class-wide strict mode, where all fields are automatically treated as
"strict", has not been implemented yet, but there is a discussion about it:
See: https://github.com/samuelcolvin/pydantic/issues/1098
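For illustration, a small sketch of the difference between the default
coercing types and the strict types (the model names are made up):

```python
from pydantic import BaseModel, StrictInt


class LooseModel(BaseModel):
    n: int        # the string "42" would be coerced to the int 42


class StrictModel(BaseModel):
    n: StrictInt  # the string "42" is rejected with a validation error
```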

* Bad, because there is a learning curve when using `pydantic`.
1. For example, when I had to handle the `_type` attribute in `Signed`, it took
a lot of reading to understand that standard attributes whose names begin with
"_" are ignored. The `_type` attribute can only be a `PrivateAttr`
(defined in `pydantic`), even though we don't handle it as a typical private
attribute.
2. Also, I had difficulties using `pydantic` where there is inheritance:
the initialization and validation of new objects was tricky.

* Bad, because it adds `2` new dependencies: `pydantic` and `typing-extensions`.
This was concluded by performing the following steps:
1. Creating a fresh virtual environment with python3.8.
2. Installing all dependencies in `requirements-dev.txt` from `tuf`.
3. Installing `pydantic` with `pip install pydantic`.

## Links
* [in-toto ValidatorMixin](https://github.com/in-toto/in-toto/blob/74da7a/in_toto/models/common.py#L27-L40)
* [ValidatorMixin usage](https://github.com/in-toto/in-toto/blob/74da7a/in_toto/models/layout.py#L420-L438)
* [Pydantic documentation](https://pydantic-docs.helpmanual.io/)

## Decision Outcome

*TODO: Make and describe the decision*
**docs/adr/index.md** (1 addition)

This log lists the architectural decisions for tuf.
- [ADR-0004](0004-extent-of-OOP-in-metadata-model.md) - Add classes for complex metadata attributes
- [ADR-0005](0005-use-google-python-style-guide.md) - Use Google Python style guide with minimal refinements

TODO: ADD TO INDEX
<!-- adrlogstop -->

For new ADRs, please use [template.md](template.md) as basis.