Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR0007: How we are going to validate the new API codebase #1301

344 changes: 344 additions & 0 deletions docs/adr/0007-validation-guideliness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,344 @@
# How we are going to validate the new API codebase

* Date: 2021-03-10

Technical Story:
- [securesystemslib schema checker issues](https://github.com/secure-systems-lab/securesystemslib/issues/183)
- [new TUF validation guidelines](https://github.com/theupdateframework/tuf/issues/1130)

## Context and Problem Statement

1. Some schemas sound more specific than they are.
2. Some schemas are an odd replacement for constants.
3. Schema validation is generally **overused**. Together with user input,
we are validating input programmatically generated from our private functions.
4. There are instances where some attributes are validated multiple times
when executing one API call.
5. Schema checking sometimes makes execution branches unreachable.
6. The error messages from checking schemas are often not helpful.

## Decision Drivers and Requirements
Some of the requirements we want to meet are:
1. The ability to decide which functions to validate and which not.
2. Allow for custom deeper validation beyond type check.
3. As little as possible performance overhead.
4. Add as minimal number of dependencies as possible.
5. Support for all python versions we are using.
MVrachev marked this conversation as resolved.
Show resolved Hide resolved
6. Reuse code used for validation.
7. A way to invoke all validation functions responsible to validate the class
attributes in the middle of function execution.

## Considered Options
1. Usage of a `ValidationMixin`.
2. Usage of a third-party library called `pydantic`.
3. Usage of a third-party library called `marshmallow`.
4. Combined option including a third-party library called `typeguard`, and
usage of python descriptors.
5. Combined option including a third-party library called `typeguard`, and
usage of `ValidationMixin`

## Pros, Cons, and Considerations of the Options

Here is how all of our options compare against our requirements:

| Number | Requirement | ValidationMixin | pydantic | marshmallow | typeguard + Descriptors | typeguard + ValidationMixin |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | Validate function args everywhere | Limited | ✓ | Limited | ✓ | ✓ |
| 2 | Custom deeper validation | ✓ | ✓ | ✓ | ✓ | ✓ |
| 3 | Performance overhead | Minimal | [Fastest](https://pydantic-docs.helpmanual.io/benchmarks/) | [Slower](https://pydantic-docs.helpmanual.io/benchmarks/) | No information | No information |
| 4 | New [dependencies ; files ; lines of code;] | [0; no info; no info] | [2; 26; 8982] | [1; 13; 2652] | [1; 3; 899] | [1; 3; 899] |
| 5 | Support for all python versions | ✓ | ✓ | ✓ | ✓ | ✓ |
| 6 | Code reuse for validation | ✓ | ✓ | ✓ | ✓ | ✓ |
| 7 | Way to invoke all validators | ✓ | ✓ | ✓ | ✘ | ✓ |

For requirement number 4, the calculations were made with [sloccount](https://linux.die.net/man/1/sloccount)
on March 30-th 2021.
Those results were concluded by performing the following steps for each of the
options using third-party libraries:
1. Creating a fresh virtual environment with python3.8.
2. Installing all dependencies in `requirements-dev.txt` from `tuf`.
3. Install `<package_name>` with `pip install <package_name>`.
4. Use `sloccount` to count added lines and files by the package and its
dependencies.

Only source files were used when calculating the lines of code and number of
files (meaning all test files, doc folders, and other meta or configuration
files were excluded).

Bellow, in the following sections, there are additional pros, cons, and
considerations for each of the options.

### Option 1: Usage of a ValidationMixin

**Note:** All pros, cons, and considerations are documented with the assumption
we would implement the `ValidationMixin` the same way it is implemented in
[in-toto](https://github.com/in-toto) until version 1.0.1 (the latest
version at the time of writing.)

* Bad, because there could be different code paths and return statements, and as
a consequence there could be a code path that doesn't call `validate()`.

Examle:
```python
class User(ValidationMixin):

def __init__(self, id: int, nickname: str) -> None:
self.id = id
self.nickname = nickname
self.pro_user = False

self.validate()

def _validate_id(self):
if not isinstance(self.id, int):
raise FormatError(f'id should be from type int')

if self.id < 0:
raise ValueError(f'id is expected to be a positive number')

def update_profile(self, new_id: int, new_nickname: str):
self.id = new_id

if not self.pro_user:
print(f'Standart users can only change their id! '
f'If you want to change your nickname become a pro user.)

return

self.nickname = new_nickname
# Be careful if you rely on _validate_id() to verify self.id!
# This won't be called if new_name is "".
self.validate()
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the problem is related to different code paths, but rather that User.validate() is simply not suited for non-user input validation. In your example you are not validating inputs, instead you are validating a user object after having performed some unvetted modifications.

If you want to use the function for input validation you have to do it when the input is a user object, e.g.:

def print_user(user: User):
  user.validate()
  print(user)

In either case you have to make sure to actually call the function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't do argument validation.
Your argument showcases what would happen if you pass an object implementing validate(), but it's not always like that and for simple types like int, str, etc. I don't expect us to create custom classes.

With this argument, I wanted to stress my argument, that because you have no function argument validation
and no validation on assignment you can change your class attributes and forget to call validate() in the end.


* *Personal opinion*: bad, because it's not a clean solution from an OOP
perspective to inherit `ValidationMixin` from classes without a "IS A"
relationship with it.
MVrachev marked this conversation as resolved.
Show resolved Hide resolved

* Consideration (*related to requirement 1*): if we use this option, we are limited on
what can be validated.
With the `in-toto` implementation of the `ValidationMixin`, we can only validate
class attributes inside class methods.
If we want to validate functions outside classes or function arguments we would
have to enhance this solution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean by "validate functions", but as you can see in my print_user example above, you can validate function arguments outside of classes. But yes, you can only validate objects that implement a validate method.


* Consideration: if we use this option, we would be responsible for the code
and all identified issues related to `securesystemslib.schemas` should be
resolved by us or replace the schema implementation with something else.

* Consideration: if we want to enforce assignment validation, this solution
should be combined with custom "setter" properties.

### Option 2: Usage of a third-party library called "pydantic"

* Good, because it's flexible:
1. There is a `@validate_arguments` decorator which allows us to decide which
functions to validate and the ability to validate functions outside classes.
2. There is a `@validator` decorator which allows us to make a deeper validation
beyond type checking for our class attributes.
3. We can use an embedded `Config` class inside our classes, which allows for
even more customization (for example enforce assignment validation).

* Good, because it allows for strict type checks through `StrictInt`, `StrictStr`,
`StrictFloat`, `StrictBool` and `StrictBytes` types defined in `pydantic` .
They have not yet implemented a classwide strict mode where all fields will be
considered automatically as "strict", but there is a discussion about it:
See: https://github.com/samuelcolvin/pydantic/issues/1098

* Good, because it provides additional custom types (with their own built-in
validation) like `FilePath`/`DirectoryPath` (like `typing.Path`, but there is also
validation that the file/directory exists), `PositiveInt`, `IPvAnyAddress`
(for IP versions 4 and 6), `HttpUrl` (for HTTP and HTTPS URLs) etc.
Also, `pydantic` has field (class attributes) constraints. This could be useful
when verifying that a HEX string has the expected length.

* Good, if we decide to require that our objects should be valid and fully
populated all of the time. This could easily be marking our fields (class
attributes) as `required` (which will raise an error for partially populated
objects) and embed a `Config` class inside our classes with an option to enforce
assignment validation.

* Bad, because there is a learning curve when using `pydantic`.
1. For example, when I had to handle the `_type` attribute in `Signed` it took me
a lot of reading to understand that standard attributes whose name begin with
"_" are ignored. The `_type` attribute can only be `PrivateAttr`
(defined in `pydantic`) even though we don't handle it as a typical private
attribute.
2. Also, I had difficulties using `pydantic` when there is inheritance.
The initialization and validation of new objects was tricky.

* Bad, because it adds `2` new dependencies: `pydantic` and `typing-extensions`.
This was concluded by performing the following steps:
1. Creating a fresh virtual environment with python3.8.
2. Installing all dependencies in `requirements-dev.txt` from `tuf`.
3. Install `pydantic` with `pip install pydantic`.

* Explanation about requirment number 7: there is a helper function called
`validate_model(model: Type[BaseModel], input_data: DictStrAny)` which can be
used to invoke all validators from your current or parent classes.
This is how it can be used:
```python
class User(BaseModel):
email: StrictStr

@validator('email')
def validate_email(cls, email):
# Some regular expression checks for a valid email here
return email

def change_email(self, new_email: StrictStr):
self.email = new_email


class RepositoryManager(BaseModel):
repositoryUsers: List[User]
writer: object

def validate(self):
*_, validation_error = validate_model(self.__class__, self.__dict__)
if validation_error:
raise validation_error

def write_users_to_file(self):
for user in self.repositoryUsers:

# Validate that user objects have valid fields before writing
# even if their email has changed.
user.validate()

self.writer.write_to_file(self.repositoryUsers)
```

### Option 3: Usage of a third-part library called "marshmallow"

* Good, because it allows for strict type checks by marking the class attributes
(or Fields as they call them) as `strict`.

* Good, because it provides additional custom types (with their own built-in
validation) like `URL`, `IPv4`, `IPv6`, etc.

* Bad, because it's created with schemas in mind and a heavy focus on
serialization and deserialization. Most of the features are not related
to validation.

* Bad, because it adds one additional dependency - itself.
This was concluded by performing the following steps:
1. Creating a fresh virtual environment with python3.8.
2. Installing all dependencies in `requirements-dev.txt` from `tuf`.
3. Install `marshmallow` with `pip install marshmallow`.

* Bad, because they use their custom types even for types existing in the
standard `typing` module from python 3.6 onwards. This means that integrating
`marshmallow` would make up for a bigger diff compared to `pydantic`.
Additionally, because they define their types there could be problems specific
to their types and conversion from-to standard types as defined in the `typing`
python module.
This was the case when I researched `marshmallow` and had to use the
`marshmallow.fields.DateTime` class instead of the `datetime.datetime` object.


### Option 4: Third-party library called `typeguard` and python descriptors

* Good (*related to requirment 1*), because it allows for strict type checks
because of `typeguard`.

* Good, because requires little code changes to enable strict type checking and
at the same time, it allows you to ignore validation for certain functions by
using a decorator @typeguard.typechecked (not released in the latest version,
but it's added to the master branch on GitHub).

* Good, because enforces assignment validation.

* Good, because it allows us to enforce custom restrictions with one-liners.
For example, we can pass a `predicate` function for our types which could be
whatever we want.
See [Python descriptors](https://docs.python.org/3/howto/descriptor.html#custom-validators)

* Bad (*related to requirement number 7*), there is no way of invoking the
validation functions for each of the fields whenever we want. This is because
we are overriding the `__get__` magic method and no longer can we access the
class instance.
Example:
```python
class Integer:
def __init__(self, minvalue=None, maxvalue=None):
self.minvalue = minvalue
self.maxvalue = maxvalue

def __set_name__(self, owner, name):
self.private_name = "_" + "NO_VALIDATION_" + name

def __get__(self, obj, objtype=None):
return getattr(obj, self.private_name)

def __set__(self, obj, value):
self.validate(value)
setattr(obj, self.private_name, value)

def validate(self, value):
if not isinstance(value, int):
raise TypeError(f"Expected {value!r} to be an int or float")
if self.minvalue is not None and value < self.minvalue:
raise ValueError(
f"Expected {value!r} to be at least {self.minvalue!r}"
)
if self.maxvalue is not None and value > self.maxvalue:
raise ValueError(
f"Expected {value!r} to be no more than {self.maxvalue!r}"
)

class User:
id: Integer(minvalue=0, maxvalue=1000)

def __init__(self, id: int):
self.id = id # Validation will be invoked

def important_operation(self):
# I want to validate the user id here before performing the important operation.
# If use "self.id.validate()" then "self.id" will invoke the Integer __get__()
# method and the result will be int.validate(), but an int doesn't have
# a validate method...
```

* Bad (*related to requirement 4*), because it adds one additional dependency - itself.
This was concluded by performing the following steps:
1. Creating a fresh virtual environment with python3.8.
2. Installing all dependencies in `requirements-dev.txt` from `tuf`.
3. Install `typeguard` with `pip install typeguard`.

* Consideration: the `typeguard` project is maintained by only 1 developer and
the project is located as his personal repository.
We should keep an eye on the project development.

* Consideration: in the example above, when we use `__set__` and `__set_name__`
in the `Integer` class we set additional argument for each `User` instance.
This attribute is named `_NO_VALIDATION_id` and if we decide we can bypass the
validation like this:
```python
u = User(3)
u._NO_VALIDATION_id = -100 # No validation will be invoked.

# u.id will call the "__get__()" method of the Integer class and will return -100.
print(u.id)
```

### Option 5: Third-party library called `typeguard` and `ValidationMixin`

* All pros, cons, and consideration from option 1 are applied here too,
with the exception that we can validate function arguments everywhere because
of `typeguard`.

* All pros, cons, and consideration from option 4 related to `typeguard` apply
here too.

## Links
* [in-toto ValidatorMixin](https://github.com/in-toto/in-toto/blob/74da7a/in_toto/models/common.py#L27-L40)
* [ValidatorMixing usage](https://github.com/in-toto/in-toto/blob/74da7a/in_toto/models/layout.py#L420-L438)
* [Pydantic documentation](https://pydantic-docs.helpmanual.io/)
* [Marshmallow documentation](https://marshmallow.readthedocs.io/)
* [typeguard GitHub page](https://github.com/agronholm/typeguard)
* [Python descriptors](https://docs.python.org/3/howto/descriptor.html#custom-validators)

## Decision Outcome

*TODO: Make and describe the decision*
1 change: 1 addition & 0 deletions docs/adr/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ This log lists the architectural decisions for tuf.
- [ADR-0004](0004-extent-of-OOP-in-metadata-model.md) - Add classes for complex metadata attributes
- [ADR-0005](0005-use-google-python-style-guide.md) - Use Google Python style guide with minimal refinements

TODO: ADD TO INDEX
<!-- adrlogstop -->

For new ADRs, please use [template.md](template.md) as basis.
Expand Down