Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter for scanner configuration in ScanResult storages is too restrictive #3223

Open
oheger-bosch opened this issue Oct 20, 2020 · 13 comments
Assignees
Labels
enhancement Issues that are considered to be enhancements scanner About the scanner tool

Comments

@oheger-bosch
Copy link
Member

The ScannerDetails class has a configuration property of type String that corresponds to the command line options used by the scanner. When looking up scan results from a storage this String is currently compared using an exact match, to find out whether a result from the storage is applicable.

While this may work in a pure local setup, it is too restrictive when dealing with external storages like ClearlyDefined or other services. It is very unlikely that there is an exact match of the command line options in use. But even in a more controlled environment, the current mechanism is problematic: You just have to change the order of parameters, or modify parameters that do not affect the scanner's output (such as a debug flag or changing the parallelism), and a stored result will not be reused.

Therefore, the filtering for the scanner configuration should be improved. The new implementation should, however, be backwards compatible if possible; already existing and populated storages should continue working.

@oheger-bosch oheger-bosch self-assigned this Oct 20, 2020
@sschuberth sschuberth added enhancement Issues that are considered to be enhancements scanner About the scanner tool labels Oct 20, 2020
@oheger-bosch
Copy link
Member Author

Also strongly related to #3247.

@oheger-bosch
Copy link
Member Author

This issue is addressed by a PR, but as reviewers have correctly remarked, the proposed solution becomes pretty complex. When we try to model the various options supported by ScanCode (or other scanners) and different logic how to check their compatibility, an inherent complexity probably cannot be avoided.

Therefore, I was thinking about alternative approaches. The current implementation of ScanCode already distinguishes between command line options that have an impact on the scanner output and others that do not. One possibility could be to extend this mechanism, so that the user is responsible of defining the compatibility of scanner options, but also has full control over it. The idea is that the scanner options in the ORT configuration could support properties to specify which scanner options must be set (or must not be present) to consider a scan result as compatible. In ort.conf, this could roughly look as follows:

options {
  ScanCode: {
    criteria: {
      includeOptions: "copyright,package,strip-root"
      excludeOptions: "shallow"
    }
  }
}

Using this configuration, a scan result would be accepted from a storage if it was produced with the command line options "--copyright --package --strip-root", but not with the option "--shallow". (All other options would be ignored.) Compared to the complexity of the mentioned PR, such checks would be rather simple, but with some labour put into the configuration, the desired effects could be achieved.

@sschuberth, @mnonnenmacher @fviernau, what do you think?

@oheger-bosch
Copy link
Member Author

@sschuberth, @mnonnenmacher, @fviernau: Any comments here?

@sschuberth
Copy link
Member

I like the idea of it being fully user configurable which options break scanner compatibility and which not. I'd only propose a slightly simpler configuration syntax, like

options {
  ScanCode: {
    compatibility: {
      disregard: "--disregard-this-option-in-compatibility-checks"
    }
  }
}

The disregard key could also be called ignore. How about that?

However, as always the devil is in the details. Suppose you'd want to ignore any --timeout option, including --timeout 300 and --timeout 500. How would you do that? We could try to be smart and make disregarding --timeout also match arbitrary strings until we come across a new option, but that's somewhat dangerous. Or we need to make the disregard field a list of regexes, similar to like we do error resolution matching...

@oheger-bosch
Copy link
Member Author

Good idea to specify only the options to disregard/ignore, this is easier.

Regarding the matching of specific values: I have already some code that breaks a scanner command line into key-value pairs. So it would be possible to test against specific option values. The semantic could be: If the ignore expression is a simple string, we interpret this as option name and ignore the corresponding option. If it has the form name=pattern, we match the option value against this pattern.

Having to choose between disregard and ignore, I would go for the latter; I think this is more common, isn't it?

@sschuberth
Copy link
Member

If it has the form name=pattern, we match the option value against this pattern.

Could you give a concrete example? How would the syntax look like for ignoring --timeout 3 and --threshold 20 independently of the actual numbers passed?

Having to choose between disregard and ignore, I would go for the latter; I think this is more common, isn't it?

Probably. I just somehow wanted to distinguish from ScanCode's --ignore option (as ignore: "--ignore" would read a bit funny), but as that's very ScanCode-specific I'm fine with calling the field ignore.

@oheger-bosch
Copy link
Member Author

As the disregard/ignore field is a list, we could simply repeat patterns for specific options. So for your example, the configuration could look like:

ignore: [
  "timeout=3",
  "threshold=20"
]

To ignore arbitrary patterns passed to ScanCode's ignore options, the configuration would be

ignore: [
  "ignore"
]

While thinking about this, this is a quite safe approach, in the sense that unexpected options cause a result to be considered incompatible. To support the creation of such configurations, we probably need to generate some log output why a result from a scan storage could not be reused.

@sschuberth
Copy link
Member

ignore: [
"timeout=3",
"threshold=20"
]

My question was specifically about how this would look like if you don't know (and don't care) what the concrete numbers for timeout and threshold are. Think about the ClearlyDefined case: We'd probably want to accept scan results from ClearlyDefined no matter what their (current) timeout value is, and we'd still want to accept the scan results even if ClearlyDefined changes the timeout value in the future.

I believe I'd prefer something that more explicitly matches the options, like

ignore: [
    { name: "--timeout", hasValue: true },
    { name: "--threshold", hasValue: true },
    { name: "--verbose", hasValue: false }
]

This would ignore the --timeout option, which is known to have a value (separated by either " " or "=" on the CLI), despite its value. Same for the --threshold option. The --verbose option would also be ignored, but it's not supposed to have a value.

Two other extremes to think about would be to make the comparison fully programmable by running a Kotlin script, similar to Evaluator rules, that get the ScannerDetails passed and returns a boolean. Or we simply ditch the compatibility check altogether and regard everything as compatible; if the user does not want this, (s)he would first need to manually remove incompatible scan results from the scan storage.

@oheger-bosch
Copy link
Member Author

Too many options :-(.

I think the last extreme - dropping the compatibility check - requires that the user has full control over the scan storage. This is not the case when integrating external services like ClearlyDefined.

Making the check scriptable would be the most flexible option. But maybe we should then provide a default script implementing a meaningful default behaviour and allow the user to override this in the configuration?

@oheger-bosch
Copy link
Member Author

So we still have not yet decided how to proceed here. My proposal would be to go for the last ignore configuration you suggested and skip the extreme cases for now. This would move us forward, and additional flexibility could be dealt with later if it was actually needed. Would this be okay?

@oheger-bosch
Copy link
Member Author

Just noticed a problem with the configuration of the options to ignore: Currently, ScannerConfiguration just defines a plain Map<String, String> to hold arbitrary options for specific scanners. This map is not able to hold structured properties like lists with pattern to ignore. Also, the Hoplite configuration library cannot handle maps of arbitrary types.

I therefore propose to change this part of the scanner configuration and introduce a new ScannerOptions class that replaces the map with properties. ScannerOptions would still contain a map with properties to allow for a flexible configuration, but in addition can store other type-safe properties common to all scanners, such as the compatibility configuration. (We might see later how this fits with remote scanners.)

@sschuberth
Copy link
Member

I therefore propose to change this part of the scanner configuration and introduce a new ScannerOptions class that replaces the map with properties.

That indeed sounds like a good first / non-breaking change into the right direction to me.

@fviernau
Copy link
Member

@oheger-bosch - FYI: I just proposed a change in the same are - thus there is a bit potential of a slight conflict towards that "--ignore" option, see #3510.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issues that are considered to be enhancements scanner About the scanner tool
Projects
None yet
Development

No branches or pull requests

3 participants