RFC: Revamp "unknown" license detection #1675
Comments
This all makes good sense, but I think that we need to look a bit harder at other-copyleft and other-permissive because there seems to be a non-trivial difference between one-off/rare licenses and unknown license text. One approach might be to design a naming convention like other-copyleft-001, etc., for cases where we do have a match to meaningful license text, but the license is very rare.
Per SPDX these two are the very same. Signed-off-by: Philippe Ombredanne <[email protected]>
Awesome
@LOV3hurtS thank you for chiming in... can you elaborate?
With the introduction of the "is_license_intro" flag, the license that are introductions texts are now tagged correctly with this flag. All have also a license expression of "unknown-license-reference" Signed-off-by: Philippe Ombredanne <[email protected]>
This way it becomes easy to spot them Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Also improve attribute flags documentation Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne One question regarding the addition of the There are the cases of having just these as license_expressions for which things are simple:
There are also these license expressions with more than one license key, with one of them being of the
An example rule and .yml file with Like in this case the unknown was for What about the cases where the unknown is there with an
Can we say these are not @akugarg for your project
@akugarg btw there's also the Here a LicenseSymbolLike with Here, if we do not find a match to any known spdx key, we return as a Here we create a synthetic SpdxRule object for an spdx match, in this case the @pombredanne am I suggesting this in the right way?
Example for unknown matching
Note that this is NOT YET returned in the API and outputs Signed-off-by: Philippe Ombredanne <[email protected]>
Detect unknown licenses #1675 Signed-off-by: Philippe Ombredanne <[email protected]>
#2592 now implements an ngram-based approach with this logic:
In addition we also have these changes merged in:
@sameer1046 FYI
As part of this we should also revisit the meaning of "unknown" licenses.
As discussed with @DennisClark most of these licenses have inadequate notes to explain exactly what they designate. What this really means is that in most cases, the detected license is NOT "unknown" but rather it would fall in one of these cases:
See also #2878
This is completed and merged, closing!
Context and Problem
The whole idea of detecting unknown licenses and reporting these with the real detection is a bit weird and confusing.
We should make this area clearer and cleaner by cleaning up the license keys in use and changing the way "unknown" licenses are reported, named and detected. Since we eventually detect many short mentions and do diffs, we may at times catch too many such "unknown" detections, or not enough, and returning these in the same data structure as the detected licenses may be confusing.
I think we can do better.
First, let's refine what we mean by unknown: this is a detected text that is highly likely to be a license text or notice, but that cannot be properly matched to a known, named license (one with a proper, not unknown, license key). When doing a full scrub on a codebase, these are important as they are the things that need a detailed review.
This is a proposal to revamp and clarify the way we detect, process and report these.
There are several things to consider:
Improve the License data model definition (e.g. the License "YAML" data)
Unknown means for now that this is something matched to a license rule tagged with the "unknown" license key. It would help to clarify which licenses are "special" licenses by having proper attributes. Today we can only identify some "unknown" license based on its key.
Beyond the regular "named" licenses, we have a few "special" license keys that could be made more explicit.
They could be separated in these categories:
commercial-license
proprietary-license
generic-cla
other-copyleft
other-permissive
public-domain
public-domain-disclaimer
warranty-disclaimer
All of these are used as an alternative to a named license (and sometimes for rarer licenses not worthy of a named entry). These are all generic "catch-all" license keys. Each match can reference a different text or notice: they do not have a self-standing reference license text stored with the .LICENSE file. We could make these more explicit by adding an is_generic license attribute, to mark them as being different from the default "named" licenses.
unknown
free-unknown
unknown-spdx
unknown-license-reference
These four keys are used to depict a detection that is likely about licenses but is NOT matched to a proper rule for a named or generic license. Instead, it is matched to a rule tagged with one of these unknown license keys.
unknown-license-reference rules that are more like generic intro texts about licensing. The .RULE file names have been prefixed with "lead-in". For instance, "Software License agreement" would be such an intro text.
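To make the three categories concrete, here is a minimal sketch of how explicit boolean attributes could tell them apart instead of relying on the key name. The attribute names is_generic, is_unknown and is_license_intro are the ones proposed in this issue; the License and Rule dataclasses, the category function and the rule file names are hypothetical simplifications for illustration, not the actual ScanCode model classes.

```python
from dataclasses import dataclass


@dataclass
class License:
    # Hypothetical, simplified stand-in for a license data record.
    key: str
    is_generic: bool = False  # catch-all key without its own reference text
    is_unknown: bool = False  # license-like text with no named license


@dataclass
class Rule:
    # Hypothetical, simplified stand-in for a license detection rule.
    identifier: str
    license: License
    is_license_intro: bool = False  # generic intro text such as "License agreement"


def category(rule: Rule) -> str:
    """Return which kind of detection a rule stands for."""
    if rule.is_license_intro:
        return "intro"
    if rule.license.is_unknown:
        return "unknown"
    if rule.license.is_generic:
        return "generic"
    return "named"


mit = License("mit")
other_permissive = License("other-permissive", is_generic=True)
unknown = License("unknown", is_unknown=True)

# Rule identifiers below are made up for the example.
rules = [
    Rule("mit_1.RULE", mit),
    Rule("other-permissive_1.RULE", other_permissive),
    Rule("unknown_1.RULE", unknown),
    Rule("lead-in_unknown_1.RULE", unknown, is_license_intro=True),
]
categories = [category(r) for r in rules]
```

With flags like these, any reporting or filtering code can branch on the attribute rather than on a hard-coded list of key names.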
So to handle all of these, I suggest we could add these new model attributes:
In the License (and also in the License Rule):
is_generic tells that a license is generic (case 1).
is_unknown tells that a license is about some unnamed license (case 2).
In the License Rule:
is_license_intro tells that a license detection rule is some license intro text (case 3). This would be used in conjunction with the unknown license key.
All done:
✔️ implemented with the is_generic flag
✔️ implemented with the is_unknown flag for licenses and the has_unknown property for rules
✔️ implemented with the is_license_intro flag
License keys cleanup and retirement: use fewer license keys for unknown things.
The free-unknown key is a weird one that should be retired: we should return either unknown, other-copyleft or other-permissive instead.
unknown-spdx is used to report some unknown license in an SPDX license expression. Since this is only produced by the SPDX expression detection, it is unlikely to be part of other "unknown" detections. It should be tagged with is_unknown.
unknown-license-reference is special if and only if it has a referenced_filenames attribute, so it could be folded into unknown too. The ones that have a referenced_filenames will be dealt with accordingly. Otherwise any rule can be tagged with is_license_reference to the same effect.
We also have a category of unknown license rules, renamed for now with the prefix lead-in_, that do not really depict unknown licenses but rather are intro or title texts used before a license text or notice. For instance "Software license agreement" or "As a special exception" or "is under the following license", etc. These are really special and are of interest if and only if they show up with otherwise unknown license texts. When they are detected just before an actual known, named license text or notice, reporting them as unknown is noisy; instead their match should be folded into the main match they introduce. We should tag these as unknown and is_license_intro.
Improve the way we deal with multiple "low score" detected licenses in a single file
We could merge matches to different detected "rules" in the same text region into one match when they point to related licenses.
We could also handle the cases where we have some mentions of bare GPL (detected as gpl-1.0-plus) together with some GPL version: these could be merged in some cases as one match to the versioned GPL. This is mostly for FSF licenses.
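As a rough illustration of the bare-GPL case, the sketch below folds a gpl-1.0-plus match into a nearby versioned GPL match. It assumes a hypothetical Match record holding only a key and a line span, treats any other key starting with "gpl-" as a versioned GPL, and uses an arbitrary max_gap of two lines; none of these choices come from the actual implementation.

```python
from dataclasses import dataclass, replace

BARE_GPL = "gpl-1.0-plus"  # key reported for an unversioned "GPL" mention


@dataclass
class Match:
    # Hypothetical minimal match record: license key and matched line span.
    key: str
    start_line: int
    end_line: int


def is_versioned_gpl(match):
    # Assumption: any other "gpl-" key designates a specific GPL version.
    return match.key.startswith("gpl-") and match.key != BARE_GPL


def merge_bare_gpl(matches, max_gap=2):
    """Fold bare GPL matches into a versioned GPL match at most max_gap lines away."""
    # Work on copies so the caller's matches are left untouched.
    kept = [replace(m) for m in matches if m.key != BARE_GPL]
    versioned = [m for m in kept if is_versioned_gpl(m)]
    for bare in (m for m in matches if m.key == BARE_GPL):
        near = next(
            (
                v for v in versioned
                if bare.start_line <= v.end_line + max_gap
                and v.start_line <= bare.end_line + max_gap
            ),
            None,
        )
        if near is None:
            # No versioned GPL nearby: keep the bare mention as its own match.
            kept.append(replace(bare))
        else:
            # Extend the versioned match so that it covers the bare mention.
            near.start_line = min(near.start_line, bare.start_line)
            near.end_line = max(near.end_line, bare.end_line)
    return sorted(kept, key=lambda m: m.start_line)


matches = [
    Match("gpl-1.0-plus", 1, 1),
    Match("gpl-2.0", 3, 10),
    Match("mit", 20, 25),
    Match("gpl-1.0-plus", 40, 40),
]
merged = [(m.key, m.start_line, m.end_line) for m in merge_bare_gpl(matches)]
```

The first bare mention is absorbed by the gpl-2.0 match next to it, while the one on line 40 stays as is because no versioned GPL is close enough.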
Report unknowns separately
We should use a separate section of the scan results to report "unknown" license detections, not mixed with the main license detection for clarity.
This could be a new section named "unknown_license_references" where we report the matched text and positions, but where we cannot report a specific matched license rule or key.
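A minimal sketch of what such a split could look like follows. The "unknown_license_references" section name is the one suggested above and the unknown keys are those listed in this issue; the "licenses" section name and the detection fields (key, matched_text, start_line, end_line) are assumptions for the example, not the actual scan output format.

```python
# The unknown-like keys listed in this issue.
UNKNOWN_KEYS = {
    "unknown",
    "free-unknown",
    "unknown-spdx",
    "unknown-license-reference",
}


def split_detections(detections):
    """Report unknown detections in their own section, apart from regular ones."""
    results = {"licenses": [], "unknown_license_references": []}
    for detection in detections:
        if detection["key"] in UNKNOWN_KEYS:
            # Only the matched text and its position are reported:
            # there is no meaningful license key or rule to show.
            results["unknown_license_references"].append({
                "matched_text": detection["matched_text"],
                "start_line": detection["start_line"],
                "end_line": detection["end_line"],
            })
        else:
            results["licenses"].append(detection)
    return results


detections = [
    {"key": "mit", "matched_text": "MIT License", "start_line": 1, "end_line": 1},
    {"key": "unknown", "matched_text": "licensed under our terms", "start_line": 5, "end_line": 5},
]
results = split_detections(detections)
```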
Improve the detection of unknown licenses
Besides, or as a replacement for, the existing detection of the "unknown" license rules, we should have a new and more efficient way to detect unknown licenses using ngrams. The process would be roughly:
Consider also including any "unknown" rule matches.
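The individual steps of the process are not shown here, so the following is only one plausible reading of "using ngrams", not the approach that was implemented: index word ngrams from known license texts, then score an otherwise unmatched text by the share of its ngrams found in that index. The ngram length of 3, the whitespace tokenization and all function names are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return the list of n-token tuples found in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))


def build_index(license_texts, n=3):
    """Collect every word ngram seen in a set of known license texts."""
    index = set()
    for text in license_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def unknown_license_score(text, index, n=3):
    """Return the fraction of the text's ngrams that also occur in the index."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in index)
    return hits / len(grams)


index = build_index([
    "permission is hereby granted free of charge to any person obtaining a copy",
])
license_like = unknown_license_score("Permission is hereby granted to any person", index)
unrelated = unknown_license_score("the quick brown fox jumps", index)
```

A text region that scores above some threshold, and that is not already covered by a regular match, could then be reported as an unknown license. The threshold would need tuning to avoid catching too many or too few such regions.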
Follow license references to another file
This is for references such as "see COPYING for license", e.g. mentions to look for license details in another file: we should report the license detected in the referenced file, if any.
We should have a way to follow license references using the referenced_filenames attribute and find the detected license in these filenames. And when such a conclusive reference is found positively, we should update the match to use the referenced license(s) that were detected.
This is tracked in #1364
Note that for now, this does NOT include following URLs which would imply having network access.
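Following a reference could be sketched roughly as below. The referenced_filenames attribute is the one named above; everything else is an assumption for the example: the match is a plain dict, detections are given as a mapping of file path to detected license keys, and the referenced file is only looked up in the same directory as the referring file.

```python
import posixpath


def follow_reference(match, path, detections_by_path):
    """
    Return the license keys detected in a file referenced by the match,
    or the match's own key when no referenced file has a detection.
    """
    directory = posixpath.dirname(path)
    for name in match.get("referenced_filenames", []):
        # Assumption: the referenced file sits next to the referring file.
        target = posixpath.join(directory, name)
        keys = detections_by_path.get(target)
        if keys:
            return list(keys)
    return [match["key"]]


detections_by_path = {"src/COPYING": ["gpl-2.0"]}
match = {
    "key": "unknown-license-reference",
    "referenced_filenames": ["COPYING"],
}
resolved = follow_reference(match, "src/main.c", detections_by_path)
unresolved = follow_reference(match, "lib/util.c", detections_by_path)
```

A real implementation would likely also need to search parent directories or the codebase root, and decide what to do when the referenced file itself only yields unknown detections.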
Properly handle "intro texts" that are used generically to introduce license terms
These are "lead-in"-like intro texts that are detected alone and qualified as "unknown" today. For instance "License agreement" is a rule that would be detected as such.
We should have a way to detect such a text fragment and to discard or merge such a detection if this is immediately tied to an actual "named" license detection. This would avoid many unknown detections.
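One way to picture this is the sketch below, which folds an intro match into the named match that immediately follows it. It assumes a hypothetical Match record with an is_license_intro flag, treats the unknown-like keys listed in this issue as "not named", and considers "immediately tied" to mean at most max_gap lines apart; these are illustrative choices, not the implemented logic.

```python
from dataclasses import dataclass, replace

# The unknown-like keys listed in this issue; anything else counts as named here.
UNKNOWN_KEYS = {
    "unknown",
    "free-unknown",
    "unknown-spdx",
    "unknown-license-reference",
}


@dataclass
class Match:
    # Hypothetical minimal match record.
    key: str
    start_line: int
    end_line: int
    is_license_intro: bool = False


def fold_intros(matches, max_gap=1):
    """Merge each intro match into the named match that directly follows it."""
    ordered = sorted(matches, key=lambda m: m.start_line)
    folded = []
    i = 0
    while i < len(ordered):
        current = ordered[i]
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        if (
            current.is_license_intro
            and following is not None
            and following.key not in UNKNOWN_KEYS
            and following.start_line - current.end_line <= max_gap
        ):
            # The named match now also covers the intro lines.
            folded.append(replace(following, start_line=current.start_line))
            i += 2
        else:
            folded.append(current)
            i += 1
    return folded


matches = [
    Match("unknown", 1, 1, is_license_intro=True),
    Match("mit", 2, 20),
    Match("unknown", 30, 30, is_license_intro=True),
]
folded = fold_intros(matches)
```

The intro on line 1 disappears into the mit match, while the isolated intro on line 30 is kept so that it can still be reviewed as an unknown.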