RFC: Revamp "unknown" license detection #1675

pombredanne · 2019-08-08T08:05:04Z

Context and Problem

The whole idea of detecting unknown licenses and reporting these with the real detection is a bit weird and confusing.

We should make this area clearer and cleaner by cleaning up the license keys in use and changing the way "unknown" licenses are reported, named and detected. Since we eventually detect many short mentions and do diffs, we may at times catch too many of these "unknown" licenses, or not enough, and returning them in the same data structure as the detected licenses may be confusing.

I think we can do better.

First let's refine what we mean by unknown: this is a detected text that is highly likely to be a license text or notice, but that cannot be properly matched to a known, named license (one with a proper, not unknown, license key).

When doing a full scrub on a codebase they are important, as they are the things that need a detailed review.

This is a proposal to revamp and clarify the way we detect, process and report these.

There are several things to consider:

improve the data model for the License "YAML" data
use fewer license keys for unknown things
improve the way we deal with multiple "low score" detected licenses in a single file
report unknowns in a separate section of the scan results, not mixed with the main license detection
improve the detection of unknown licenses
follow license references such as "see COPYING for license" to report the detected license in the referenced file, if any.
properly handle "intro texts" that are used generically to introduce license terms

Improve the License data model definition (e.g. the License "YAML" data)

Unknown means for now that this is something matched to a license rule tagged with the "unknown" license key. It would help to clarify which licenses are "special" licenses by having proper attributes. Today we can only identify some "unknown" license based on its key.

Beyond the regular "named" licenses, we have a few "special" license keys that could be made more explicit.

They could be separated in these categories:

"generic" licenses: commercial-license proprietary-license generic-cla other-copyleft other-permissive public-domain public-domain-disclaimer warranty-disclaimer

All of these are used as an alternative to a named license (and sometimes for rarer licenses not worthy of a named entry). These are all generic "catch-all" license keys. Each match can reference a different text or notice: they do not have a self-standing reference license text stored with the .LICENSE file. We could make these more explicit by adding an is_generic license attribute, to mark them as being different from the default "named" licenses.

"unknown" licenses: unknown free-unknown unknown-spdx unknown-license-reference

These are used to depict a detection that is likely about licenses but is NOT matched to a proper rule for a named or generic license.
Instead this is matched to a rule tagged with one of these unknown license keys.

✔️ Also we have unknown-license-reference that are more like generic intro texts about licensing. The .RULE file names have been prefixed with "lead-in".
For instance "Software License agreement" would be such an intro text.

So to handle all of these, I suggest we add these new model attributes:

~~In the License (and also to the License Rule):~~

~~is_generic tells that a license is generic (case 1.)~~
~~is_unknown tells that a license is about some unnamed license. (case 2.)~~

~~In the License Rule:~~

~~is_license_intro to tell that a license detection rule is some license intro text. (case 3.). This would be used in conjunction with the unknown license key.~~

All done:
✔️ implemented with is_generic flag
✔️ implemented with is_unknown flag for licenses and has_unknown property for rules
✔️ This has been implemented with is_license_intro flag
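
To make the three categories concrete, here is a minimal sketch of how the is_generic and is_unknown flags could partition license keys. The License class and the category() helper are hypothetical and only illustrate the idea; they are not the actual ScanCode model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    # Hypothetical, simplified stand-in for the License "YAML" data.
    key: str
    is_generic: bool = False  # catch-all key without a single reference text
    is_unknown: bool = False  # likely a license, but not a named one


def category(lic: License) -> str:
    """Return "unknown", "generic" or "named" based on the flags only."""
    if lic.is_unknown:
        return "unknown"
    if lic.is_generic:
        return "generic"
    return "named"


LICENSES = [
    License("mit"),
    License("other-permissive", is_generic=True),
    License("unknown", is_unknown=True),
    License("unknown-spdx", is_unknown=True),
]

# Group the keys by category, without ever looking at the key string itself.
by_category = {}
for lic in LICENSES:
    by_category.setdefault(category(lic), []).append(lic.key)
print(by_category)
```

The point of the flags is that no code needs to test for a key named "unknown" anymore: the attributes carry that information.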

License keys cleanup and retirement : use fewer license keys for unknown things.

The free-unknown is a weird one that should be retired: we should return either unknown or other-copyleft or other-permissive instead.
unknown-spdx is used to report some unknown license in an SPDX license expression. Since this is only produced by the SPDX expression detection, it is unlikely to be part of other "unknown" detections. It should be tagged with is_unknown.
unknown-license-reference is special if and only if it has referenced_filenames. So it could be folded into unknown too.
The ones that have a referenced_filenames will be dealt with accordingly. Otherwise any rule can be tagged with is_license_reference to the same effect.
We also have a category of unknown license rules, renamed for now with the prefix lead-in_, that are not really depicting unknown licenses but rather are intro or title texts used before a license text or notice. For instance "Software license agreement", "As a special exception" or "is under the following license". These are really special and are of interest if and only if they show up with otherwise unknown license texts. When they are detected just before an actual known, named license text or notice, reporting them as unknown is noisy; instead their match should be folded into the main match they introduce. We should tag these as unknown and is_license_intro.

improve the way we deal with multiple "low score" detected licenses in a single file

We could merge matches to different detected "rules" in the same text region into one match when they point to related licenses.

We could also handle the cases where we have some mentions of a bare GPL (detected as gpl-1.0-plus) together with some GPL version: these could be merged in some cases as one match to the versioned GPL. This is mostly for FSF licenses.
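
As an illustration of the bare GPL case, here is a small sketch that drops a bare gpl-1.0-plus mention when exactly one versioned GPL key is detected in the same region. The fold_bare_gpl() function and its "exactly one version" condition are assumptions made for this example, not an existing ScanCode heuristic.

```python
import re

BARE_GPL = "gpl-1.0-plus"
# Matches keys such as gpl-2.0 or gpl-3.0-plus.
VERSIONED_GPL = re.compile(r"^gpl-\d+\.\d+(-plus)?$")


def fold_bare_gpl(keys):
    """Return keys without the bare GPL mention when it is safe to fold it.

    Assumption: folding is only safe when a single versioned GPL key is
    present; with several versions we cannot tell which one was meant.
    """
    versioned = {k for k in keys if k != BARE_GPL and VERSIONED_GPL.match(k)}
    if BARE_GPL in keys and len(versioned) == 1:
        return [k for k in keys if k != BARE_GPL]
    return list(keys)


print(fold_bare_gpl(["gpl-1.0-plus", "gpl-2.0"]))
print(fold_bare_gpl(["gpl-1.0-plus", "mit"]))
```

A real implementation would work on matches and their positions rather than on bare keys, and would likely need similar handling for LGPL and AGPL.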

Report unknowns separately

We should use a separate section of the scan results to report "unknown" license detections, not mixed with the main license detection for clarity.
This could be a new section such as "unknown_license_references" where we report the matched text and positions, but not a specific matched license rule or key.
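
A rough sketch of what such a separation could look like in the results. Only the "unknown_license_references" section name comes from this proposal; the other field names (licenses, matched_text, start_line, end_line) and the split_results() function are hypothetical.

```python
def split_results(matches):
    """Separate regular detections from unknown ones for reporting."""
    results = {"licenses": [], "unknown_license_references": []}
    for match in matches:
        if match["is_unknown"]:
            # No rule or key is reported here: only the text and its position.
            results["unknown_license_references"].append({
                "matched_text": match["matched_text"],
                "start_line": match["start_line"],
                "end_line": match["end_line"],
            })
        else:
            results["licenses"].append({
                "key": match["key"],
                "start_line": match["start_line"],
                "end_line": match["end_line"],
            })
    return results


matches = [
    {"key": "mit", "is_unknown": False, "matched_text": "MIT License",
     "start_line": 1, "end_line": 1},
    {"key": "unknown", "is_unknown": True,
     "matched_text": "licensed under its own terms",
     "start_line": 5, "end_line": 5},
]
report = split_results(matches)
print(report)
```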

Improve the detection of unknown licenses

Beside or as a replacement to the actual detection of the "unknown" license rules, we should have a new and more efficient way to detect unknown licenses using ngrams. The process would be roughly:

run the regular license detection.
remove from the results and keep aside any match with a coverage below a threshold (and possibly merge these too)
collect all the spans of scanned text that are not matched to any license
run these spans through an automaton index that will contain ngrams from all regular license texts and rules
merge all these matched spans (and possibly also the weak matches) into a single matched span.
Consider also including any "unknown" rule matches.
split into multiple spans where there are large enough gaps in the match.
report these as "unknown_license_references"
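
Two of these steps, collecting the unmatched spans and splitting on gaps, can be sketched with plain token positions. The function names, the inclusive (start, end) span convention and the max_gap default are assumptions for illustration; the ngram lookup itself is left out here.

```python
def unmatched_spans(length, matched):
    """Return inclusive (start, end) token spans not covered by `matched`.

    `length` is the number of tokens in the scanned text and `matched` is a
    list of inclusive (start, end) spans from the regular detection.
    """
    covered = set()
    for start, end in matched:
        covered.update(range(start, end + 1))
    spans, current = [], None
    for pos in range(length):
        if pos in covered:
            if current is not None:
                spans.append((current, pos - 1))
                current = None
        elif current is None:
            current = pos
    if current is not None:
        spans.append((current, length - 1))
    return spans


def split_on_gaps(positions, max_gap=2):
    """Group token positions into spans, splitting on large gaps.

    Two positions stay in the same span when at most `max_gap` unmatched
    tokens sit between them.
    """
    spans = []
    for pos in sorted(positions):
        if spans and pos - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = pos
        else:
            spans.append([pos, pos])
    return [tuple(span) for span in spans]


print(unmatched_spans(10, [(2, 4), (7, 8)]))
print(split_on_gaps([0, 1, 2, 4, 10, 11]))
```

The right gap threshold is a tuning question: too small and one unknown notice is reported as many fragments, too large and unrelated fragments get merged.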

Follow license references to another file

This is for references such as "see COPYING for license", i.e. mentions to look for license details in another file: we should report the license detected in the referenced file, if any.

We should have a way to follow license references using the referenced_filenames attribute and find the detected license in these filenames. And when such a conclusive reference is found positively, we should update the match to use the referenced license(s) that were detected.

This is tracked in #1364

Note that for now, this does NOT include following URLs which would imply having network access.
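
A minimal sketch of the lookup, assuming we already have a mapping of scanned file paths to their detected license expression. The resolve_reference() function and the choice to look only in the same directory as the referencing file are assumptions; the real resolution rules are discussed in #1364.

```python
import posixpath


def resolve_reference(path, referenced_filenames, detections):
    """Return the expression detected in a referenced file, or None.

    `detections` maps a scanned file path to its detected license
    expression (or None when nothing was detected).
    Assumption: referenced files are looked up next to the referencing file.
    """
    directory = posixpath.dirname(path)
    for name in referenced_filenames:
        candidate = posixpath.join(directory, name)
        expression = detections.get(candidate)
        if expression:
            return expression
    return None


detections = {
    "pkg/COPYING": "gpl-2.0",
    "pkg/README": "unknown-license-reference",
    "pkg/src/a.c": "unknown-license-reference",
}
# README says "see COPYING for license" and COPYING sits next to it.
print(resolve_reference("pkg/README", ["COPYING"], detections))
# There is no pkg/src/COPYING, so the reference stays unresolved.
print(resolve_reference("pkg/src/a.c", ["COPYING"], detections))
```

When the lookup returns None, the match would stay an unknown license reference that needs review.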

Properly handle "intro texts" that are used generically to introduce license terms

These are "lead-in"-like intro texts that are detected alone and qualified as "unknown" today. For instance "License agreement" is a rule that would be detected as such.

We should have a way to detect such a text fragment and to discard or merge such a detection if this is immediately tied to an actual "named" license detection. This would avoid many unknown detections.
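
One possible shape for this, as a sketch only: an intro match is folded into the next match when that match is a proper, non-unknown detection that starts right after the intro. The Match class, the fold_intros() function and the max_distance parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified stand-in for a license match.
    start_line: int
    end_line: int
    expression: str
    is_license_intro: bool = False
    is_unknown: bool = False


def fold_intros(matches, max_distance=1):
    """Fold intro matches into the named match that immediately follows.

    Note: this mutates the following match to extend it over the intro.
    """
    ordered = sorted(matches, key=lambda m: m.start_line)
    kept = []
    for i, match in enumerate(ordered):
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        if (
            match.is_license_intro
            and following is not None
            and not following.is_unknown
            and not following.is_license_intro
            and following.start_line - match.end_line <= max_distance
        ):
            # The named match now also covers the intro lines.
            following.start_line = match.start_line
            continue
        kept.append(match)
    return kept


intro = Match(1, 1, "unknown-license-reference", is_license_intro=True, is_unknown=True)
mit = Match(2, 20, "mit")
print(fold_intros([intro, mit]))
```

An intro that stands alone, or that is followed only by another unknown match, is kept and still reported as unknown.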

mjherzog · 2019-08-08T15:01:07Z

This all makes good sense, but I think that we need to look a bit harder at other-copyleft and other-permissive because there seems to be a non-trivial difference between one-off/rare licenses and unknown license text. One approach might be to design a naming convention like other-copyleft-001 etc for cases where we do have a match to meaningful license text, but the license is very rare.

Per SPDX these two are the very same. Signed-off-by: Philippe Ombredanne <[email protected]>

LOV3hurtS · 2019-08-21T23:33:37Z

Awesome

pombredanne · 2019-08-22T08:39:55Z

@LOV3hurtS thank you for chiming in... can you elaborate?

With the introduction of the "is_license_intro" flag, the license that are introductions texts are now tagged correctly with this flag. All have also a license expression of "unknown-license-reference" Signed-off-by: Philippe Ombredanne <[email protected]>

This way it becomes easy to spot them Signed-off-by: Philippe Ombredanne <[email protected]>

Also improve attribute flags documentation Signed-off-by: Philippe Ombredanne <[email protected]>

AyanSinhaMahapatra · 2021-06-08T19:45:08Z

@pombredanne One question regarding the addition of the is_unknown flag,

There are the cases of having just these as license_expressions for which things are simple:

unknown-license-reference -> 403 cases
free-unknown -> 174 cases
unknown (total) -> 649 cases

There are also these license expressions with more than one license key, with one of them being of the unknown type:

mit AND unknown-license-reference
gpl-2.0 AND unknown-license-reference
mit AND unknown
(gpl-2.0 OR bsd-simplified OR cpl-1.0) AND free-unknown
cpal-1.0 OR free-unknown
unknown AND gpl-2.0-plus
apache-2.0 AND free-unknown
gpl-2.0 WITH openssl-exception-gpl-3.0-plus AND unknown
gpl-2.0 AND free-unknown
agpl-3.0 AND unknown
unknown-license-reference AND libpng
gpl-3.0 AND proprietary-license AND unknown
mit AND free-unknown
gpl-1.0-plus AND lgpl-2.0-plus AND free-unknown
bsd-new AND epl-2.0 AND free-unknown
other-permissive AND free-unknown
lgpl-3.0 WITH independent-module-linking-exception AND free-unknown
gpl-2.0-plus WITH font-exception-gpl AND unknown-license-reference
(gpl-2.0-plus OR commercial-license) AND unknown-license-reference
unlicense AND unknown-license-reference
uoi-ncsa AND unknown-license-reference
gpl-2.0 AND unknown
gpl-1.0-plus AND public-domain AND mit AND bsd-new AND free-unknown
lgpl-2.1-plus AND free-unknown
warranty-disclaimer AND unknown-license-reference
(gpl-2.0 OR bsd-new) AND unknown
isc AND free-unknown
bsd-simplified AND unknown
lgpl-3.0-plus WITH independent-module-linking-exception AND free-unknown
unknown-license-reference AND jam
openpub AND free-unknown
public-domain AND unknown-license-reference
mpl-2.0 AND free-unknown
(gpl-2.0 OR agpl-3.0) AND unknown
unknown AND (gpl-2.0-plus OR lgpl-2.0-plus)
lgpl-2.0 AND unknown-license-reference
eupl-1.2 AND unknown
bsd-top AND gpl-2.0-plus AND free-unknown
free-unknown AND other-permissive
gpl-2.0-plus AND unknown
mpl-1.1 AND free-unknown
apache-2.0 OR free-unknown
unknown OR proprietary-license
bsd-simplified AND bsd-new AND unknown-license-reference
agpl-3.0 AND other-copyleft AND unknown
gpl-1.0-plus AND unknown
lgpl-3.0 AND unknown-license-reference
public-domain AND unknown
gpl-3.0 WITH aptana-exception-3.0 AND unknown
gpl-1.0-plus AND other-copyleft AND free-unknown
unknown-license-reference AND generic-trademark
(cpl-1.0 OR bsd-simplified OR gpl-2.0) AND free-unknown
apsl-1.2 AND unknown
(gpl-3.0 OR commercial-license) AND unknown
bsd-new AND unknown
lgpl-2.0-plus AND unknown-license-reference
public-domain OR free-unknown
(bsd-simplified OR unknown-license-reference) AND odbl-1.0
gpl-2.0-plus AND free-unknown
unknown-license-reference AND apache-2.0 AND proprietary-license

An example rule and .yml file with license-expression: (gpl-3.0 OR commercial-license) AND unknown

In this case the unknown is there because the text "Please note that 3rd party libraries are licensed under its own licenses." is present. For these cases too, is_unknown will be set to True, right?

What about the cases where the unknown is there with an OR and not an AND?
Like these : Example .yml file

apache-2.0 OR free-unknown ---> the text "dual licensed under the Apache 2.0 license" is silly and maybe this could be removed?
unknown OR proprietary-license ---> likely a candidate for update/change/removal too
cpal-1.0 OR free-unknown
public-domain OR free-unknown

Can we say these are not unknown maybe?

@akugarg for your project

akugarg · 2021-06-09T11:17:49Z

Also for the cases like
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/mit_and_other-permissive_and_other-copyleft_1.yml
AND
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/gpl-1.0-plus_and_lgpl-2.0-plus_and_bsd-new_and_public-domain.yml
will 'is_generic' be set to true for them?

AyanSinhaMahapatra · 2021-06-11T13:34:58Z

@akugarg btw there's also the unknown-spdx license-expression, in which case also we have to set the is_unknown flag to True.

Here a LicenseSymbolLike object with the unknown-spdx key is created:
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/match_spdx_lid.py#L75

Here if we do not find a match to any known SPDX key, we return unknown-spdx as the license_expression:
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/match_spdx_lid.py#L147

Here we create a synthetic SpdxRule object for an SPDX match; in this case its is_unknown flag should be set to True.
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/match_spdx_lid.py#L89

@pombredanne am I suggesting this in the right way?

AyanSinhaMahapatra · 2021-07-08T13:18:28Z

@akugarg

Example for unknown matching



Rule 1:

Redistribution in source is permitted
Redistribution in binary is prohibited

key: foo

-------------------

Rule 2:

distribution in source is okay
distribution in binary is okay

key: bar

-------------------

Index: ngrams[

	- Redistribution in source
	- in source is
	- source is permitted
	- Redistribution in binary
	- in binary is
	- binary is prohibited
	- distribution in source
	- distribution in binary
	- source is okay
	- binary is okay

]

Text:

Distribution in source is permitted always

Matches

	- Distribution in source
	- in source is
	- source is permitted

Merge Match: 

	- Distribution in source is permitted
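
The example above can be reproduced with a few lines of Python. This is a toy sketch: it uses word trigrams, case-insensitive matching and a plain set as the index, which are assumptions made to match the example, and not the automaton index mentioned earlier in this issue.

```python
def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(words, n=3):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


RULES = {
    "foo": "Redistribution in source is permitted\n"
           "Redistribution in binary is prohibited",
    "bar": "distribution in source is okay\n"
           "distribution in binary is okay",
}

# Build the index from every line of every rule text.
index = set()
for rule_text in RULES.values():
    for line in rule_text.splitlines():
        index.update(ngrams(tokens(line)))

query = "Distribution in source is permitted always"
words = query.split()

# Record every token position covered by an ngram found in the index.
matched = set()
for start, gram in enumerate(ngrams(tokens(query))):
    if gram in index:
        matched.update(range(start, start + 3))

# The matched positions are contiguous here, so a simple join is enough.
merged = " ".join(words[i] for i in sorted(matched))
print(merged)  # Distribution in source is permitted
```

Note that the merged match combines ngrams from both rules, foo and bar, which is why it cannot be reported with a specific license key.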

Note that this is NOT YET returned in the API and outputs Signed-off-by: Philippe Ombredanne <[email protected]>

Detect unknown licenses #1675 Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne · 2022-01-09T17:01:31Z

#2592 now implements an ngram-based approach with this logic:

run license detection proper
if unknown detection is requested, toward the end of license matching, collect the regions of a file that are not matched or weakly matched
match using ngrams with an index built from all the licenses and license rules
reinject weak matches and do a final merge and filter

In addition we also have these changes merged in:

licenses are tagged with is_unknown to tell that this is an unknown license, irrespective of their key
licenses are tagged with is_generic to tell that this is a generic license, irrespective of their key
a rule has a has_unknown flag if any of the licenses in its expression is_unknown
a rule is_license_intro is set for rules that are "lead in introductory" rules and are only introducing the following license notice or text (which is already used in debian copyright processing)
there has not been any serious renaming or retiring of the "unknown" license key for now.
the filtering of several spurious detections has been improved.
there is a new, not-yet-used LicenseDetection class that will be used to group multiple LicenseMatch objects together, such as: license file references, as in "See license in COPYING" and the COPYING match; license intros, as in "this is licensed under..." and the following match; and multiple similar redundant detections in the same file (such as "MIT license" and the full MIT text).
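
To illustrate the has_unknown idea, here is a sketch that derives it from a license expression string. The tokenizer is a deliberately naive regex written for this example, and the UNKNOWN_KEYS set stands in for a lookup of the is_unknown flag on each license; neither is the actual implementation.

```python
import re

# Stand-in for "licenses whose is_unknown flag is set".
UNKNOWN_KEYS = {
    "unknown",
    "free-unknown",
    "unknown-spdx",
    "unknown-license-reference",
}
OPERATORS = {"and", "or", "with"}


def license_keys(expression):
    """Return the license keys of an expression, in order, without operators."""
    found = re.findall(r"[A-Za-z0-9.+_-]+", expression)
    return [token for token in found if token.lower() not in OPERATORS]


def has_unknown(expression, unknown_keys=UNKNOWN_KEYS):
    """Tell if any license key in the expression is an unknown one."""
    return any(key in unknown_keys for key in license_keys(expression))


print(has_unknown("(gpl-3.0 OR commercial-license) AND unknown"))
print(has_unknown("mit AND bsd-new"))
```

This treats AND and OR the same way, so "apache-2.0 OR free-unknown" also counts as having an unknown; whether OR cases should be handled differently is the question raised earlier in this thread.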

pombredanne · 2022-01-13T07:33:58Z

@sameer1046 FYI

pombredanne · 2022-03-02T15:50:17Z

As part of this we should also revisit the meaning of "unknown" licenses.
We have at least these today in the "Unstated License" Category:

free-unknown
generic-exception
generic-export-compliance
generic-tos
generic-trademark
unknown
unknown-license-reference
unknown-spdx
other-permissive
other-copyleft

As discussed with @DennisClark most of these licenses have inadequate notes to explain exactly what they designate.
We definitely need to provide or improve good notes for each license to clarify their meaning and usage, and perhaps a description of how they are detected. We need to consider new license keys to replace them, something like "custom-license-reference" or "custom-license-text" or both to indicate more clearly the situation where a license notice in a software project simply does not clearly point to some exact license on a public list or to a license that did not receive a name in ScanCode, because it really just means that it is not on a list, not that there is no license statement.

What this really means is that in most cases, the detected license is NOT "unknown" but rather it would fall in one of these cases:

a match to a license that has not yet been assigned a proper license key, but that is a proper, unambiguous license statement
an ambiguous, but real license statement that needs review
a mere license clue, which should not be reported as a license match, but as a mere clue
some license-related statement which could either be a mere clue or should be folded with surrounding license notices and statements
