computable open-source version information #87
Comments
+1 for GIT commit IDs as they help locate the vulnerable instances of code more precisely than versions. Software changes represent a tree (directed acyclic graph) structure. Each commit results in new software - a node in this tree. Each fork results in a new branch. For any given node and any given vulnerability:
It is possible that we have two more colors (but we can ignore them for now for simplicity)
The problem we have: given a three-colored tree, we need to encode/serialize the graph so that it captures this information accurately, with as little ambiguity as possible, and makes it easy to determine whether any given version is affected. For example:
Let's say 3.0 was branched off sometime before 2.10 was released, and 4.0 was branched off before 3.2 was released. If the JSON was like:
Though it is easy to compute that 4.0 through 4.2 are vulnerable, how do you determine the vulnerability status of 3.0 through 3.3 and 2.8 through 2.12 (except 2.10)?
Extending on that above example, this can be expressed like so: Assuming that:
"affects": {
  "ranges": [
    {"type": "ECOSYSTEM", "introduced": "2.9", "fixed": "2.10"},
    {"type": "ECOSYSTEM", "introduced": "3.0", "fixed": "3.2"},
    {"type": "ECOSYSTEM", "introduced": "4.0", "fixed": "4.3"}
  ]
}
These conditions are evaluated with OR. A version is affected if it falls into any of those ranges. Using this we should be able to describe any set of ranges unambiguously. Describing ranges this way also makes them easier for human users to understand if they want to know which versions they can upgrade to when they're impacted.
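The OR evaluation described above can be sketched in Python. This is a minimal sketch, assuming versions are plain dotted integers that compare numerically and that each range is half-open (introduced is included, fixed is not); `parse` and `is_affected` are hypothetical helper names, not part of any schema.

```python
def parse(version):
    """Turn a dotted numeric version such as "2.10" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

# Ranges copied from the example above; each is treated as [introduced, fixed).
RANGES = [
    {"type": "ECOSYSTEM", "introduced": "2.9", "fixed": "2.10"},
    {"type": "ECOSYSTEM", "introduced": "3.0", "fixed": "3.2"},
    {"type": "ECOSYSTEM", "introduced": "4.0", "fixed": "4.3"},
]

def is_affected(version, ranges=RANGES):
    """A version is affected if it falls into any one of the ranges (OR)."""
    v = parse(version)
    return any(parse(r["introduced"]) <= v < parse(r["fixed"]) for r in ranges)
```

Comparing tuples rather than strings matters here: as strings, "2.10" would sort before "2.9".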
Proposal Replace
Rationale The discussion at the quality working group meeting brought up two important points:
At the meeting, we spent a little while trying to figure out how to get all that into a single version object. Afterward, based on additional thought and discussion, @ochang and I propose that it would make sense to have two separate lists, with different uses and consumers.
These two separate lists seem to separate out two distinct use cases nicely, making it possible to serve both well with separate mechanisms, where before it seemed impossible to serve them well with a single mechanism. At the working group meeting it sounded like there was consensus to move [Edit: Streamlined the two objects a bit.] |
I noted above that I moved platforms out, as discussed, and also references, since the same rationales seemed to apply. I think maybe we should move repo out as well. The code will come from a single repo that will not vary from version to version. So repo could go into the outer product object too. |
testedVersions seems to be ok for making "not affected" assertions per single version. How do we encode a range of not affected? Consider an example: a vuln is introduced in 2.12 and fixed in 2.14, but for some reason (like a mistake in resolving code conflicts) 2.16 and 2.17 are vulnerable again and it then gets fixed in 2.18. (Such things are rare but do happen.) .. 1.0 → 1.1 → 1.2 → 1.3 ... ( *= affected)
How do we affirmatively say 2.14 and 2.15 are not affected as a range? Let's say 1.x was an unmaintained branch that was never evaluated. Since the bug was introduced in 2.12, the CNA wants to assert 1.x is unlikely to be affected. Instead of testedVersions and unspecified, can we do this with an optional rangeAffected ['affected' (default), 'unaffected', 'unspecified', 'likely', 'unlikely']?
(a consumer can consider likely to be same as affected for vulnerability management) Another suggestion is to add a new value for range : "patch" - for products that use patching (like CVE-2017-4905 )
+1 for a rangeAffected type quantifier. This would allow the schema to simplify affectedVersions and testedVersions into a common versions property. This ensures parity in affected and unaffected version expression without duplication of the field. This change would ultimately open up the version property for more additions in subsequent minor versions if desired, such as likely and unlikely, without strictly breaking compatibility. |
Hi @rsc . Thanks for providing that proposal! A few things that come to mind if we were to adopt this, from a naive perspective (on purpose): 1.) How would I specify a range if the vuln isn't fixed yet and/or plans to fix are not known? Currently, I'd be required to provide a 2.) The 3.) I understand the use-case for I guess my only question is (and perhaps it's just a question of naming): Is it semantically sound to separate Put another way, if I'm a security engineer at a vendor, and I'm assigning a CVE for package Thanks again for doing this, and I'd re-iterate the importance of documenting the implications you mentioned such as:
etc... Some areas of the schema are widely up to user interpretation for usage, and others it seems beneficial for the community to have some conformity on, so just want to ensure we make those areas well known, as this data is only useful when interpreted properly. Lastly, I think you meant to tag @oliverchang ! :) |
@cganas thanks for the feedback.
A common versions property has the problem of not having a clear meaning for versions not explicitly listed. The two different use cases have two different natural semantics:
Merging these different natural semantics into a single field makes the meaning of unlisted contradictory and unclear. It seems like a significant step forward in clarity to separate the two uses. |
@tcullum-rh thanks for the feedback
Thanks for pointing out the username snafu. Sorry @oliverchang! |
@chandanbn thanks for the feedback.
I think it would be fine to encode ranges there, although a researcher without access to the source code repo may have difficulty making such broad assertions. We could do it the same way as in the affectedVersions list.
This runs back into the issue I was trying to solve with the split, which I mentioned in #87 (comment) as well. Specifically, once there is an explicit status that has the same meaning as not listing a version at all, then it becomes unclear whether you are supposed to list things explicitly or not. Consider:
vs
What does each say about 1.4.5? The idea was that the affectedVersions line says (by not listing it) that 1.4.5 is unaffected. If there is a single field, then unlisted can have only one meaning. If unlisted means unaffected, then the security researcher has to write something like:
when all they really want to say is "there's a vulnerability in 1.2.3". If unlisted means status unknown, then the vendor issuing instructions to users needs to write
when all they really want to say is "only 1.2.3 is affected". It seems like inevitably people are going to write the 1-line version when they "should" be writing the 3-line versions. The two different fields allow two different defaults, which should make the authoring of these more natural and less prone to error, as well as clearer in meaning. |
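A minimal sketch of the two defaults, assuming affectedVersions is a plain list of version strings and testedVersions maps a version string to its tested status. Both shapes, and the function names, are assumptions made for illustration only.

```python
def status_from_affected_versions(affected_versions, version):
    """Vendor-style list: anything not listed is taken to be unaffected."""
    return "affected" if version in affected_versions else "unaffected"

def status_from_tested_versions(tested_versions, version):
    """Researcher-style list: anything not listed has unknown status."""
    return tested_versions.get(version, "unknown")

# The same one-line record about 1.2.3, read under each default:
print(status_from_affected_versions(["1.2.3"], "1.4.5"))          # unaffected
print(status_from_tested_versions({"1.2.3": "affected"}, "1.4.5"))  # unknown
```

With separate fields, the short record carries the meaning its author most likely intended in both cases.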
IMHO we are solving two problems here:
When software versioning is linear, listing the affected range is sufficient. A tool should interpret versions outside of the range as 'unaffected'. This proposal is perfectly adequate and intuitive to use. The difficulty comes when software has multiple concurrently maintained branches (e.g., Linux, OpenSSL). Ranges that span multiple branches may not make sense. Often the CVE assigner makes no statements about older branches; they may not be listed in a CVE but are likely affected. Without this additional context (like EOL), a tool can misreport an older version as unaffected. That is dangerous because people may have a vulnerability they should care about, but tools may fail to warn them. Take https://www.linuxkernelcves.com/cves/CVE-2021-3655
Since 4.15.1 isn't listed there, should a tool report it as unaffected? My suggestion to solve the info capture problem:
To interpret the records:
At the minimum when there is only one range of affected versions, this is sufficient:
When there are branches with multiple ranges, this should be sufficient:
A few optional entries reinforce the facts and would help tooling make accurate determinations.
How about:
My concern with We may want something like this instead to describe a "versionGroup" instead of just a "string".
This does make the entries a bit difficult to read as a human if they're inline (because they are two entries in each), so perhaps it could be indirect by adding a new field to define versionGroups, and have the individual ranges reference that (as per your examples).
On interpreting these entries: I think different consumers will want some flexibility depending on risk / noise appetite, as it's ultimately up to the consumer how to deal with incomplete data. For example, they could assume (or know) the data is high quality/complete and ignore versionGroup altogether, and assuming anything that's unlisted is strictly unaffected (rather than unspecified / unknown). My understanding is grouping ranges by Using this as an example again:
Testing Testing Testing Ignoring Does my understanding seem correct? In any case, this doesn't change the meaning of {version, before} within a versionGroup -- because a version that doesn't match any (non-unspecified) ranges within a group still unambiguously means "unaffected". So I don't know if it answers whether we need both |
Chandan proposes " { versionGroup: 4.14, start: 4.14.0, before: 4.14.240 } " and this matches my experience handling vulnerability metadata for OpenSSL and various Apache projects (where they are not semver). For OpenSSL we combined having a 'fixed version' (for a given major version) along with listing all the known affected versions individually: https://www.openssl.org/news/vulnerabilities.xml
which would become " { versionGroup: 1.1.1, start: 1.1.1d, before: 1.1.1g } " or
which would become " { versionGroup: 1.1.1, start: 1.1.1, before: 1.1.1e } , { versionGroup: 1.0.2, start: 1.0.2, before: 1.0.2u } , " Problem 1: quite often the OSS project doesn't have resources to make sure we know "earliest affected version" (for example it might be too hard to determine what old things are affected particularly if things got refactored). So does the lack of 1.0.2 in that first example mean it's not vulnerable (which it does) or that we no longer look at how 1.0.2 is affected? Problem 2: So if there is an old EOL branch it's quite likely the OSS project won't even look if that one was vulnerable. So how about the OpenSSL 0.9.8 version? As the upstream we don't tell you. But other consumers of OpenSSL who patched it after upstream stopped (like long life distro branches, Red Hat etc), probably did that work to figure out all the affected EOL versions too. Second example which is similar, before I switched ASF httpd to JSON 4.0.... view-source:https://web.archive.org/web/20200416103646/http://httpd.apache.org/security/vulnerabilities-httpd.xml
So that would become " { versionGroup: 2.2, start: 2.2.0, before: 2.2.34 } , { versionGroup: 2.4, start: 2.4.1, before: 2.4.27 } , " But for ASF when we hadn't verified but it looked plausible....
(Although for the JSON format I just lazy converted those into 'affects') (We also had the occasional "won't fix" where "2.2.* is affected, we didn't fix it in 2.2" and the occasional "2.2.* is affected, it's fixed by an available patch/svn head, but not in any released version") Problem 3: Distro versions will vary. You could normally just say this is out of scope, but it's likely most of the users of say OpenSSL will be using a distro packaged version. And they backport security fixes. It's why at Red Hat we introduced OVAL for all our errata so you could map a given Red Hat RPM version of (Apache HTTP Server, OpenSSL, anything) to CVE. |
As you said, if the data set is complete, we don't need versionGroup. A tool can easily say anything unlisted is unaffected. When the data is incomplete (and it will often be), telling consumers/tools to assume the unlisted is unaffected is dangerous. Take CVE-2021-33909 for example: whoever requested the CVE at the time of assignment may have said it affected the Linux kernel from 3.16 to before 5.13.4. Now that vulnerability seems to have been fixed in each of the actively maintained Linux kernel branches, each fixed with a different commit ID, e.g., 4.14 --> before: 3c07d1335d17ae0411101024de438dbc3734e992. The entry in OSV seems to have picked only one affected range, with a fix commit ID for just one branch, 4.14. So the list of versions listed as affected is not telling the whole truth. For example, it does not list 5.13.3 as affected. If one were to take anything not listed as unaffected, then a tool consuming that data would wrongly (and dangerously) say 5.13.3 is unaffected, which is not true here. I believe we all agree:
Given the above:
Thanks for flagging this example! This was actually an intentional decision by the providers of this data to track different branches in different vulnerability IDs. For example, for the 5.13 branch, this is tracked by https://osv.dev/vulnerability/UVI-2021-1001182. There are other variations for different branches, and with open source we have the ability to be precise/complete with tooling to detect cherry picks across branches etc. But yes, I understand the concern with incomplete data in general!
I don't believe semver (or most versioning) schemes enforce any conventions around branch versioning. If we provide clear rules on how to match a version to a group by saying it's a string prefix, (i.e.
What you proposed with versionGroups seems like it should address most of these, but I think it adds a fair bit of complexity and edge cases for processors to handle. Perhaps another flatter alternative, and one that tries to make the two cases (complete vs incomplete data) more explicit would be:
Semantics
When a version is not included in the list of
When @chandanbn you also had "undefined, likely-affected, unlikely-affected" in your
An algorithm to interpret these results
An algorithm can give four possible results about an input version: "affected", "unaffected", "likely-affected", "likely-unaffected". If If Otherwise, it's "unspecified". If the version is unspecified at this point, then tooling can interpret it like so:
@rsc @chandanbn what do you think? I think if we do it this way, we can also stick with a single |
@oliverchang I like an indicator of completeness (versionsInfo.complete). versionsInfo.knownVersionPrefixes seems like an aggregation of versionGroups. Not sure if we are achieving anything by separating them out to a different field. Having some guidance on how to record a versionGroup name should also help tooling. Prefix matching can be tough unless there is an odd looking period at the end (
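To illustrate why prefix matching needs care, here is a small sketch. The function names and sample versions are made up for illustration. A bare string prefix such as "4.1" also matches "4.14.240", whereas comparing whole dot-separated components, or using a prefix that ends in a period, does not.

```python
def in_group_naive(version, group):
    """Plain string-prefix test; prone to false positives."""
    return version.startswith(group)

def in_group_component(version, group):
    """Match only on whole dot-separated components."""
    version_parts = version.split(".")
    group_parts = group.split(".")
    return version_parts[:len(group_parts)] == group_parts

print(in_group_naive("4.14.240", "4.1"))       # True  (false positive)
print(in_group_naive("4.14.240", "4.1."))      # False (trailing period helps)
print(in_group_component("4.14.240", "4.1"))   # False
print(in_group_component("4.14.240", "4.14"))  # True
```

Component-wise matching avoids asking record authors to write the odd-looking trailing period, at the cost of assuming a dot-separated numbering scheme.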
I think it simplifies the evaluation algorithm and prevents some edge cases when dealing with open ranges within a group. e.g.
The interpretation here would be, everything in 4.14 and 4.15 is affected. In the case this describes an incomplete set of versions, if we have "4.16.1". It should be "unlikely-unaffected" because it's newer than all versions, but there are no actual versions to compare it to in the two ranges (they're both "*"). There would have to be a way to compare "4.16.1" to an actual group ("4.15"), which seems difficult to do in a generalisable way. It also adds complexity to evaluating these rules even if this describes a complete set of versions.
Sure, but I think since versionGroup/Prefix is essential to determining if a version is affected, it needs to be unambiguously computable by tooling. I think we will need either prefix (or pattern matching/regex) for that. Re patching, perhaps another way would be to just have:
? That way, versionGroup/Prefix can have consistent automatable rules. |
@chandanbn thanks for the example of the Linux kernel vulnerability. @oliverchang and I spoke for a while and didn't come up with an obvious win yet. |
This issue is about making version information computable, meaning that there is a clear algorithm IsVersionAffected that takes as input a CVE record and a specific version and answers the question “is this version affected by this CVE?” There are two concerns: (1) defining something precise enough for an algorithm to implement, and (2) defining something clear enough that people writing these records - and also the people implementing the algorithm - get it right. There are many, many ways to do (1) but relatively fewer ways to do (2). We already have the problem of needing to define specific version types to make even a less-than comparison work. A versionGroup adds another kind of definition on top of that. Also, version groups assume a particular development model that may or may not hold. For example if v4 and v5 are being developed independently, then you might want to say that it is fixed in v4.19.2 onward within v4 (including v4.20 but not including v5) and then separately also fixed in v5 starting at v5.13.4. It seems like it would be better to have fewer concepts if we can, which is to say leave versionGroup out if we can. I think we should separate out point-wise assertions from ranges, because pointwise assertions don't require understanding the relative ordering of versions. Suppose we did this:
This would replace both the affectedVersions and testedVersions in my previous attempt. If a version appears explicitly in the version list, then the answer is the given status. That's the easy part. Otherwise, we consult the ranges. Each range specifies the version type (semver, git, linux, etc) and an optional initial status and then a "timeline" ("versionline"?) of where the status changes. For the Linux kernel bug we could use:
This effectively encodes this picture of the version timeline:
Normally you'd have only one versionRange for a given type. The algorithm is to find the versionRange for the type of version you are holding and then do:
This seems pretty clear for both readers and programmers. I think this encodes the ranges clearly and without the duplication that's needed for a list of [start,before) spans (where each one's before is usually the next one's start). It also explicitly allows status "unknown" (and makes that the default), and we could add status "likely" or "probable" if necessary. Thoughts? |
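The lookup described above (explicit versions first, then the range timeline) might be sketched as follows. This is only an illustration under stated assumptions: the field names `versions`, `ranges`, `initial`, `timeline` and `at`, the helper names, and the sample versions are all hypothetical, and versions are assumed to be dotted integers.

```python
def parse(version):
    """Parse a dotted numeric version like "1.4.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def status_in_range(version_range, version):
    """Return the status set by the latest timeline event at or before `version`."""
    v = parse(version)
    status = version_range.get("initial", "unknown")  # "unknown" is the default
    latest = None
    for event in version_range["timeline"]:
        at = parse(event["at"])
        # Single pass, no sorting needed: keep the event with the largest
        # version <= v; on ties the later list entry wins.
        if at <= v and (latest is None or at >= latest):
            latest, status = at, event["status"]
    return status

def version_status(record, version, version_type="semver"):
    # Point-wise assertions win and need no ordering of versions.
    if version in record.get("versions", {}):
        return record["versions"][version]
    for version_range in record.get("ranges", []):
        if version_range["type"] == version_type:
            return status_in_range(version_range, version)
    return "unknown"

# Made-up record, for illustration only.
RECORD = {
    "versions": {"1.3.5": "unaffected"},
    "ranges": [{
        "type": "semver",
        "timeline": [
            {"at": "1.2.0", "status": "affected"},
            {"at": "1.4.0", "status": "unaffected"},
            {"at": "2.0.0", "status": "affected"},
            {"at": "2.3.1", "status": "unaffected"},
        ],
    }],
}
```

Here `version_status(RECORD, "1.0.0")` falls before every event and so keeps the default "unknown", while "1.3.5" is answered by the explicit list without consulting the timeline.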
@rsc Wouldn't this be essentially restricting the use of existing versionAffected to '>=', '!>='?
Isn't the comparison here still the version-tree (directed acyclic graph) based comparison? For git, one must query the SCM to find whether one commit hash is before or after another commit hash. Since we capture the git repo URL, I feel this is computable. For semvers or anything else, I see a few requirements:
BTW, for the Linux kernel example above only the seven fixed branches seem to be tracked. The sum total of Affected versions (aggregated from those 7 ids in OSV) would miss any version from an unmaintained Linux kernel branch (such as 5.12.10). However using the suggested record format and the algorithm querying the SCM (git repo) on git commit ids, one would in theory correctly identify 5.12.10 as affected. |
I suppose it's restricting the use to purely a sequence of '>=', with the rule that later entries override earlier ones.
Yes, the comparison has to be defined by the 'type' entry in the range object.
Agreed.
Agreed. And that really is a concern, but we could potentially define that in the semver ordering you can write 4.20 (no third number) to mean anything starting with 4.20, including prereleases.
Yes, I agree with that. I don't think the 7 different IDs are a good approach. It actually makes it almost impossible to say what is and is not affected. @oliverchang is going to talk to the UVI team about why they chose that approach. We should strive for a single ID in CVE.
Yes, and one of the things we hope OSV will be able to contribute to the CVE ecosystem once data is in this format is suggesting updates where the git commits indicate that the numeric version ranges can be made more precise. |
Regarding "sorted on start versions (easy)": I hope that CVE records will be written with sorted lists anyway, perhaps with automation to keep them sorted, but I agree that clients should be expected to sort too. (Technically speaking it is not necessary for the client to sort, only to find the status line with the largest version <= the version being checked. That's O(n) instead of O(n log n). But I think it is fine to say that clients should behave as if they sorted the list and leave not sorting as an optimization.) Most versioning numbering systems have a clear linear ordering: v1.2.3 before v1.2.4 before v1.3.0 before v2.0.0. For a Git commit graph, all we can do is sort by topological order (parents before children). That's still easy, it's just important to recognize it as not quite normal sorting. The algorithm and the data format still make sense for this kind of directed acyclic graph. For example the Git commit ranges for CVE-2021-33909 would be written:
This turns out to be a clear improvement over the original ranges, because you don't have to say the commit that introduced the bug 7 times. |
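For a commit graph, "before" means "is an ancestor of". The sketch below uses a toy in-memory parent map with made-up commit names rather than a real repository, and it assumes one simple rule that is not spelled out in the thread: a commit is affected if the introducing commit is among its ancestors and no fixing commit is. For a real repository, `git merge-base --is-ancestor` answers the same ancestry question.

```python
# Toy commit graph: commit -> list of parents (hypothetical commit names).
PARENTS = {
    "a": [],
    "b": ["a"],    # introduces the bug
    "c": ["b"],
    "d": ["c"],    # fix on the main line
    "m1": ["b"],   # maintenance branch forked after the bug landed
    "m2": ["m1"],  # backported fix on that branch
    "old": ["a"],  # branch forked before the bug was introduced
}

def is_ancestor(ancestor, commit, parents=PARENTS):
    """True if `ancestor` equals `commit` or is reachable via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents[current])
    return False

def commit_status(commit, introduced, fixed):
    """One introducing commit, any number of fixing commits."""
    if not is_ancestor(introduced, commit):
        return "unaffected"
    if any(is_ancestor(fix, commit) for fix in fixed):
        return "unaffected"
    return "affected"
```

The introducing commit is stated once, and each maintained branch contributes only its own fix commit, e.g. `commit_status("m1", "b", ["d", "m2"])`.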
Maybe the best approach is to have multiple options for expressing version information, depending (in part) on whether the product has a support policy (explicit or implied). The type of information submitted to the CVE Program tends to have a bifurcation depending on whether a support policy exists, even when the existence of a support policy is not mentioned within the vulnerability announcement itself. Although CVE is not really "about" prescriptive information from vendors, it may be more likely for vendors to participate if the information displayed in CVE Records, and the information available to CVE-based tools, is closely aligned to what the vendor provides directly to customers, either within vulnerability announcements or during customer-support interactions. In other words, the approach potentially helps with CVE adoption. The hope is to develop the best practical algorithm within the context of what data providers have traditionally been willing to submit to the CVE Program. It should avoid soliciting extra information such as "{start: v4.20, status: affected}" which, in practice, is very rare to see from program participants. For example, many people who rely on the 4.19.* longterm-supported Linux kernel series are unaware of whether 4.20.x ever existed (or whether 5.0 came right after a 4.19.x version). Similarly, if a vulnerability announcement mentions a 3.4.x fix and a 3.6.x fix, does that mean that 3.5.x is "affected" and potentially important, or does it mean that odd minor-version numbers are never visible outside of the development staff? CVE Records are for vulnerabilities in released software. For purposes of CVE, it is not necessary to state which commits are associated with the vulnerability lifecycle, or to express whether any specific pre-release software came before or after a released version. Here is a very rough outline of how the schema could accept four different major types of version specification.
Semantics:
If the consumer's product version does not match any of the assessedSemverRegexp regular expressions, then the output of the algorithm is the word Unsupported. This means that the vendor is recommending against use of that version. For vulnerability management purposes, this may be treated the same as the word Affected.
Otherwise, if one regular expression is matched, and assessmentPending is found, then the output of the algorithm is the word Unknown.
Otherwise, if one regular expression is matched, and the consumer's product version is greater than or equal to the fixedStartingFrom value, then the output of the algorithm is the word Fixed.
Otherwise, if one regular expression is matched, and the consumer's product version is within any specified otherUnaffected range, then the output of the algorithm is the word Fixed.
Otherwise, the output of the algorithm is the word Affected.
Note: otherUnaffected is optional. Although producers are free to choose their own use cases, the envisioned primary use case is a situation where the vulnerability was introduced in a very recent version. Thus, there are expected to be many customer deployments that are completely safe (e.g., not affected by any CVE or any vulnerability that was silently fixed by the vendor), and therefore it's a waste of customer effort to trigger updates. In one example below, only 4.9.359 was affected.
Commercial software vendors typically only express the version numbers of new versions that have fixed a vulnerability. From the perspective of many commercial software vendors, a vulnerability announcement has two purposes: to protect customers from attacks, and to lower support costs by reducing the variety of versions deployed in the field.
example with only one assessedSemverRegexp item
example with multiple assessedSemverRegexp items
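The assessedSemverRegexp semantics above can be sketched in Python as follows. This is only an illustration: the record layout (a list of dicts, with otherUnaffected as inclusive `(low, high)` pairs), the `assess` helper, and the sample values are assumptions, not taken from any real record.

```python
import re

def parse(version):
    """Parse a dotted numeric version like "4.9.359" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def assess(entries, version):
    for entry in entries:
        # The whole version string must match the vendor's regular expression.
        if not re.fullmatch(entry["assessedSemverRegexp"], version):
            continue
        if entry.get("assessmentPending"):
            return "Unknown"
        v = parse(version)
        fixed_from = entry.get("fixedStartingFrom")
        if fixed_from is not None and v >= parse(fixed_from):
            return "Fixed"
        # otherUnaffected is optional; ranges are assumed inclusive here.
        for low, high in entry.get("otherUnaffected", []):
            if parse(low) <= v <= parse(high):
                return "Fixed"
        return "Affected"
    # No regular expression matched: the vendor does not assess this version.
    return "Unsupported"

# Made-up entries, for illustration only.
ENTRIES = [
    {
        "assessedSemverRegexp": r"4\.9\.\d+",
        "fixedStartingFrom": "4.9.360",
        "otherUnaffected": [("4.9.0", "4.9.358")],
    },
    {"assessedSemverRegexp": r"5\.\d+\.\d+", "assessmentPending": True},
]
```

A consumer that treats Unsupported the same as Affected for vulnerability management, as suggested above, can map the two results together after calling `assess`.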
Semantics:
If the customer's product version does not equal any of the assessedBaseVersion values, then the output of the algorithm is the word Unsupported. For vulnerability management, this may be treated the same as the word Affected.
Otherwise, if the customer's product version equals one of the updateOptions values, or equals one of the otherUnaffected values, then the output of the algorithm is the word Fixed.
Otherwise, the output of the algorithm is the word Affected.
Clearly, vendors who don't (or can't) provide updateOptions values will trigger many false positives (if the CVE List is the sole data source for vulnerability assessment).
This is primarily for vendors who submit CVE Records that state a set of product versions, each of which may be vulnerable depending on whether an update action has occurred (e.g., installing a service pack, fix pack, hotfix, patch, etc.). In many cases, the CVE Record does not fully describe the update action (possibly because that action is dynamically chosen based on details of a customer environment). Thus, updateOptions (a set of update actions, any of which is sufficient to fix the vulnerability) can be specified, but is optional.
example in which updateOptions is not provided
examples in which updateOptions is provided
Semantics:
If the consumer's product version was tested and found to be affected, then the output of the algorithm is the word Affected.
If the consumer's product version was tested and found to be not affected, then the output of the algorithm is the word Fixed.
Otherwise, the output of the algorithm is the word Unknown.
A. examples that may be typical of automated testing
B. examples that may be typical of manual testing
Semantics:
If the consumer's product version is a semver on the unaffectedSemverList, or a later semver, or a version on an unaffectedList, then the output of the algorithm is the word Fixed.
Otherwise, if the consumer's product version is in the specificAffected field, then the output of the algorithm is the word Affected.
Otherwise the output of the algorithm is the word Unknown (possibly accompanied by a comment).
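A rough Python sketch of these last semantics, assuming the record is a dict holding three optional lists of version strings and reading "a later semver" as "greater than or equal to some entry on unaffectedSemverList". The helper names and this reading are assumptions.

```python
def try_parse(version):
    """Return a comparable tuple for dotted numeric versions, else None."""
    parts = version.split(".")
    if all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return None

def classify(record, version):
    # Exact match against the free-form list of unaffected versions.
    if version in record.get("unaffectedList", []):
        return "Fixed"
    # A semver on unaffectedSemverList, or any later semver, counts as Fixed.
    v = try_parse(version)
    if v is not None:
        for unaffected in record.get("unaffectedSemverList", []):
            if v >= try_parse(unaffected):
                return "Fixed"
    if version in record.get("specificAffected", []):
        return "Affected"
    return "Unknown"

# Made-up record, for illustration only.
RECORD = {
    "unaffectedSemverList": ["2.4.1"],
    "unaffectedList": ["2.3-hotfix7"],
    "specificAffected": ["2.3.0", "2.4.0"],
}
```

Versions that are neither listed nor comparable, such as an unlisted build label, fall through to Unknown.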
For what it's worth, this seems self-defeating to me. I think the idea of comments and suggested upgrades are interesting, Finally, speaking from experience, regular expressions are not a good answer: |
A few comments, some of which have already been discussed but I didn't see a clear decision:
Another approach to the "Tested" list is to just stick with affected/not affected but identify the subject of the claim. Researcher can state that "version 1.1 is affected" and supplier/vendor/project can state "version 1.1 is not affected" and I can parse out that there's a disagreement and I need to go investigate. This avoids giving the vendor/project/supplier ultimate authority in the claim, in that researcher testing is inferior to vendor statements (this might often be true, but not always, to a non-trivial degree). If comprehensive testing is not assumed (i.e., not listed as affected == not affected), then a way to convey "Not affected" is useful. In this model, unlisted version implies nothing, there needs to be an explicit statement of affected or not. And another list, "Supported" (and possibly unsupported). As a consumer of this information, I'd like to know who is making the claim, what version/ranges are affected, what are not, what is unknown, and what is unsupported (or wontfix). |
The shorthand version of this schema is:

    versions: [{
        version: $version
        status: $status  // unknown, affected, unaffected; unsupported?
        versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
        repo: string (optional, intended for versionType ‘git’)
        limit: $versionLimit (this range stops just before limit; can use * for “infinity” aka "maxuint")
        changes: [{
            at: version where status changes
            status: ...
        }]
    }]

An object in the versions list can be either:
- a simple {version: V, status: S}, which indicates the status of the single version V.
- a range {version: V, versionType: T, limit: L, status: S, changes: C}, which indicates the status of the half-open interval [V, L) (that is, V is included but L is not). The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

    for entry in versions {
        if entry.limit is not present and v == entry.version {
            return entry.status
        }
        if entry.limit is present and entry.version <= v and v < entry.limit {
            status = entry.status
            for change in entry.changes {
                if change.at <= v {
                    status = change.status
                }
            }
            return status
        }
    }
    return "unknown"

Fixes CVEProject#87. Fixes CVEProject#12. Fixes CVEProject#77.
@pombredanne and @chandanbn, for what it's worth, I disagree that ranges are only human hints and can never be treated as precise by computers. It's true that you have to be careful to make them precise, and in particular you need to say what the numbering system is (versionType here) and have that system be well-defined. If it's not, then yes, the best you can do is an enumeration, perhaps sanity checked by a version control range. In Go in particular (which uses semver numbering), it is possible to generate a semver version corresponding to each commit to a repo. It would not make sense to require a CVE to enumerate every single commit when a simple (and much shorter) range can be specified instead. But we could still have git ranges and semver ranges and cross-check the meaning of the semver ranges against the git ranges. The required enumeration is also problematic for commercial software when a vendor wants to say "fixed in 5.2" and not enumerate all the prior versions that were affected. A range makes that easy to express. There may be no complete enumeration. I agree that it can be a fine approach to do both the enumeration and the ranges and have some kind of automation to cross-check them - or a semver range and a git range, again cross-checked. That works especially well for open source. But I don't believe that approach can be required of every situation. (One thing I've come to appreciate from all these discussions is the sheer breadth of situations that CVE must be able to capture.) |
@ElectricNroff, if the vendor has guaranteed all those things, I don't see a problem with the as-yet-nonexistent version 3.1.0 in:
Generally speaking, predicting the future is hard. Instead of layering additional ways to set down predictions about the future, it seems much better to make it easy for vendors to update their CVE records as new facts become known. After all, it is also true that customers may pressure the vendor to issue a fix in the 3.0 branch after all. No amount of encoding the future can account for actual changes to the expected future. Instead, we should make it easy for vendors to amend their CVE records. So it also seems fine if the vendor chooses to issue a CVE with:
and then amend the record later when fixes come out. |
The shorthand version of this schema is:

    versions: [{
        version: $version
        status: $status  // unknown, affected, unaffected; unsupported?
        versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
        repo: string (optional, intended for versionType ‘git’)
        limit: $versionLimit (this range stops just before limit; can use * for “infinity” aka "maxuint")
        changes: [{
            at: version where status changes
            status: ...
        }]
    }]

An object in the versions list can be either:

- a simple {version: V, status: S}, which indicates the status of the single version V.
- a range {version: V, versionType: T, limit: L, status: S, changes: C}, which indicates the status of the half-open interval [V, L) (that is, V is included but L is not). The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

    for entry in versions {
        if entry.limit is not present and v == entry.version {
            return entry.status
        }
        if entry.limit is present and entry.version <= v and v < entry.limit {
            status = entry.status
            for change in entry.changes {
                if change.at <= v {
                    status = change.status
                }
            }
            return status
        }
    }
    return "unknown"

Fixes CVEProject#87. Fixes CVEProject#12. Fixes CVEProject#77.
The shorthand version of this schema is:

    defaultStatus: $status
    versions: [{
        version: $version
        status: $status  // unknown, affected, unaffected; unsupported?
        versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
        repo: string (optional, intended for versionType ‘git’)
        limit: $versionLimit (this range stops just before limit; can use * for “infinity” aka "maxuint")
        changes: [{
            at: version where status changes
            status: ...
        }]
    }]

An object in the versions list can be either:

- a simple {version: V, status: S}, which indicates the status of the single version V.
- a range {version: V, versionType: T, limit: L, status: S, changes: C}, which indicates the status of the half-open interval [V, L) (that is, V is included but L is not). The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

    for entry in product.versions {
        if entry.lessThan is not present and entry.lessThanOrEqual is not present and v == entry.version {
            return entry.status
        }
        if (entry.lessThan is present and entry.version <= v and v < entry.lessThan) or
           (entry.lessThanOrEqual is present and entry.version <= v and v <= entry.lessThanOrEqual) {
            status = entry.status
            for change in entry.changes {
                if change.at <= v {
                    status = change.status
                }
            }
            return status
        }
    }
    return product.defaultStatus

Fixes CVEProject#87. Fixes CVEProject#12. Fixes CVEProject#77.
The shorthand version of this schema is:

    defaultStatus: $status
    versions: [{
        version: $version
        status: $status  // unknown, affected, unaffected
        versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
        repo: string (optional, intended for versionType ‘git’)
        lessThan/lessThanOrEqual: $version (can use * for “infinity” aka "maxuint")
        changes: [{
            at: version where status changes
            status: ...
        }]
    }]

An object in the versions list can be either:

- a simple {version: V, status: S}, which indicates the status of the single version V.
- a range {version: V, versionType: T, lessThan: L OR lessThanOrEqual: LE, status: S, changes: C}, which indicates the status of the half-open interval [V, L) or closed interval [V, LE]. The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

    for entry in product.versions {
        if entry.lessThan is not present and entry.lessThanOrEqual is not present and v == entry.version {
            return entry.status
        }
        if (entry.lessThan is present and entry.version <= v and v < entry.lessThan) or
           (entry.lessThanOrEqual is present and entry.version <= v and v <= entry.lessThanOrEqual) {
            status = entry.status
            for change in entry.changes {
                if change.at <= v {
                    status = change.status
                }
            }
            return status
        }
    }
    return product.defaultStatus

Fixes CVEProject#87. Fixes CVEProject#12. Fixes CVEProject#77.
Changes in latest PR, based on Tuesday meeting discussion:
|
Latest commit message summary: The shorthand version of this schema is:
An object in the versions list can be either:
The algorithm for deciding the status of a particular version V is then:
|
I also added 'custom' as a versionType that is not directly computable without further information. That will be necessary for upconverting the JSON 4.0 data. |
If we are adding lessThan and lessThanOrEqual to allow up-converting <=, do we need a versionAfter to allow up-converting >? I feel we are complicating the structure for backwards compatibility. |
I think it is probably important to rename limit to lessThan for clarity. I do observe that > is significantly less common in the 4.0 data than <=.
I spot-checked the "?>" entries and all the ones I looked at were Jenkins plugins that used the form:
The ?> could be dropped here since unknown would be the default anyway after saying affected in the range [1.5.2, 1.8] (using lessThanOrEqual). I also looked at the > entries and many of them appear to be bugs. For example CVE-2021-0253 says
but https://kb.juniper.net/InfoCenter/index?page=content&id=JSA11146&actp=METADATA says clearly "19.4R3 and above", so this should be ">=". So it does not seem like the case for versionAfter is anywhere near as strong as lessThanOrEqual. |
Thank you for the stats! The numbers for |
In JSON 4, "version_affected": "<=" implies that, somewhere on the timeline after version_value, an event occurs such that the status is no longer asserted to be "affected" - and "unaffected" and "unknown" are both plausible post-event statuses. Here, "the timeline" is used to mean any of the mechanisms for entering version data, e.g., changes, version, or lessThan. The argument for lessThanOrEqual in JSON 5 is:
For this last point, another entity (e.g., a commercial vulnerability-assessment product) may currently be relying on https://github.com/CVEProject/cvelist to deliver computable data to its own constituents, e.g., with a more complex algorithm such as:
If upconversion always maps <= to the same post-event status, then it's impossible for that entity (using only the JSON 5 document set) to deliver the data quality that they previously delivered. Also, having them continue to use the JSON 4 document set forever isn't a good solution because, starting sometime in 2022, the JSON 4 document set will reach end-of-life. Examples:
The situation may be less consistent when:
|
FWIW, I think that the content/diagrams in the introductory slides above should at least be referenced somewhere in the docs for the version array or in whatever User Guide we eventually create. The visualizations are very important to aid in understanding what is being done here, and understanding is important to proper usage. I generated some docs using |
@ElectricNroff, summarizing your concern: there are many CVE entries that simply have information like
alternatively:
The entries were not computable in v4, and they will not be computable in v5. IMHO that is acceptable, as this bug/pull request is not about making previously uncomputable information computable. The CNAs now have better ways to express the same information. Update: defaultStatus is set to unaffected. That gives the expected results. |
JSON 4 data that says "before" (aka the < comparison) isn't one of the hardest cases. JSON 4 data that says <= (sometimes expressed as "through v#.#.#") is a hard one. Also, I don't think either of your options for "before" would typically be used. Adjacent entries on an "at" timeline should have different statuses. Also, multiple entries of version zero and the same status can be replaced by the one entry with the highest limit (i.e., the v3 one). If the available data is that versions before 1.7.3, before 2.3.9, and before 3.2.1 are affected, then there are three upconversion options that may be reasonable choices:
Of course, only the third option can be error-free. The third option can often work well for CVE consumers who use the CVE Record data very soon after it's published (e.g., before the vendor has an opportunity to release 3.2.2). This scenario applies to CNAs who will continue to use that < data pattern in their JSON 4 documents that are published after CVE Services 2.0 has launched. |
@chandanbn I think you meant 'defaultStatus: unaffected' throughout #87 (comment) |
@ElectricNroff, with both lessThan and lessThanOrEqual as options, along with the defaultStatus we added at your earlier suggestion, it looks to me like essentially all the JSON 4 data can be encoded faithfully. There is a question of what to do with entries that don't explicitly say "version X and above are unaffected", but that's a question for the converter: whatever the answer should be, it can be encoded precisely and clearly. I can't quite tell: is your last comment arguing in favor of lessThanOrEqual, or are you saying that something else is needed as well? |
The shorthand version of this schema is:

    defaultStatus: $status (default 'unknown')
    versions: [{
        version: $version
        status: $status  // unknown, affected, unaffected
        versionType: string (‘semver’, ‘git’, ..., to define meaning of <)
        repo: string (optional, intended for versionType ‘git’)
        lessThan/lessThanOrEqual: $version (can use * for “infinity” aka "maxuint")
        changes: [{
            at: version where status changes
            status: ...
        }]
    }]

An object in the versions list can be either:

- a simple {version: V, status: S}, which indicates the status of the single version V.
- a range {version: V, versionType: T, lessThan: L OR lessThanOrEqual: LE, status: S, changes: C}, which indicates the status of the half-open interval [V, L) or closed interval [V, LE]. The range starts with V having status S and then changes over time according to the events listed in C.

The algorithm for deciding the status of a particular version V is then:

    for entry in product.versions {
        if entry.lessThan is not present and entry.lessThanOrEqual is not present and v == entry.version {
            return entry.status
        }
        if (entry.lessThan is present and entry.version <= v and v < entry.lessThan) or
           (entry.lessThanOrEqual is present and entry.version <= v and v <= entry.lessThanOrEqual) {
            status = entry.status
            for change in entry.changes {
                if change.at <= v {
                    status = change.status
                }
            }
            return status
        }
    }
    return product.defaultStatus

Versions or defaultStatus may be omitted, but not both.

Fixes CVEProject#87. Fixes CVEProject#12. Fixes CVEProject#77.
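As a rough illustration, the pseudocode in the commit message could be written in Python along the following lines. This is a minimal sketch, not the CVE Program's implementation: the helper `parse`, the function name `version_status`, and the `SAMPLE` record are hypothetical, and the ordering assumed here is plain dotted integers (with `*` standing for infinity) rather than a full semver or git ordering.

```python
def parse(v: str) -> tuple:
    """Hypothetical ordering: dotted integers only; '*' sorts after everything."""
    if v == "*":
        return (float("inf"),)
    return tuple(int(part) for part in v.split("."))


def version_status(product: dict, v: str) -> str:
    """Return the status of version v, following the first matching entry."""
    pv = parse(v)
    for entry in product.get("versions", []):
        start = parse(entry["version"])
        has_lt = "lessThan" in entry
        has_le = "lessThanOrEqual" in entry
        # Single-version entry: matches only on exact equality.
        if not has_lt and not has_le:
            if pv == start:
                return entry["status"]
            continue
        # Range entry: half-open [V, L) or closed [V, LE].
        in_range = (has_lt and start <= pv < parse(entry["lessThan"])) or (
            has_le and start <= pv <= parse(entry["lessThanOrEqual"])
        )
        if in_range:
            status = entry["status"]
            # Changes are assumed to be listed in increasing version order.
            for change in entry.get("changes", []):
                if parse(change["at"]) <= pv:
                    status = change["status"]
            return status
    # defaultStatus itself defaults to 'unknown' when omitted.
    return product.get("defaultStatus", "unknown")


# Hypothetical record, invented for illustration only.
SAMPLE = {
    "defaultStatus": "unaffected",
    "versions": [
        {"version": "1.0.0", "status": "unknown"},
        {
            "version": "2.0.0",
            "lessThan": "4.0.0",
            "status": "affected",
            "changes": [
                {"at": "2.5.1", "status": "unaffected"},
                {"at": "3.0.0", "status": "affected"},
                {"at": "3.2.1", "status": "unaffected"},
            ],
        },
    ],
}
```

With this record, 2.4.0 and 3.1.0 come out as affected, 2.9.0 and 3.5.0 as unaffected, and 4.0.0 falls through to the default status. Note that the simplified `parse` treats "1.0" and "1.0.0" as different versions, which a real versionType definition would have to settle.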
I feel that the current design (e.g., with defaultStatus, lessThan, and lessThanOrEqual) is adequate, but that (when reasonably achievable) the upconverter should avoid adding explicit assertions that weren't present in the JSON 4 data. For example, from the perspective of the algorithm used by the CVE Program, these two (which could be chosen for <= 3.2.1 in JSON 4 data) are exactly equivalent:
The reason that the first one is preferable is that a different entity (e.g., a commercial vulnerability-assessment product) may have the resources to develop their own algorithm that replaces:
with something like:
if their customers demand that (and if Adobe was unwilling to change the data). In other words, immediately before the "return product.defaultStatus" line is a hook point that third parties can use to insert their own code. In an actual use case, the third party would have to start from the algorithm pseudocode and implement a modified version on their own. The CVE Program isn't planning to package the algorithm as a standalone software product (and, even if it did, the product wouldn't ship with a supported extension framework). |
Follow-up work for CVEProject#87 and CVEProject#88: an introduction to the new product and version schemas. Posted for easier reading at https://gist.github.com/rsc/0b448f99e73bf745eeca1319d882efb2.
Background
The OSV schema has been adopted by Go, OSV, Python, Rust, and UVI to describe vulnerabilities in open-source software. The OSV schema’s key advantage over the CVE format is that it identifies the specific affected packages and versions in a precise, computable way.
For example, suppose we wanted to check whether a particular software package, as described by an SBOM, made use of any open-source components with known vulnerabilities. An SBOM for a given package ecosystem would be a list of its packages and versions. A tool can test whether each SBOM entry is affected by a database entry written to the OSV schema, without any additional information (such as a version or commit graph or access to the repository containing the source code for the open-source software). This is what we mean when we say the package and version identification is computable.
We propose that the new CVE JSON schema be changed to make its package and version identification computable too. This would make it possible for vulnerability-checking tools to check SBOMs against the CVE database as easily as they can currently check SBOMs against OSV-schema databases. Adjusting the CVE JSON schema would also allow OSV-schema databases to embed their information into CVE format, allowing all their vulnerability information to be pushed upstream to the CVE database and then propagated to any CVE-aware software, a net benefit for the entire software ecosystem.
This issue focuses on computable version identification. See issue #86 for computable package identification.
Computable version identification
After identifying that a particular package listed in an SBOM matches a package in a CVE database entry (#NNN), a vulnerability scanner must next identify whether the specific version in the SBOM is considered affected by the CVE. The entry must include self-contained information sufficient to make this decision algorithmically. The current schema does not satisfy this requirement (or else it is unclear how it does).
What is the algorithm for deciding if a version is considered affected? The current spec does not provide details on how to evaluate the rules. At the start, it is unclear whether the “versions” list must be grouped by “versionGroup” before further processing, so we’ll suppose there is a single group in our examples. It is also unclear which logical operator to apply to the version entries. Issue #12 says that rules should be evaluated with AND, which makes it impossible to list individual versions. For example:
The explanation in #12 is that this means “version = 1.0.0 AND version = 1.1.0”, which doesn’t match any version at all.
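To make the difference concrete, here is a small Python sketch that evaluates a list of exact-version rules first with AND, as issue #12 describes, and then with OR. The function names and the candidate list are illustrative only.

```python
# Two exact-version rules, as in "version = 1.0.0" and "version = 1.1.0".
LISTED = ["1.0.0", "1.1.0"]


def matches_with_and(v: str) -> bool:
    """AND semantics: v must equal every listed version at once."""
    return all(v == listed for listed in LISTED)


def matches_with_or(v: str) -> bool:
    """OR semantics: v must equal at least one listed version."""
    return any(v == listed for listed in LISTED)


candidates = ["0.9.0", "1.0.0", "1.1.0", "1.2.0"]
and_hits = [v for v in candidates if matches_with_and(v)]
or_hits = [v for v in candidates if matches_with_or(v)]
```

Under AND, `and_hits` is empty, because no version can equal two different values; under OR, `or_hits` contains exactly the two listed versions.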
According to the answer in #12, expressing multiple disjoint ranges of versions is also not possible. For example:
Here it seems clear the intended interpretation would be
but there is no obvious way to encode this. Using ! operators would also not work. There is no boolean normal form with only one logical operator (that is, only AND, or only OR).
A second, related problem with the current schema is that even the definitions of operators like “>=” are not algorithmically precise. Clearly these are not string comparisons: 1.2.0 < 1.10.0. But neither are they simple element-wise comparisons: in packagers using Semver, 1.2.0 > 1.2.0-alpha. In Maven, even the alphabetic parts do not compare with strict regularity. In particular, this ordering applies:
An operator like “>=” cannot be applied without reference to a particular version ordering algorithm, and the CVE schema omits that information.
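The following Python sketch illustrates why the ordering has to be named. It compares the same versions as plain strings and with a simplified Semver precedence key. The function `semver_key` is a hypothetical helper written for this example; it ignores build metadata and does no validation, so it is not a complete Semver implementation, and it says nothing about Maven's rules.

```python
def semver_key(v: str) -> tuple:
    """Simplified Semver precedence key (no build metadata, no validation)."""
    core, _, pre = v.partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    # Numeric pre-release identifiers sort below alphanumeric ones.
    identifiers = tuple(
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)
        for ident in pre.split(".")
    ) if pre else ()
    # A release (no pre-release part) sorts after its own pre-releases.
    return (numbers, 0 if pre else 1, identifiers)


# String comparison gets the order wrong: "2" sorts after "1".
string_order = "1.2.0" < "1.10.0"

# Numeric comparison of the components gets it right.
semver_order = semver_key("1.2.0") < semver_key("1.10.0")

# Element-wise numbers alone cannot express pre-releases.
prerelease_order = semver_key("1.2.0-alpha") < semver_key("1.2.0")
```

Here `string_order` is `False` while `semver_order` and `prerelease_order` are both `True`, so the answer to "is 1.2.0 less than 1.10.0?" depends entirely on which ordering a tool assumes.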
The different operator variants are also confusing. For example, is there any difference between these two?
Or is this one any different from those two?
The result of “is this version affected?” should be a boolean yes/no, or at worst yes/no/maybe, but the current operators allow yes/no/maybe/undocumented, with no guidance as to what CVEs should do. Should tools treat “no” differently from “undocumented”? Is it a best practice to document all the negative ranges too? Why?
The CVE schema needs to address these deficiencies so that tools have clear algorithms for deciding whether a particular version is affected by a particular CVE.
OSV’s solution
The OSV schema addresses all these ambiguities as follows, and we suggest that CVE adopt its basic ideas. This is not the only possible solution, but we believe it is a good one.
The OSV schema supports both an enumeration of specific affected versions and an enumeration of specific affected ranges. The set of affected versions is the OR of the entries in these lists - there is never an AND.
A range specifies a contiguous range of versions according to some defined version ordering. Today, those are “SEMVER” (preferred), “GIT”, and “ECOSYSTEM”. The “GIT” and “ECOSYSTEM” (meaning “packager-defined ordering”) range types are not directly understandable by general-purpose tools; such ranges are extra information understandable only by special-purpose tools. A particular entry is required to ensure that all affected versions are either listed in the explicit enumeration or in a Semver-type range, both of which can be processed by standard, packager-independent algorithms.
Each range is an object with three fields: type (the ordering), introduced, and fixed. The affected versions are those >= introduced and < fixed. If either introduced or fixed is omitted, that end of the range is left open.
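A minimal Python sketch of this matching rule follows, using the range shape as described in this issue (type, introduced, fixed) and the three ranges from the example earlier in this thread. The function names are hypothetical, the `key` helper assumes plain dotted-integer versions, and the sketch ignores the range type, so treat it as an illustration rather than a reference implementation of the OSV schema.

```python
def key(v: str) -> tuple:
    """Assumed ordering for this sketch: dotted integers, e.g. '2.10' -> (2, 10)."""
    return tuple(int(part) for part in v.split("."))


def osv_affected(affects: dict, v: str) -> bool:
    """True if v is enumerated explicitly or falls inside any range (OR, never AND)."""
    if v in affects.get("versions", []):
        return True
    for rng in affects.get("ranges", []):
        introduced = rng.get("introduced")  # omitted: open at the low end
        fixed = rng.get("fixed")            # omitted: open at the high end
        above_low = introduced is None or key(introduced) <= key(v)
        below_high = fixed is None or key(v) < key(fixed)
        if above_low and below_high:
            return True
    return False


AFFECTS = {
    "ranges": [
        {"type": "ECOSYSTEM", "introduced": "2.9", "fixed": "2.10"},
        {"type": "ECOSYSTEM", "introduced": "3.0", "fixed": "3.2"},
        {"type": "ECOSYSTEM", "introduced": "4.0", "fixed": "4.3"},
    ]
}
```

With these ranges, 2.9, 3.1, and 4.2 are reported as affected, while 2.8, 2.10, 3.2, and 4.3 are not, because each range includes its introduced version and excludes its fixed version.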
For packagers that use Semver ordering, such as Go, NPM, and Rust, it suffices to specify only ranges:
For packagers that use other orderings, a packager-specific range can be listed, but the packager’s own vulnerability database tooling must “compile out” the range into an explicit list as well, for consumption by general-purpose tools, as in this Python example:
(The “GIT” range has an additional field “repo” to specify the URL of the source repository containing the given commits.)
The “versions” list specifies the same versions as in the “ECOSYSTEM” range, just in a more accessible way. General-purpose tooling would ignore the “GIT” and “ECOSYSTEM” ranges, relying instead on the “versions” list in this case.
Potential CVE adaptation
We propose to change the current version schema from:
to:
The only combining operator is OR, making the algorithm for matching much clearer. A particular version would be considered affected if it is matched by any of the entries in the overall “versions” object list. A version is matched by an entry if it appears directly in the “list” or if it is in the “range”. This structure allows non-standard ranges to include their version lists in the same object, which is an improvement over the OSV schema, and it allows a particular range or list to be qualified by a “platform” list as well.
The “unsure” entry allows a range or list to be marked as unsure, equivalent to using the current ?>= and similar operators.
The current !>= and similar operators are removed: to say that a version is unaffected, leave it unlisted.