ML + Dataset BoM spec clarifications and feedback #229

Closed
willarmiros opened this issue May 24, 2023 · 4 comments
Labels
CDX 1.5 related to release v1.5

Comments

@willarmiros

I work at Protect AI. We are building tooling for ML teams to build AI/ML BOMs programmatically. We are actively looking into offering CycloneDX-compliant BOMs for this purpose. We have some questions and feedback on the changes introduced by #209 in the upcoming v1.5 spec.

It is unclear how to associate ComponentData with Dataset files

In the new Data component, there is a ComponentDataContents field that has a single URL. In our experience, however, a single logical dataset can be composed of several distinct files, each with its own URL, name, version, hash, etc. One way of representing this would be to create several File subcomponents nested within the Data component, since the files are just pieces that make up the Dataset. Another way is to use dependencies to show that a Dataset component depends on one or more File components. However, both of these approaches would circumvent the ComponentDataContents field entirely, so we are wondering:

  1. Which approach better leverages the CycloneDX spec?
  2. If one of the suggested approaches is preferred, what purpose does ComponentDataContents serve?
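
For concreteness, the two alternatives described above can be sketched as plain dictionaries. This is only an illustration of the question, not a recommended layout: the file names, versions, and bom-ref values are hypothetical, and the field names (`components`, `dependencies`, `ref`, `dependsOn`) reflect my reading of the CycloneDX JSON format.

```python
import json

# Hypothetical files that together make up one logical dataset.
FILES = [
    {"bom-ref": "file-a", "name": "part-a.csv", "version": "1"},
    {"bom-ref": "file-b", "name": "part-b.csv", "version": "2"},
]


def nested_dataset(ref, files):
    """Approach 1: file components nested inside the data component."""
    return {
        "type": "data",
        "bom-ref": ref,
        "name": ref,
        "components": [dict(f, type="file") for f in files],
    }


def dataset_with_dependencies(ref, files):
    """Approach 2: flat components plus a dependency-graph entry."""
    components = [{"type": "data", "bom-ref": ref, "name": ref}]
    components += [dict(f, type="file") for f in files]
    dependencies = [{"ref": ref, "dependsOn": [f["bom-ref"] for f in files]}]
    return components, dependencies


nested = nested_dataset("my-dataset", FILES)
flat_components, flat_dependencies = dataset_with_dependencies("my-dataset", FILES)
print(json.dumps(nested, indent=2))
print(json.dumps(flat_dependencies, indent=2))
```

In both shapes the per-file metadata lives on the file components, which is why neither one touches ComponentDataContents.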

Are there concerns with using a purl for model locations?

It is common practice to serialize a trained model and store it as a file in a model registry or in cloud storage. We plan to use a purl to locate these models, because there is already some support for them in the purl-spec, added in package-url/purl-spec#201. For future model registries like KubeFlow, SageMaker (and more), do we anticipate that those will need further purl-spec updates? Alternatively, we can come up with a custom schema not defined in the purl-spec.

Include hyperparameters in ML Model component

Hyperparameters are key to reproducing a given ML Model. The component for a Model should capture this data, especially since changing hyperparameters can significantly change the behavior of the model. The SPDX SBOM specification for AI Models includes a hyperparameters entry.

cc @iamfaisalkhan @badarahmed for visibility

@stevespringett
Member

@willarmiros. First, thanks for the feedback. Much appreciated. Every URL in CycloneDX can either be a URL or a BOM-Link. So you could have something like:

"datasets": [
  {
    "type": "dataset",
    "name": "Training Data",
    "contents": {
      "url": "urn:cdx:f08a6ccd-4dce-4759-bd84-c626675d60a7/1#MyDataset"
    },
    "classification": "public"
  }
]

If MyDataset were a data component, you could then have multiple file objects within it. This would be the recommended approach if the files all have different versions, provenance, etc. Theoretically, we could change contents to be an array, which would simplify things a bit, but you would not be able to have data files with different metadata. If you think making contents into an array would help, I can quickly make that change.
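
As a rough sketch of how a tool might follow such a BOM-Link to the data component and its files: the helpers below assume the `urn:cdx:<serialNumber>/<version>#<bom-ref>` shape shown above and assume the fragment is percent-encoded. The function names, the file names, and the target BOM are hypothetical and only for illustration.

```python
import re
from urllib.parse import quote, unquote

# Assumed BOM-Link shape: urn:cdx:<serial UUID>/<BOM version>#<bom-ref>
_BOM_LINK = re.compile(
    r"^urn:cdx:(?P<serial>[0-9a-fA-F-]{36})/(?P<version>[0-9]+)(?:#(?P<ref>.+))?$"
)


def make_bom_link(serial, version, bom_ref=None):
    """Build a BOM-Link; the bom-ref fragment is optional."""
    link = f"urn:cdx:{serial}/{version}"
    if bom_ref is not None:
        link += "#" + quote(bom_ref, safe="")
    return link


def parse_bom_link(link):
    """Split a BOM-Link into (serial, version, bom_ref or None)."""
    m = _BOM_LINK.match(link)
    if m is None:
        raise ValueError(f"not a BOM-Link: {link!r}")
    ref = m.group("ref")
    return m.group("serial"), int(m.group("version")), unquote(ref) if ref else None


def find_component(components, bom_ref):
    """Depth-first search through components and their nested components."""
    for comp in components:
        if comp.get("bom-ref") == bom_ref:
            return comp
        found = find_component(comp.get("components", []), bom_ref)
        if found is not None:
            return found
    return None


def resolve(link, bom):
    """Return the component the link points at, or None if it is not in this BOM."""
    serial, version, ref = parse_bom_link(link)
    if ref is None:
        return None
    if bom.get("serialNumber") != f"urn:uuid:{serial}" or bom.get("version") != version:
        return None
    return find_component(bom.get("components", []), ref)


# Hypothetical target BOM holding the dataset and its files.
target_bom = {
    "serialNumber": "urn:uuid:f08a6ccd-4dce-4759-bd84-c626675d60a7",
    "version": 1,
    "components": [{
        "type": "data",
        "bom-ref": "MyDataset",
        "name": "Training Data",
        "components": [
            {"type": "file", "name": "part-a.csv", "version": "1"},
            {"type": "file", "name": "part-b.csv", "version": "2"},
        ],
    }],
}

link = make_bom_link("f08a6ccd-4dce-4759-bd84-c626675d60a7", 1, "MyDataset")
dataset = resolve(link, target_bom)
print(link, [f["name"] for f in dataset["components"]])
```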

> Are there concerns with using a purl for model locations?

purl is already supported. So for the model card component, simply add the purl to the purl field. That's why we've designed it the way it is, so that all of the other component fields can be used for ML.

> For future model registries like KubeFlow, SageMaker (and more) do we anticipate those will need further purl-spec updates?

If those registries are API-compatible with HuggingFace, then we can continue to use the HuggingFace purl type and simply supply an alternative repository_url. If they are not API-compatible, then they will need their own purl types.
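
A minimal sketch of what that could look like, assuming the `pkg:huggingface/<namespace>/<name>@<revision>` layout with an optional `repository_url` qualifier. The helper name, the organization, the model name, the revision hash, and the registry URL are all hypothetical, and leaving `:` and `/` unescaped in the qualifier value is an assumption about how the purl should be rendered.

```python
from urllib.parse import quote


def huggingface_purl(name, revision, namespace=None, repository_url=None):
    """Assemble a huggingface-type purl string by hand (illustrative only)."""
    path = f"{namespace}/{name}" if namespace else name
    purl = f"pkg:huggingface/{path}@{revision}"
    if repository_url:
        # Assumption: ':' and '/' stay unescaped in the qualifier value.
        purl += "?repository_url=" + quote(repository_url, safe=":/")
    return purl


# Hypothetical model hosted on a HuggingFace-compatible registry.
purl = huggingface_purl(
    name="example-model",
    revision="0123456789abcdef0123456789abcdef01234567",
    namespace="example-org",
    repository_url="https://models.example.com",
)
print(purl)
```

A real tool would more likely use an existing purl library than string formatting; the point here is only where the alternative registry goes.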

CycloneDX does not have a dedicated field for hyperparameters. We envisioned that those would be captured in CycloneDX Formulation support, which documents the precise steps necessary to build or deploy something. In essence, it tries to capture everything necessary for true reproduction. In the case of ML, formulation could be used to describe the precise steps taken during training, evaluation, etc. It's a much more granular approach than what SPDX takes, which seems to be overly simplified.

To use, you would simply add a new externalReference of type formulation to the ML component and supply the bom-ref to the formula that describes how the model was built.
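
A sketch of that wiring, under several assumptions: the external reference points at the formula through a BOM-Link to the same BOM, and the hyperparameters are stored as name/value properties on the formula. The thread does not say where inside a formula hyperparameters belong, so the property names, the bom-ref values, the model name, and the `resolve_formulation` helper are all hypothetical.

```python
import json

SERIAL = "f08a6ccd-4dce-4759-bd84-c626675d60a7"  # serial reused from the example above
BOM_VERSION = 1

# Hypothetical formula; storing hyperparameters as properties is an assumption.
formula = {
    "bom-ref": "training-formula",
    "properties": [
        {"name": "hyperparameter:learning_rate", "value": "0.001"},
        {"name": "hyperparameter:batch_size", "value": "32"},
    ],
}

model = {
    "type": "machine-learning-model",
    "bom-ref": "example-model",
    "name": "example-model",
    "externalReferences": [
        {
            "type": "formulation",
            "url": f"urn:cdx:{SERIAL}/{BOM_VERSION}#{formula['bom-ref']}",
        }
    ],
}

bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "serialNumber": f"urn:uuid:{SERIAL}",
    "version": BOM_VERSION,
    "components": [model],
    "formulation": [formula],
}


def resolve_formulation(bom, component):
    """Follow a component's formulation reference to the formula in the same BOM."""
    for ext in component.get("externalReferences", []):
        if ext.get("type") != "formulation":
            continue
        ref = ext["url"].rsplit("#", 1)[-1]
        for candidate in bom.get("formulation", []):
            if candidate.get("bom-ref") == ref:
                return candidate
    return None


resolved = resolve_formulation(bom, model)
print(json.dumps(resolved["properties"], indent=2))
```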

Does that meet your requirements?

@willarmiros
Author

> If you think making contents into an array would help, I can quickly make that change.

It is good to know that URLs can be BOM-Links instead of just internet URLs, so we could link to our file component(s) in the BOM. We were considering suggesting that the contents field be changed to an array, but we wanted to first check in about the other approaches, such as using nested components or dependencies to represent the dataset -> file(s) relationship. But if the recommended way to model this in CycloneDX is the contents field (once it's converted to an array), we should be able to go with that approach.

> We envisioned that those would be captured in CycloneDX Formulation support

I was not aware of the formulation support feature described in #31. I will take a look at #222 to see if this makes sense for our case.

@jkowalleck
Member

With CDX Spec 1.5, the documentation regarding BOM-Link capabilities was made much clearer, and the needed ML features were added to the document standard.

@willarmiros, what is your status? Any updates or remarks from your side?
Your feedback is appreciated.

@jkowalleck added the "CDX 1.5" (related to release v1.5) label Jul 12, 2023
@willarmiros
Author

Hi @jkowalleck, thanks for following up! It seems like 1.5 has addressed our concerns, and #236 helped clarify the use of BOM-Links, which we were initially confused about, so I can go ahead and close this out. Will open another issue if we have further troubles!

3 participants