ML + Dataset BoM spec clarifications and feedback #229
Comments
@willarmiros. First, thanks for the feedback. Much appreciated. Every URL in CycloneDX can either be a URL or a BOM-Link. So you could have something like:

    "datasets": [
      {
        "type": "dataset",
        "name": "Training Data",
        "contents": {
          "url": "urn:cdx:f08a6ccd-4dce-4759-bd84-c626675d60a7/1#MyDataset"
        },
        "classification": "public"
      }
    ]

If
purl is already supported. So for the model card component, simply add the purl to the purl field. That's why we've designed it the way it is, so that all of the other component fields can be used for ML.
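
As a minimal sketch of adding the purl to the purl field of a model component, the snippet below builds a component as a plain Python dict. It assumes the `machine-learning-model` component type from CycloneDX 1.5 and the `pkg:huggingface/...@<revision>` form from the purl-spec. The model name and revision are hypothetical placeholders, and the helper names are made up for this illustration.

```python
import json

# Hypothetical model identifier and Git revision, for illustration only.
MODEL_NAME = "example-org/example-model"
MODEL_REVISION = "0123456789ABCDEF0123456789ABCDEF01234567"


def huggingface_purl(name: str, revision: str) -> str:
    """Build a purl of the form pkg:huggingface/<namespace>/<name>@<revision>.

    The revision is lowercased, as the purl-spec asks for this type.
    """
    return f"pkg:huggingface/{name}@{revision.lower()}"


def model_component(name: str, revision: str) -> dict:
    """Return a CycloneDX-style component dict that carries the purl."""
    return {
        "type": "machine-learning-model",
        "bom-ref": f"model-{name.replace('/', '-')}",  # hypothetical bom-ref scheme
        "name": name.split("/")[-1],
        "version": revision.lower(),
        "purl": huggingface_purl(name, revision),
    }


component = model_component(MODEL_NAME, MODEL_REVISION)
print(json.dumps(component, indent=2))
```

Because the purl lives on the ordinary component, the other component fields (hashes, licenses, external references) remain available for the model as well.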
If those registries are API-compatible with HuggingFace, then we can continue to use the HuggingFace purl type and simply supply an alternative
CycloneDX does not have a dedicated field for hyperparameters. We envisioned that those would be captured in CycloneDX Formulation support, which documents the precise steps necessary to build or deploy something. In essence, it tries to capture everything necessary for true reproduction. In the case of ML, formulation could be used to describe the precise steps taken during training, evaluation, etc. It's a much more granular approach than what SPDX takes, which seems to be overly simplified. To use, you would simply add a new externalReference of type
Does that meet your requirements?
It is good to know that URLs can be BOM-Links rather than only internet URLs, so we could link to our file component(s) in the BOM. We were considering suggesting that the contents field become an array, but we first wanted to check in about the other approaches, such as using nested components or dependencies to represent the dataset -> file(s) relationship. But if the recommended way to model this in CycloneDX is using the
I was not aware of the formulation support feature described in #31. I will take a look at #222 to see if this makes sense for our case.
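
For readers following the BOM-Link part of this exchange, here is a rough sketch of splitting a BOM-Link into its parts. It assumes the `urn:cdx:<serialNumber>/<version>#<bom-ref>` shape seen in the example earlier in the thread; the `BomLink` type and `parse_bom_link` helper are hypothetical names, not part of any CycloneDX library.

```python
import re
import uuid
from typing import NamedTuple, Optional
from urllib.parse import unquote


class BomLink(NamedTuple):
    serial_number: uuid.UUID   # UUID of the BOM being referenced
    version: int               # version of that BOM
    bom_ref: Optional[str]     # element inside the BOM, if the link has a fragment


# Assumed shape: urn:cdx:<uuid>/<version> with an optional #<bom-ref> fragment.
_BOM_LINK = re.compile(
    r"^urn:cdx:(?P<serial>[0-9a-fA-F-]{36})/(?P<version>[1-9][0-9]*)(?:#(?P<ref>.+))?$"
)


def parse_bom_link(value: str) -> BomLink:
    """Parse a BOM-Link string, raising ValueError if it does not match."""
    match = _BOM_LINK.match(value)
    if match is None:
        raise ValueError(f"not a BOM-Link: {value!r}")
    ref = match["ref"]
    return BomLink(
        serial_number=uuid.UUID(match["serial"]),
        version=int(match["version"]),
        bom_ref=unquote(ref) if ref else None,  # fragment may be percent-encoded
    )


link = parse_bom_link("urn:cdx:f08a6ccd-4dce-4759-bd84-c626675d60a7/1#MyDataset")
print(link.serial_number, link.version, link.bom_ref)
```

A tool that accepts either kind of URL could try this parser first and treat the value as an ordinary URL when it raises `ValueError`.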
With CDX Spec 1.5, the documentation regarding BOM-Link capabilities was made much clearer. @willarmiros, what is your status? Any updates or remarks from your side?
Hi @jkowalleck, thanks for following up! It seems like 1.5 has addressed our concerns, and #236 helped clarify the use of BOM links, which we were initially confused about, so I can go ahead and close this out. Will open another issue if we have further troubles!
I work at Protect AI. We are building tooling for ML teams to build AI/ML BOMs programmatically. We are actively looking into offering CycloneDX-compliant BOMs for this purpose. We have some questions and feedback on the changes introduced by #209 in the upcoming v1.5 spec.
It is unclear how to associate ComponentData with Dataset files
In the new Data component, there is a ComponentDataContents field that has a single URL. In our experience, however, a single logical dataset could be composed of several distinct files, each with its own URL, name, version, hash, etc. One way of representing this would be to create several File subcomponents nested within the Data component, since the files are just pieces that make up the Dataset. Another way is to use dependencies to show that a Dataset component depends on one or more File components. However, both of these approaches would circumvent using the ComponentDataContents field at all, so we are wondering:
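
To make the nested-files option concrete, here is one possible sketch that combines it with the BOM-Link form from the maintainer's reply in this thread. It is only one reading of that reply, not a confirmed recommendation. The file names, URLs, and contents are hypothetical placeholders; the serial number and `MyDataset` reference are reused from the example in the thread; in a real BOM the `datasets` entry would sit inside the model card of a model component.

```python
import hashlib
import json

# Serial number, version, and bom-ref reused from the BOM-Link example in the thread.
BOM_SERIAL = "f08a6ccd-4dce-4759-bd84-c626675d60a7"
BOM_VERSION = 1
DATASET_REF = "MyDataset"

# Hypothetical dataset files; names, URLs, and contents are placeholders.
FILES = [
    {"name": "train-000.parquet", "url": "https://example.com/data/train-000.parquet", "content": b"shard 0"},
    {"name": "train-001.parquet", "url": "https://example.com/data/train-001.parquet", "content": b"shard 1"},
]


def file_component(entry: dict) -> dict:
    """One File component per physical file, with its own hash and location."""
    digest = hashlib.sha256(entry["content"]).hexdigest()
    return {
        "type": "file",
        "name": entry["name"],
        "hashes": [{"alg": "SHA-256", "content": digest}],
        "externalReferences": [{"type": "distribution", "url": entry["url"]}],
    }


def dataset_component(ref: str, name: str, files: list) -> dict:
    """A Data component whose nested components are the files that make it up."""
    return {
        "type": "data",
        "bom-ref": ref,
        "name": name,
        "components": [file_component(f) for f in files],
    }


def dataset_entry(name: str, serial: str, version: int, ref: str) -> dict:
    """Dataset entry whose single contents URL is a BOM-Link to the Data component."""
    return {
        "type": "dataset",
        "name": name,
        "contents": {"url": f"urn:cdx:{serial}/{version}#{ref}"},
        "classification": "public",
    }


bom_fragment = {
    "components": [dataset_component(DATASET_REF, "Training Data", FILES)],
    # Would live under the model card of a model component in a full BOM.
    "datasets": [dataset_entry("Training Data", BOM_SERIAL, BOM_VERSION, DATASET_REF)],
}
print(json.dumps(bom_fragment, indent=2))
```

In this shape the single contents URL still gets used, while per-file details such as hashes stay on the File components.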
Are there concerns with using a purl for model locations?
It is a common practice to serialize a trained model and store it as a file in a model registry or in cloud storage. We plan to use a purl to locate these models, because there is already some support for them in the purl-spec added in package-url/purl-spec#201. For future model registries like KubeFlow, SageMaker (and more) do we anticipate those will need further purl-spec updates? Alternatively, we can come up with custom schema not defined in the purl-spec.
Include hyperparameters in ML Model component
Hyperparameters are key to reproducing a given ML Model. The component for a Model should capture this data, especially since changing hyperparameters can significantly change the behavior of the model. The SPDX SBOM specification for AI Models includes a hyperparameters entry.
cc @iamfaisalkhan @badarahmed for visibility