Support component integrity verification #699

Closed
12 tasks
mehab opened this issue Jul 25, 2023 · 7 comments
Assignees
Labels
component/api-server domain/repo-meta-analysis enhancement New feature or request p2 Non-critical bugs, and features that help organizations to identify and reduce risk size/L High effort

Comments

@mehab
Collaborator

mehab commented Jul 25, 2023

This issue expands on an issue in upstream Dependency-Track. An initial POC for this has been completed and demoed using hyades-apiserver and hyades.
The features below need to be addressed as part of the actual implementation:

Tasks

@mehab
Collaborator Author

mehab commented Sep 18, 2023

Based on the PR review for #727, we also want to store the published date from the same endpoint that is used to fetch integrity information for packages.
However, per the current design, the integrity check is an optional check that the user can enable or disable for a repository of choice; the integrity check is then performed for components using the configured repository as the source of truth.
This is viable for integrity checks alone.
But the published date is a field that we want for all components, all the time, and it is not optional. Also, the repository from which a component is actually fetched should be used to get its published date. Thus we cannot use the integrity-check external call as-is to support the published date feature.
There are a few options for how we could get the published date:

  1. Use separate repositories to get published date information for packages. For example, we could use Maven Central to get the published date for Maven packages. In this case the user can configure only one source from which the published date is fetched; if the user does not select a source, we use Maven Central. Here the integrity-check fetch is separate from the published-date configuration. Artifactory could also be used in this way.
  2. Use deps.dev as the central source of truth for both the integrity check and the published date. In this case, allow the user to configure the integrity check as currently available, but the published date is always fetched for new components from deps.dev; this is not configurable by the user.
    The downside is that fetching the published date requires an additional call to a similar endpoint whenever its configuration is not the same as the one used for the integrity check.
    The date was found to be part of the HEAD request response, as was done for the integrity check; the header to look for in this call is x-modified..
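The HEAD-based lookup can be sketched as a small helper that pulls the published date out of the response headers. This is an illustrative sketch only: the exact header name is truncated in the notes above ("x-modified.."), so the candidate names checked here are assumptions.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed candidate header names; the notes above only record the
# truncated name "x-modified..", so these are placeholders.
PUBLISHED_DATE_HEADERS = ("x-last-modified", "last-modified")

def published_date_from_head(headers: dict) -> Optional[datetime]:
    """Extract a component's published date from HEAD response headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in PUBLISHED_DATE_HEADERS:
        value = lowered.get(name)
        if value:
            # HTTP dates use RFC 7231 format, e.g. "Tue, 15 Nov 1994 12:45:26 GMT".
            return parsedate_to_datetime(value)
    return None
```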

@mehab
Collaborator Author

mehab commented Sep 18, 2023

Meeting notes from discussion on September 18 2023.

Use Cases

Integrity Verification

  • Detection of "smuggled" packages (Replacing of packages in the package repository)
    • Modified package was resolved from internal repo during build, and is included in SBOM
    • Comparison of package hashes from SBOM with hashes from Maven Central will yield a mismatch
  • Detection of man-in-the-middle'd packages
    • Bad actor replaces packages during build when package manager fetches them from repository
  • Nice-to-have: There should be metrics of how many integrity violations occurred per component / project
    • Doesn’t add too much value because metrics in the UI are not used that much
  • It is not necessary to configure repositories specifically for integrity checks; Use the same repository for latest version check, integrity check, published date
  • Integrity check enabled/disabled should be a feature flag at first
    • When disabled, analysis and metrics inclusion would not happen
    • Fetching of the data (hashes, published date) will happen no matter if enabled or not

Multiple Repositories for integrity verification

  • Being able to use Artifactory / other custom repositories
  • Configuration would be priority-based, such that Maven Central can be assigned a higher priority than internal repos
    • Fall-through logic; Iterate through all repos, return on first match, otherwise proceed to next repo
  • Assumption: Only a single repository should match, we only need to store results from one, and not multiple
  • Future: deps.dev could be added as another repository

pkg:maven/com.citi/citi-lib → Artifactory (Internal)
pkg:maven/org.springframework/spring-core → Maven Central
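The priority-based fall-through described above can be sketched as follows; the `Repository` class, the `fetch` callback, and the `priority` field are hypothetical stand-ins for whatever the api-server's repository model provides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Repository:
    identifier: str
    priority: int  # lower number = higher priority

def resolve_meta(purl: str, repos: list[Repository],
                 fetch: Callable[[Repository, str], Optional[dict]]) -> Optional[dict]:
    """Iterate repositories in priority order and return the first match."""
    for repo in sorted(repos, key=lambda r: r.priority):
        result = fetch(repo, purl)
        if result is not None:
            # Only a single repository is assumed to match, so we stop here.
            return result
    return None
```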

Published Date

  • "Component Age" policies
  • Displaying published date in UI / return in REST API

dtrack.repo-meta-analysis.component

  • Consumer listening for this topic will fetch:
    • Latest Version
    • Published Date
    • Hashes
  • Problems:
    • Latest Version changes over time, published date and hashes (hopefully) do not
      More precise details and flow to be added by @VithikaS

@VithikaS
Collaborator

VithikaS commented Sep 19, 2023

```sql
CREATE TABLE IF NOT EXISTS public."COMPONENT_METADATA"
(
    "ID" bigint NOT NULL,
    "PURL" character varying(1024) NOT NULL,
    "MD5" character varying(1024),
    "SHA1_HASH" character varying(1024),
    "SHA256_HASH" character varying(1024),
    "PUBLISHED_AT" timestamp with time zone,
    "LAST_FETCH" timestamp with time zone,
    "STATUS" character varying(255),
    CONSTRAINT "COMPONENT_METADATA_PK" PRIMARY KEY ("ID")
);
```

STATUS: possible values are PROCESSED and TIMED_OUT; there can be additional values.

  • Create Index on purl, last_fetch, published_at

  • Data model and queries for new table in api-server

  • 2. @sahibamittal Create an Initialiser which will kick off the update of the new table with hash information and the publishedAt date on application startup

Initialiser

  • Select count(ID) from the COMPONENT table. If the count is 0, exit the Initialiser flow. This will only happen when DT is deployed on a fresh DB with no data.

  • Select count(ID) from COMPONENT_METADATA table

  • If count == 0

Insert into the COMPONENT_METADATA table: SELECT DISTINCT purls (and the internal flag) from the COMPONENT table, in one transaction.

TODO: check whether Postgres's COPY command can be used, or whether there is any other better way of doing it.
This assumes that copying purls from the COMPONENT table to the COMPONENT_METADATA table happens in one transaction, so either all required purls are copied or none are. Any count higher than 0 would then mean that purls were already copied over from the COMPONENT table.

SELECT INTO is much faster than INSERT .. SELECT. A hint could be provided to use a table-level lock to improve the performance of INSERT .. SELECT.
INSERT .. SELECT applies a table-level lock on the table the rows are selected from, so executing it in parallel with a BOM upload can have a significant performance impact.

With a small load of ~6k distinct purls, INSERT INTO .. SELECT execution took ~409 ms.
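A sketch of the one-transaction seed insert discussed above; the sequence name is hypothetical, and the column names follow the table definition earlier in this thread:

```sql
BEGIN;
-- Seed COMPONENT_METADATA with every distinct purl in the portfolio.
-- Hashes, PUBLISHED_AT, LAST_FETCH and STATUS start out NULL and are
-- filled in later by the repo-meta analyzer.
INSERT INTO "COMPONENT_METADATA" ("ID", "PURL")
SELECT nextval('"COMPONENT_METADATA_ID_SEQ"'), p."PURL"
FROM (SELECT DISTINCT "PURL" FROM "COMPONENT" WHERE "PURL" IS NOT NULL) AS p;
COMMIT;
```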

COPY is the most optimal way to copy bulk data in Postgres.
Per the Postgres documentation:

Note that loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.

COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway.

However, COPY is used to COPY FROM a file or COPY TO a file, so it may not be the ideal choice here.

  • Fetch components from the table in batches of 5000. We are already fetching pages of components elsewhere, e.g. when performing portfolio repo-meta analysis, so this should be similar or the same.

    The WHERE clause used to fetch components should check that the LAST_FETCH time is either null or more than an hour before the current time, and that the purl's hashes and published date are null as well. Alternatively, this could be checked with the STATUS field.
    LAST_FETCH prevents the same purl from being selected twice in a short duration to fetch metadata.
    The STATUS field will be updated even when we get no data from any of the configured repositories. This prevents resending those purls once all configured repositories have been queried for metadata.

    • Batch-update LAST_FETCH to NOW() in the database for all selected components.
    • Use the existing repo-meta-analysis topic to dispatch the event. We have to send a different command on the topic than the one we send for the version check: the version check is a recurring task because the latest version keeps changing, whereas hashes and publishedAt are static information that should remain the same. We should be able to differentiate which of these operations we want to perform.
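The batch-selection query described above might look like this (status handling is shown as one of the two alternatives mentioned):

```sql
SELECT "ID", "PURL"
FROM "COMPONENT_METADATA"
WHERE ("LAST_FETCH" IS NULL OR "LAST_FETCH" < NOW() - INTERVAL '1 hour')
  AND "MD5" IS NULL
  AND "SHA1_HASH" IS NULL
  AND "SHA256_HASH" IS NULL
  AND "PUBLISHED_AT" IS NULL
  -- alternatively, filter on "STATUS" instead of the NULL checks above
LIMIT 5000;
```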
  • 3. @sahibamittal Repo Meta Analyser: read the command from the dtrack.repo-meta-analysis.component topic to get component metadata. This does not include the latest version.

    • Fetch component metadata from the applicable repository, with the appropriate fallback in order of priority.
    • The result from the Repo Meta Analyser should include the component metadata.
  • 4. Api-server: synchronize Repository Meta Component should update the COMPONENT_METADATA table with the results. If the integrity check is globally enabled, it will be evaluated.

  • When updating the database, batching could be considered to keep it more efficient.

  • We could potentially lose records if we batch and the api-server restarts before the records are committed to the database. Those purls would be picked up again to fetch data, so we would recover eventually, but this results in extra work and should be avoided.

It would be better to use a changelog topic and a state store, so that in case of a restart the in-memory state store is reconstructed from the changelog topic.

  • If we go on to batch records for other database update operations in the api-server, we could consider writing our own Kafka consumer. That would let us commit offsets only after the database changes have been committed. This was discussed and is to be considered separately.

  • 5. @mehab On BOM upload, the api-server sends an event on the Kafka topic only if metadata is not present in the database. Otherwise it sends an event to fetch just the latest version.

    • The analysis result should be updated with the component metadata.
    • The Repo Meta Analyser will receive different commands on the topic depending on whether it needs to check only version information, only metadata, or both.
  • The api-server performs the integrity check if globally enabled.
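The integrity evaluation in step 4 boils down to comparing the hashes supplied in the SBOM with those fetched from the repository. A minimal sketch, with hypothetical function and result names:

```python
from enum import Enum
from typing import Optional

class IntegrityResult(Enum):
    MATCH = "MATCH"
    MISMATCH = "MISMATCH"
    UNKNOWN = "UNKNOWN"  # no algorithm had a hash on both sides

def evaluate_integrity(sbom_hashes: dict, repo_hashes: dict) -> IntegrityResult:
    """Compare hashes per algorithm; any disagreement is a mismatch."""
    compared = False
    for algo in ("md5", "sha1", "sha256", "sha512"):
        ours: Optional[str] = sbom_hashes.get(algo)
        theirs: Optional[str] = repo_hashes.get(algo)
        if ours and theirs:
            compared = True
            if ours.lower() != theirs.lower():
                return IntegrityResult.MISMATCH
    return IntegrityResult.MATCH if compared else IntegrityResult.UNKNOWN
```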

@sahibamittal
Collaborator

sahibamittal commented Sep 27, 2023

Task Highlights

  • Create new table and queries. @sahibamittal
  • Create Initialiser. @sahibamittal
  • Hyades-apiserver changes (bom-upload dispatching event; updating new table after repo-meta analysis results). @mehab
  • Hyades-apiserver changes (perform integrity analysis if enabled and update db). @VithikaS
  • Changes at hyades end (repo-meta analyzer : receiving event, calling repositories, sending data back in repo-meta analysis results). @sahibamittal
  • End-to-end testing locally.
  • New endpoint in apiserver to return integrity analysis data for a component. @mehab
  • Upon deletion of project/component, recursively delete integrity analysis for the component. @VithikaS
  • UI changes -> TBD @sahibamittal

@sahibamittal
Collaborator

sahibamittal commented Oct 2, 2023

AnalysisResult Proto should include fields for Component Metadata

```proto
message AnalysisResult {
  // The component this result is for.
  Component component = 1;

  // Identifier of the repository where the result was found.
  optional string repository = 2;

  // Latest version of the component.
  optional string latest_version = 3;

  // When the latest version was published.
  optional google.protobuf.Timestamp published = 4;

  // Integrity metadata of the component.
  optional IntegrityMeta integrity_meta = 5;
}
```

Logic from the result at the apiserver:

integrityMeta not set → integrity metadata was not fetched.
integrityMeta set && (its hashes || date not set) → integrity metadata was fetched but was not available, or an error occurred.

```proto
message IntegrityMeta {
  optional string md5 = 1;
  optional string sha1 = 2;
  optional string sha256 = 3;
  optional string sha512 = 4;
  // When the component's current version was last modified.
  optional google.protobuf.Timestamp current_version_last_modified = 5;
  // Complete URL to fetch integrity metadata of the component.
  optional string meta_source_url = 6;
}
```
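The decision logic above can be expressed as a small helper; the dict shape and the status strings are hypothetical, standing in for the parsed IntegrityMeta message:

```python
from typing import Optional

def interpret_integrity_meta(meta: Optional[dict]) -> str:
    """Classify an AnalysisResult's integrity metadata per the rules above."""
    if meta is None:
        return "NOT_FETCHED"
    has_hash = any(meta.get(a) for a in ("md5", "sha1", "sha256", "sha512"))
    has_date = meta.get("current_version_last_modified") is not None
    if not has_hash or not has_date:
        # Fetched, but the data was unavailable or an error occurred.
        return "FETCHED_BUT_UNAVAILABLE"
    return "FETCHED"
```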

AnalysisCommand Proto:

```proto
message AnalysisCommand {
  // The component that shall be analyzed.
  Component component = 1;
  bool fetch_integrity_data = 2;
  bool fetch_latest_version = 3;
}
```

Analysis Command notes:

fetch_latest_version flag → map latest_version and latest_version_published.
fetch_integrity_data flag → map current_version_published and hashes info.

  1. initializer → [fetch_integrity_data=true, fetch_latest_version=false]
  2. bom-upload → if integrity data doesn't exist, then [fetch_integrity_data=true, fetch_latest_version=true], else [fetch_integrity_data=false, fetch_latest_version=true]
  3. scheduled analysis → [fetch_integrity_data=false, fetch_latest_version=true]
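The trigger-to-flag mapping above can be sketched as follows; the trigger names and the helper are illustrative, not actual code from the repo:

```python
from dataclasses import dataclass

@dataclass
class AnalysisCommand:
    purl: str
    fetch_integrity_data: bool
    fetch_latest_version: bool

def command_for(trigger: str, purl: str, integrity_data_exists: bool = False) -> AnalysisCommand:
    """Map the three dispatch triggers to AnalysisCommand flags."""
    if trigger == "initializer":
        return AnalysisCommand(purl, fetch_integrity_data=True, fetch_latest_version=False)
    if trigger == "bom-upload":
        # Only fetch integrity data when it is not already in the database.
        return AnalysisCommand(purl, fetch_integrity_data=not integrity_data_exists,
                               fetch_latest_version=True)
    if trigger == "scheduled-analysis":
        return AnalysisCommand(purl, fetch_integrity_data=False, fetch_latest_version=True)
    raise ValueError(f"unknown trigger: {trigger}")
```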

@mehab
Collaborator Author

mehab commented Oct 3, 2023

A point to note is the case where the user changes the repository after having already supplied one for a package type. Currently, if this happens, the projects/components for which integrity information has already been fetched will not be refreshed with the new information. To support this, keeping in mind that we only support one repository at a time for a given package and do not act as a mirror for multiple repositories, we could refactor the initializer code to be triggered whenever the user changes the repository URL, refreshing the information for all existing components.
Newer projects and components would then be filled in by the existing functionality.

@mehab
Collaborator Author

mehab commented Nov 8, 2023

The changes have now been completed, hence closing the issue.

@mehab mehab closed this as completed Nov 8, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Hyades Nov 8, 2023