Create a focused visitor for maven #179

Open
JonoYang opened this issue Aug 30, 2023 · 2 comments

Comments

@JonoYang
Member

In the context of aboutcode-org/scancode.io#900, it would be useful to have a focused visitor for Maven that can visit and index an entire Maven index (Maven Central as well as other repos). This visitor would also be able to visit and index subsets of a repo, instead of collecting the entirety of the index from an arbitrary point.

As we visit, we can create packages with minimal package information (purl + sha1). This would help us identify Maven packages by sha1, which we can then scan in order to index the fingerprints of each package.
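
To make the "minimal package" idea concrete, here is a rough sketch in Python using only the standard library. The names `MinimalPackage`, `maven_purl`, and `index_by_sha1` are hypothetical and are not taken from the project. The purl is built by plain string formatting, which skips the percent-encoding rules of the purl spec; real code would more likely rely on a purl library.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPackage:
    """Hypothetical record holding only what the visitor knows up front."""
    purl: str
    sha1: str


def maven_purl(namespace, name, version):
    # Simplified: assumes the parts need no percent-encoding.
    return f"pkg:maven/{namespace}/{name}@{version}"


def index_by_sha1(packages):
    """Map sha1 -> package so an unknown file can be identified by its hash."""
    return {package.sha1: package for package in packages}


# Stand-in bytes; in practice the sha1 would come from the repo or the artifact.
artifact_bytes = b"not a real jar"
package = MinimalPackage(
    purl=maven_purl("activeio", "activeio", "2.1"),
    sha1=hashlib.sha1(artifact_bytes).hexdigest(),
)
lookup = index_by_sha1([package])
```

Looking up `hashlib.sha1(some_file_bytes).hexdigest()` in `lookup` then tells us which purl to go scan.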

@JonoYang
Member Author

JonoYang commented Sep 5, 2023

After talking with @pombredanne, this is how it would work:

There would be a function that receives a URL and checks whether it is one of these types of Maven repo URLs:

  • If the URL points to the root of a Maven repo, like https://repo1.maven.org/maven2/, then we would index the entire repo.
  • If the URL points to a specific namespace, like https://repo1.maven.org/maven2/org/ or https://repo1.maven.org/maven2/date/iterator/automaton/, then we index all packages under that namespace.
  • If the URL points to a specific package version, like https://repo1.maven.org/maven2/activeio/activeio/2.1/, or to any artifact of it, like https://repo1.maven.org/maven2/activeio/activeio/2.1/activeio-2.1.jar, then we index all artifacts of this version (sources, tests, etc.).
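
The three cases above could be sketched roughly as follows. This is only an illustration: `classify_maven_url` and its parameters are hypothetical names. It assumes the repo root is already known, that directory URLs end with `/` while artifact URLs do not, and that the caller supplies `dir_has_pom`, a callable that reports whether a directory listing contains a `.pom` file (the URL path alone cannot distinguish a namespace from a package version).

```python
ROOT = "root"
NAMESPACE = "namespace"
PACKAGE_VERSION = "package_version"


def classify_maven_url(url, root_url, dir_has_pom):
    """Return (kind, directory_url) for a URL inside the repo at root_url."""
    root_url = root_url.rstrip("/") + "/"
    if url.rstrip("/") + "/" == root_url:
        return ROOT, root_url
    if not url.startswith(root_url):
        raise ValueError(f"{url} is not under {root_url}")
    # Assumption: an artifact URL has no trailing slash, so use its directory.
    directory = url if url.endswith("/") else url.rsplit("/", 1)[0] + "/"
    if dir_has_pom(directory):
        return PACKAGE_VERSION, directory
    return NAMESPACE, directory


# Offline stand-in for a real directory listing check.
def fake_dir_has_pom(directory):
    return directory.endswith("/activeio/activeio/2.1/")


root = "https://repo1.maven.org/maven2/"
results = [
    classify_maven_url(root, root, fake_dir_has_pom),
    classify_maven_url(root + "org/", root, fake_dir_has_pom),
    classify_maven_url(
        root + "activeio/activeio/2.1/activeio-2.1.jar", root, fake_dir_has_pom
    ),
]
```

Passing the listing check in as a callable keeps the classification logic testable without network access.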

We would put these jobs on a queue, like the visitor/mapper or on-demand package queue, where each entry on the queue would point to a particular package version (e.g. https://repo1.maven.org/maven2/activeio/activeio/2.1/), and processing an entry would create packages for all of the artifacts of that version.
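
As a toy illustration of "one queue entry per package version", here is an in-memory stand-in. The existing queues mentioned above are not modeled here; `PackageVersionQueue` is a hypothetical name, and the only behaviors shown are first-in-first-out order and skipping URLs that were already queued.

```python
from collections import deque


class PackageVersionQueue:
    """Hypothetical FIFO queue of package version directory URLs."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, version_url):
        """Queue a version URL once; return False if it was already queued."""
        if version_url in self._seen:
            return False
        self._seen.add(version_url)
        self._pending.append(version_url)
        return True

    def pop(self):
        """Return the oldest pending URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = PackageVersionQueue()
first = queue.add("https://repo1.maven.org/maven2/activeio/activeio/2.1/")
duplicate = queue.add("https://repo1.maven.org/maven2/activeio/activeio/2.1/")
```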

@pombredanne
Member

Some extra comments:

  • We would need a way to determine the root URL of a repo. The root is defined by the presence of archetype-catalog.xml, as in https://repo.maven.apache.org/maven2/archetype-catalog.xml. Given a URL, the root could be found by looking upwards until we find archetype-catalog.xml, and then cached to avoid making many network calls. The root URL is needed to determine where the namespace/groupid starts in the URL. See also https://maven.apache.org/repository/layout.html

  • The presence of maven-metadata.xml is a good clue that we are at a page listing versions, as in https://repo.maven.apache.org/maven2/adarwin/adarwin/maven-metadata.xml. The exceptions are when the page lists no directory or when it contains a .pom; in either case the maven-metadata.xml can be ignored.

  • The presence of a .pom file means we have reached a terminal directory, so crawling further down is not needed.
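
The heuristics above could be sketched as below. This is a sketch under assumptions, not the project's implementation: `find_maven_root`, `candidate_roots`, and `classify_listing` are hypothetical names, `url_exists` is a caller-supplied callable (so no particular HTTP client is assumed), and listing entries are assumed to be plain names where directories end with `/`.

```python
from urllib.parse import urlparse, urlunparse

_root_cache = {}  # directory URL -> repo root URL


def candidate_roots(url):
    """Yield the directory of url, then each parent directory, deepest first."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]  # drop the file name
    for end in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:end])
        if not path.endswith("/"):
            path += "/"
        yield urlunparse(parts._replace(path=path, query="", fragment=""))


def find_maven_root(url, url_exists):
    """Walk upwards until archetype-catalog.xml is found; cache the answer."""
    candidates = list(candidate_roots(url))
    for candidate in candidates:
        if candidate in _root_cache:
            return _root_cache[candidate]
    for position, candidate in enumerate(candidates):
        if url_exists(candidate + "archetype-catalog.xml"):
            for visited in candidates[: position + 1]:
                _root_cache[visited] = candidate
            return candidate
    return None


def classify_listing(names):
    """Apply the .pom and maven-metadata.xml clues to one directory listing."""
    has_pom = any(name.endswith(".pom") for name in names)
    has_dirs = any(name.endswith("/") for name in names)
    if has_pom:
        return "package_version"  # terminal directory, stop crawling
    if "maven-metadata.xml" in names and has_dirs:
        return "version_listing"
    return "namespace"


# Offline stand-in that records how many lookups were made.
calls = []


def fake_url_exists(candidate_url):
    calls.append(candidate_url)
    return candidate_url == "https://repo.maven.apache.org/maven2/archetype-catalog.xml"


root = find_maven_root(
    "https://repo.maven.apache.org/maven2/adarwin/adarwin/maven-metadata.xml",
    fake_url_exists,
)
calls_after_first = len(calls)
cached_root = find_maven_root(
    "https://repo.maven.apache.org/maven2/adarwin/", fake_url_exists
)
```

The second lookup is answered from the cache, so it adds no entries to `calls`.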

JonoYang added a commit that referenced this issue Sep 6, 2023
JonoYang added a commit that referenced this issue Sep 12, 2023
JonoYang added a commit that referenced this issue Sep 12, 2023
JonoYang added a commit that referenced this issue Sep 12, 2023
JonoYang added a commit that referenced this issue Sep 12, 2023
JonoYang added a commit that referenced this issue Sep 19, 2023
    * This is to avoid using get_maven_root repeatedly
    * Save versionless purl to importable_uris

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 20, 2023
JonoYang added a commit that referenced this issue Sep 22, 2023
JonoYang added a commit that referenced this issue Sep 27, 2023
JonoYang added a commit that referenced this issue Sep 27, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 28, 2023
    * Get links and timestamps at the same time
    * Create command that gets release_date for maven packages

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 29, 2023
JonoYang added a commit that referenced this issue Sep 29, 2023
    * Add logging messages

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 29, 2023
    * Only update release_date for packages from maven.org

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 29, 2023
    * Only update release_date for packages from maven.org

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Sep 29, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
    * This is to avoid using get_maven_root repeatedly
    * Save versionless purl to importable_uris

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Oct 11, 2023
    * Get links and timestamps at the same time
    * Create command that gets release_date for maven packages

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Oct 11, 2023
JonoYang added a commit that referenced this issue Oct 11, 2023
    * Add logging messages

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Oct 11, 2023
    * Only update release_date for packages from maven.org

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Oct 11, 2023