Create indexing queue to fetch, scan and index existing purldb packages #14

Closed
pombredanne opened this issue Dec 8, 2022 · 4 comments

Comments

@pombredanne
Member

No description provided.

@rakeshhotker

Could you describe the issue in more detail? My understanding is that you want to create something similar to an indexed priority queue data structure for purldb packages.

@Chaitanya674

@pombredanne can you please provide more detail? I have set up the environment on my local system, but I need more information to understand this. I want to contribute to this project.

@JonoYang
Member

The current scan queue works as follows:

  • A ScannableURI is created for every Package that is mapped by run_map
  • The management command request_scans sends a scan request to a scancode.io instance for every ScannableURI
    • Only three requests are sent at once to scancode.io
    • Scancode.io can only scan one package at a time
  • The management command process_scans checks on the scan requests to see if they have finished
  • If they have, the results of the scan are retrieved and the Resource data and directory fingerprints are indexed
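
To make the flow above concrete, here is a rough in-memory sketch of the push model. It is not the purldb code: the `Status` values, the `ScannableURI` dataclass, and the function bodies are stand-ins invented for illustration, and only the names `request_scans` and `process_scans` and the limit of three requests come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical states; the real model may use different ones.
    NEW = "new"
    SUBMITTED = "submitted"
    INDEXED = "indexed"


@dataclass
class ScannableURI:
    # Simplified stand-in for the real ScannableURI model.
    uri: str
    status: Status = Status.NEW


MAX_IN_FLIGHT = 3  # "Only three requests are sent at once"


def request_scans(queue):
    """Submit NEW entries until MAX_IN_FLIGHT requests are outstanding."""
    in_flight = sum(1 for s in queue if s.status is Status.SUBMITTED)
    submitted = []
    for s in queue:
        if in_flight >= MAX_IN_FLIGHT:
            break
        if s.status is Status.NEW:
            s.status = Status.SUBMITTED  # stands in for the HTTP request
            in_flight += 1
            submitted.append(s.uri)
    return submitted


def process_scans(queue, finished_uris):
    """Index the results of submitted scans that have finished."""
    indexed = []
    for s in queue:
        if s.status is Status.SUBMITTED and s.uri in finished_uris:
            s.status = Status.INDEXED  # stands in for retrieving and indexing
            indexed.append(s.uri)
    return indexed
```

In this sketch a new request only goes out once `process_scans` has cleared a finished one, which is how the single scancode.io instance ends up as the bottleneck.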

The issue with this setup is that we can only use a single scancode.io instance, which limits us to scanning and indexing packages one at a time. The queue needs to change such that scancode.io is the one polling the scan queue for packages that need to be scanned, rather than the scan queue requesting packages to be scanned.

Some initial features:

  • Expose ScannableURIs as an API endpoint
  • Update the API to accept POST requests for particular ScannableURIs. When a scancode.io instance picks a ScannableURI to process, it sends a POST to update that ScannableURI's status, so that no other scancode.io instance chooses it for processing
  • When a scancode.io instance is done with a scan, allow it to send the results back through a POST request
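
The features above amount to a claim-then-report protocol. A minimal in-memory sketch of that idea follows, assuming a lock-protected claim step in place of the real API and database. `ScanQueue`, `claim_next`, `submit_results`, the status strings, and the worker IDs are all hypothetical names for illustration, not existing purldb endpoints.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScannableURI:
    # Simplified stand-in for the real ScannableURI model.
    uri: str
    status: str = "new"  # hypothetical states: new, in_progress, scanned
    claimed_by: Optional[str] = None
    results: Optional[dict] = None


class ScanQueue:
    """Pull-based queue: workers claim entries instead of being sent them."""

    def __init__(self, uris):
        self._items = {u: ScannableURI(u) for u in uris}
        self._lock = threading.Lock()

    def claim_next(self, worker_id):
        """Mark the first unclaimed entry as in progress and return its URI.

        The lock makes check-and-update atomic, so two workers cannot
        claim the same entry. Returns None when nothing is left.
        """
        with self._lock:
            for item in self._items.values():
                if item.status == "new":
                    item.status = "in_progress"
                    item.claimed_by = worker_id
                    return item.uri
        return None

    def submit_results(self, worker_id, uri, results):
        """Accept results only from the worker that claimed the entry."""
        with self._lock:
            item = self._items.get(uri)
            if (
                item is None
                or item.status != "in_progress"
                or item.claimed_by != worker_id
            ):
                return False
            item.status = "scanned"
            item.results = results
            return True

    def status_of(self, uri):
        return self._items[uri].status
```

In a real deployment the claim would presumably be a POST handled inside a database transaction rather than a Python lock, but the shape is the same: any number of scancode.io instances can poll, and each entry is handed to exactly one of them.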

@JonoYang
Member

We have something like this now that we have the collect/index_packages/ endpoint (https://github.com/nexB/purldb/blob/main/packagedb/api.py#L568).
