Continuously watch a PURL of interest, aka. purlwatch #244

Closed
7 tasks done
pombredanne opened this issue Dec 13, 2023 · 5 comments

@pombredanne
Member

pombredanne commented Dec 13, 2023

Given a PURL, I would like to have a new API endpoint to register "interest" in this PURL. Once "registered", this PURL should be continuously "watched" for updates to this Package URL (all versions), polling for new versions on a schedule, such as daily or weekly.
When a new version is discovered, we should run the steps to collect metadata and trigger new scancode pipeline runs on each update.
Optionally we could also work from a PURL and have a flag to expand to either the previous and next versions, or to all versions of a PURL, or all the future versions.

The solution could include these elements:

@pombredanne
Member Author

Carried over from #88 closed in favor of this one:

I would like to ensure that PurlDB has up-to-date packages for my projects.
For this I would like to somehow register interest in some package version, and PurlDB would fetch newer versions of these packages on a regular basis, and would also fetch newer versions of the whole dependency set of these packages, all the way down. 🐢🐢🐢🐢🐢🐢🐢

See also #87

Other things to consider carried over from #64 :

  • Trigger mining, matching, scanning of the dependency trees, possibly with priority for runtime dependencies over others
  • Trigger mining, matching, scanning for newer versions of packages that were matched
  • Trigger mining, matching, scanning for older versions of packages that were matched
  • Create a watcher that looks periodically for new versions of a package that was requested, maybe by registering interest in this package

We have ways to trigger the indexing of a package version range now, but nothing specifically to do anything periodically.
Some inspiration:

  • Debian watch files
  • Fedora Bodhi

@pombredanne
Member Author

pombredanne commented Jan 10, 2024

In the future, we could refine the watch to only look for versions after a given version or release date, to avoid collecting a bazillion old, historical, unused versions.

This could take the form of:

  • a date field that would be the earliest_release_date to filter out older versions
  • a minimum version field that would be the minimum_version to filter out older versions, with the caveat that popular packages can have v3.1 released after v4.0, e.g., multiple release "lines" active at a time.
  • only consider versions created since the last watch, and always set the last watch to a default value that could be today or in the recent past.
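
As a rough illustration of the first two filters, here is a minimal Python sketch. The `Release` record, `parse_version()` and `filter_releases()` are hypothetical names that are not part of PurlDB, and the dotted-integer version parsing is a simplifying assumption; real ecosystems each need their own version comparison rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Release:
    # Hypothetical record for one discovered package version.
    version: str
    release_date: date


def parse_version(version: str) -> Tuple[int, ...]:
    # Naive dotted-integer parsing, for illustration only.
    return tuple(int(part) for part in version.split("."))


def filter_releases(
    releases: Iterable[Release],
    earliest_release_date: Optional[date] = None,
    minimum_version: Optional[str] = None,
) -> List[Release]:
    """Keep only the releases that pass the optional date and version filters."""
    kept = []
    for release in releases:
        if earliest_release_date and release.release_date < earliest_release_date:
            continue  # released before the cut-off date
        if minimum_version and parse_version(release.version) < parse_version(minimum_version):
            continue  # older than the minimum version
        kept.append(release)
    return kept
```

With releases 3.0 (2020), 4.0 (mid 2023) and 3.1 (late 2023), a date filter of 2023-01-01 keeps both 4.0 and 3.1, while a minimum_version of 4.0 drops 3.1 even though it was released later. This is the "multiple release lines" caveat mentioned above.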

@pombredanne
Member Author

There is a design that needs some thinking: How do we select which PackageWatch to process?

  1. We could store the next watch date. This would need to be done in save() and is problematic if there is any exception. But afterwards it is simple to filter only the records that have a next watch date less than or equal to today.

  2. We could use a DateTimeField and a DurationField and do a date computation in the queryset filter as an expression, but this is not portable beyond PostgreSQL per https://docs.djangoproject.com/en/5.0/ref/models/fields/#durationfield

  3. We could not filter on dates but instead compute in a loop whether a watch is eligible for a run. The watch interval would be stored as a plain integer in days. We would loop this way:

  • we do a simple filter on active watches.
  • then we iterate, and for each watch we compute the next watch date as: last watch date + watch interval, and skip the record if this is not less than or equal to now.

This would be portable beyond PostgreSQL but requires more processing, as most watches would need to be looked at once on each run, and it would not use a queryset expression.

Here the last watch date is a DateTimeField and the watch_interval would be a number of days between watches.

  4. We could store instead a POSIX timestamp as an integer for the last watch date and store a number of seconds in an integer as the watch interval (but no less than one hour or 3600 seconds) and use an arithmetic expression in the queryset. This would be like 2. but portable beyond PostgreSQL. But display would need to convert the timestamp value back to a datetime.

Let's go with 3. to start, as we can easily migrate to other options afterwards: this is the most expressive and easiest to test.
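
A minimal sketch of option 3 in plain Python could look like the following. The `PackageWatch` dataclass is a stand-in for the Django model, and its field names are assumptions taken from this discussion, not the actual implementation. Treating a watch that has never run as due is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class PackageWatch:
    # Stand-in for the model; field names are assumptions.
    package_url: str
    is_active: bool = True
    watch_interval: int = 7  # days between watches
    last_watch_date: Optional[datetime] = None


def is_due(watch: PackageWatch, now: datetime) -> bool:
    """Return True if the next watch date is less than or equal to now."""
    if watch.last_watch_date is None:
        # Assumption: a watch that never ran is due immediately.
        return True
    next_watch_date = watch.last_watch_date + timedelta(days=watch.watch_interval)
    return next_watch_date <= now


def select_due_watches(
    watches: Iterable[PackageWatch], now: Optional[datetime] = None
) -> List[PackageWatch]:
    """Filter on active watches first, then compute eligibility in a loop."""
    now = now or datetime.now(timezone.utc)
    active = [watch for watch in watches if watch.is_active]
    return [watch for watch in active if is_due(watch, now)]
```

Passing `now` explicitly keeps the selection deterministic in tests, which is part of why this option is easy to test.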

@pombredanne
Member Author

Another design point is the processing:

The thing could start with either a cron-like task in RQ or a command line management command.
This could be a function that can be re-used in both. A command line would be helpful for testing, and a task would be fine for production. This could run daily for a start (but it could run more than daily too).

In all cases, this would select packages eligible for watch per #244 (comment) and then could either:

  1. create one RQ task for each package to watch
  2. OR, create one PriorityQueue entry for each package to watch
  3. OR, in the future we could decide to batch check for the new versions of a whole ecosystem at once, possibly excluding the things not watched.

Using 2. is probably best for now to avoid duplicated entries.
We should also ensure that there is no pre-existing queue entry for this watch. This could be done by querying the queue first, and/or by checking during the watch run that we have not already watched within the watch interval.
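
The two duplicate checks can be sketched as follows. `PendingWatchQueue` is a hypothetical in-memory stand-in for the real queue, and `watched_within_interval()` is an illustrative helper; neither is an actual PurlDB or RQ API.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


class PendingWatchQueue:
    """Hypothetical in-memory queue holding at most one entry per PURL."""

    def __init__(self) -> None:
        self._pending: Dict[str, datetime] = {}

    def add(self, package_url: str, now: datetime) -> bool:
        """Queue a watch unless an entry already exists; return True if added."""
        if package_url in self._pending:
            return False  # pre-existing entry: do not duplicate
        self._pending[package_url] = now
        return True

    def done(self, package_url: str) -> None:
        """Remove the entry once the watch has run."""
        self._pending.pop(package_url, None)

    def __len__(self) -> int:
        return len(self._pending)


def watched_within_interval(
    last_watch_date: Optional[datetime], watch_interval_days: int, now: datetime
) -> bool:
    """Second guard, applied during the run: True if the watch already ran recently."""
    if last_watch_date is None:
        return False
    return now - last_watch_date < timedelta(days=watch_interval_days)
```

Using both guards means a duplicate that slips into the queue is still skipped when the run starts.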

JonoYang added a commit that referenced this issue Jan 30, 2024
Watch for packages (model and implementation) #244
@keshav-space keshav-space moved this from In review to Done in 05-purl2all - PURL services Feb 9, 2024
@pombredanne
Member Author

pombredanne commented Apr 3, 2024

This is completed now.

We extended PurlDB with a new “watch” API endpoint. Given a PURL, you can register "interest" in this PURL and continuously "watch" for updates to this Package URL (ignoring versions), polling for new versions on a schedule based on an interval defined per watch.

To back this, we integrated a queue and scheduler (using RQ). When the watch runs on schedule and a new package version is available, we collect its metadata and trigger new indexing scans in the background.

To test this feature:

  • Locally:
  1. Install PurlDB
    or request access to a private demo instance at https://private.purldb.io/api

  2. CLI-based watch: From the command line, in the activated virtualenv, run the command to create a new watch for a PURL, such as
    ./manage_purldb.py watch_packages --purl pkg:maven/org.apache.logging.log4j/log4j

  3. Go to https:///api/watch/ to view the existing watched packages and also see the results of indexing this PURL

  • Remotely:
    Alternatively, since we also have an endpoint for watch, you can simply add the new PURL at https://private.purldb.io/api/watch (access on request).
    Once a new PURL is added, a watch is scheduled immediately, and thereafter it is watched periodically per the specified watch interval, or every 7 days by default. You can view the details of a watch at https://private.purldb.io/api/watch/{PURL}, where you can find the last watch date, errors and so on.
    Note that for every new watch, an immediate watch is automatically triggered. For an already existing watch, we can only change the watch interval, and it cannot be less than 1 day. If a PURL does not already exist in PurlDB, then all the versions of that PURL are indexed as part of the watch feature.
    Specifically the API behaves this way:
{
    "package_url": "pkg:pypi/fetchcode",
    "depth": 3,
    "watch_interval": 7,
    "is_active": true
}

You may want to use another PURL, as this one may already be watched.
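
For scripting this, a request with the payload above could be prepared roughly as follows. This is a sketch under assumptions: the use of POST, the JSON content type and the trailing slash on the endpoint are guesses, `build_watch_request()` is a hypothetical helper, and authentication is left out since access is on request.

```python
import json
import urllib.request

# Endpoint taken from the comment above; the trailing slash is an assumption.
WATCH_ENDPOINT = "https://private.purldb.io/api/watch/"


def build_watch_request(
    package_url: str,
    depth: int = 3,
    watch_interval: int = 7,
    is_active: bool = True,
    endpoint: str = WATCH_ENDPOINT,
) -> urllib.request.Request:
    """Build (but do not send) a request that registers a watch for a PURL."""
    if watch_interval < 1:
        # The watch interval cannot be less than 1 day.
        raise ValueError("watch_interval must be at least 1 day")
    payload = {
        "package_url": package_url,
        "depth": depth,
        "watch_interval": watch_interval,
        "is_active": is_active,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: watches are created with POST
    )
```

The prepared request can then be sent with `urllib.request.urlopen(request)` once credentials for the instance are added.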
