fix(rubygems): Ensure consistency between versions and metadata #25127
Please add a readme.md to this datasource folder describing how the caching works, particularly for rubygems.org. Either that or much more commented code inline
Here is the state diagram I'm about to place in the doc.
stateDiagram-v2
[*] --> Empty
state "Empty" as Empty
Empty --> FullSync: getPkgReleases()
state "Synced" as Synced
Synced --> DeltaSync
state "Unsupported" as Unsupported
Unsupported --> [*]
state "Full sync" as FullSync : GET /versions (~20 MB)
state full_sync_result <<choice>>
FullSync --> full_sync_result: Response
full_sync_result --> Synced: (1) Status 200
full_sync_result --> Unsupported: (2) Status 404
full_sync_result --> Empty: (3) Status other than 200 or 404\n Clear cache and throw ExternalHostError
state "Delta sync" as DeltaSync: GET /versions with "Range" header
state delta_sync_result <<choice>>
DeltaSync --> delta_sync_result: Successful response
delta_sync_result --> Synced: (1) Status other than 206\nFull data is received, extract and replace old cache\n (as if it is the full sync)
delta_sync_result --> FullSync: (2) The head of response doesn't match\n the tail of the previously fetched data
delta_sync_result --> Synced: (3) The head of response matches\n the tail of the previously fetched data
state delta_sync_error <<choice>>
DeltaSync --> delta_sync_error: Error response
delta_sync_error --> FullSync: (1) Status 416 should not happen\nbut moves to full sync
delta_sync_error --> Unsupported: (2) Status 404
delta_sync_error --> Empty: (3) Status other than 404 or 416
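For readers who prefer code to diagrams, the transitions above can be sketched as two pure functions. This is only an illustration, not the datasource's actual code: the `SyncState` type and the names `afterFullSync` and `afterDeltaSync` are hypothetical, and treating any 2xx status as a "successful response" is an assumption.

```typescript
// Hypothetical state names mirroring the diagram; not taken from the real code.
type SyncState = 'Empty' | 'FullSync' | 'DeltaSync' | 'Synced' | 'Unsupported';

// Outcome of a full sync (GET /versions without a "Range" header).
function afterFullSync(status: number): SyncState {
  if (status === 200) {
    return 'Synced';
  }
  if (status === 404) {
    return 'Unsupported';
  }
  // Any other status: the caller clears the cache and throws ExternalHostError.
  return 'Empty';
}

// Outcome of a delta sync (GET /versions with a "Range" header).
// `headMatchesTail` says whether the head of the response equals the tail
// of the previously fetched data.
function afterDeltaSync(status: number, headMatchesTail: boolean): SyncState {
  // Assumption: "successful response" means any 2xx status.
  const successful = status >= 200 && status < 300;
  if (successful) {
    if (status !== 206) {
      // Full data was received; it replaces the old cache as in a full sync.
      return 'Synced';
    }
    return headMatchesTail ? 'Synced' : 'FullSync';
  }
  if (status === 416) {
    // Should not happen, but falls back to a full sync.
    return 'FullSync';
  }
  if (status === 404) {
    return 'Unsupported';
  }
  return 'Empty';
}
```

Keeping the decision logic free of I/O like this would also make each numbered branch of the diagram easy to cover with a unit test.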
Co-authored-by: Michael Kriese <[email protected]>
still an open question
@zharinov so how many cache layers do we have for rubygems.org? E.g. how is /versions cached? And do we cache per-version too? How is the data in /versions joined with the API data?
@rarkins Answered you in the readme
When I run this locally, it appears to do a full
Data from
Is there any "hot" cache per package? E.g. if I have a repo with one Ruby dependency, and I run on that repo two times in a row, I wouldn't expect to see any datasource lookups at all, just the package cache.
It should query
But why do we query
Remember that the hosted app does one repo per run so will never have
Just to clarify, Rubygems in-memory data is stored at the module level, so it isn't being reset between repo runs, like memory cache from
I understand, but in the hosted app we have one run per job, so that doesn't help
Okay, in this case yes. I didn't like the current decision anyway, but it is what we have done historically up to this moment.
I'll create a new issue for this part then
🎉 This PR is included in version 37.36.1 🎉 The release is available on:
Your semantic-release bot 📦🚀
…vatebot#25127) Co-authored-by: Michael Kriese <[email protected]>
Changes

Ensure `/versions` endpoint data is always consistent with `/api/v1/gems` responses; otherwise, fall back to results containing only the `version` field.

This is important because `/versions` could contain fresh data while the `/api/v1/gems` endpoint may still return older data for a short amount of time. If we cache the data at this moment, we risk storing inconsistent data for a very long period of time.

To solve this, we hash the list of versions from the `/versions` endpoint, and if it has changed, we invalidate the cache.

The key point of this PR: when persisting the cache for the long term, we don't use the previously calculated hash from the `/versions` endpoint; we calculate it based on the `/api/v1/versions` response (which we are about to cache). This should make both cache layers consistent.
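A rough sketch of the hashing idea described above, under loud assumptions: `Release`, `CachedReleases`, `hashVersions`, `buildCacheEntry`, `isCacheFresh` and `resolveReleases` are hypothetical names, the FNV-1a hash is a dependency-free stand-in for whatever hash the real code uses, and sorting before hashing (to ignore ordering) is also an assumption.

```typescript
// Hypothetical shapes for illustration only.
interface Release {
  version: string;
  releaseTimestamp?: string;
}

interface CachedReleases {
  versionsHash: string;
  releases: Release[];
}

// Stand-in hash (32-bit FNV-1a) over the sorted version list.
function hashVersions(versions: string[]): string {
  const input = versions.slice().sort().join(',');
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// The stored hash is derived from the API response being cached,
// not from the /versions data seen earlier.
function buildCacheEntry(apiReleases: Release[]): CachedReleases {
  return {
    versionsHash: hashVersions(apiReleases.map((r) => r.version)),
    releases: apiReleases,
  };
}

// A cached entry is reusable only while its hash matches the /versions list.
function isCacheFresh(
  cached: CachedReleases | undefined,
  versionsFromIndex: string[],
): boolean {
  return (
    cached !== undefined &&
    cached.versionsHash === hashVersions(versionsFromIndex)
  );
}

// If the API response lags behind /versions, fall back to version-only results.
function resolveReleases(
  versionsFromIndex: string[],
  apiReleases: Release[],
): Release[] {
  const entry = buildCacheEntry(apiReleases);
  if (isCacheFresh(entry, versionsFromIndex)) {
    return apiReleases;
  }
  return versionsFromIndex.map((version) => ({ version }));
}
```

Because the entry's hash comes from the API response itself, a stale response cached during the inconsistency window would stop matching `/versions` on the next lookup and be invalidated rather than served for a long time.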