Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rubygems): Ensure consistency between versions and metadata #25127

Merged
merged 24 commits into from
Oct 30, 2023

Conversation

zharinov
Copy link
Collaborator

@zharinov zharinov commented Oct 10, 2023

Changes

Ensure /versions endpoint data is always consistent with /api/v1/gems responses, otherwise fallback to results containing only version field.

This is important, because /versions could contain fresh data, while /api/v1/gems endpoint still may return older data for some short amount of time. If we cache the data at this moment, we're are risking to store inconsistent data for very long period of time.

To solve this, we hash list of versions from the /versions endpoint, and if it has changed, we invalidate the cache.

The key point of this PR: when persisting the cache for long term, we don't use previously calculated hash from the /versions endpoint, we calculate it based on the /api/v1/versions response (which we are about to cache). This should make both cache layers consistent.

Context

Documentation (please check one with an [x])

  • I have updated the documentation, or
  • No documentation update is required

How I've tested my work (please select one)

I have verified these changes via:

  • Code inspection only, or
  • Newly added/modified unit tests, or
  • No unit tests but ran on a real repository, or
  • Both unit tests + ran on a real repository

@zharinov zharinov requested review from viceice and rarkins October 10, 2023 20:52
@zharinov zharinov changed the title fix(rubygems): Handle hash mismatch for cached results fix(rubygems): Ensure consistency between versions and timetsamp metadata Oct 11, 2023
@zharinov zharinov changed the title fix(rubygems): Ensure consistency between versions and timetsamp metadata fix(rubygems): Ensure consistency between versions and metadata Oct 11, 2023
Copy link
Collaborator

@rarkins rarkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a readme.md to this datasource folder describing how the caching works, particularly for rubygems.org. Either that or much more commented code inline

@zharinov
Copy link
Collaborator Author

Here is the state diagram I'm about to place in the doc.
It isn't as visually appealing as I want it to be, but at least we have diagram form for this complex logic, instead of just text description.

stateDiagram-v2
  [*] --> Empty

  state "Empty" as Empty
  Empty --> FullSync: getPkgReleases()

  state "Synced" as Synced
  Synced --> DeltaSync

  state "Unsupported" as Unsupported
  Unsupported --> [*]

  state "Full sync" as FullSync : GET /versions (~20Mb)
  state full_sync_result <<choice>>
  FullSync --> full_sync_result: Response
  full_sync_result --> Synced: (1) Status 200
  full_sync_result --> Unsupported: (2) Status 404
  full_sync_result --> Empty: (3) Status other than 200 or 404\n Clear cache and throw ExternalHostError

  state "Delta sync" as DeltaSync: GET /versions with "Range" header
  state delta_sync_result <<choice>>
  DeltaSync --> delta_sync_result: Successful response
  delta_sync_result --> Synced: (1) Status other than 206\nFull data is received, extract and replace old cache\n (as if it is the full sync)
  delta_sync_result --> FullSync: (2) The head of response doesn't match\n the tail of the previously fetched data
  delta_sync_result --> Synced: (3) The head of response matches\n the tail of the previously fetched data

  state delta_sync_error <<choice>>
  DeltaSync --> delta_sync_error: Error response
  delta_sync_error --> FullSync: (1) Status 416 should not happen\nbut moves to full sync
  delta_sync_error --> Unsupported: (2) Status 404
  delta_sync_error --> Empty: (3) Status other than 404 or 416
Loading

@zharinov zharinov requested a review from viceice October 16, 2023 23:17
Copy link
Member

@viceice viceice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still an open question

@rarkins rarkins requested review from viceice and rarkins October 20, 2023 12:45
@zharinov zharinov requested review from viceice and removed request for viceice October 20, 2023 18:06
@rarkins
Copy link
Collaborator

rarkins commented Oct 21, 2023

@zharinov so how many cache layers do we have for rubygems.org? Eg how is the /versions cached? And do we cache per-version too? How is the data in /versions joined with the API data?

@zharinov
Copy link
Collaborator Author

@rarkins Answered you in the readme

@zharinov zharinov requested review from rarkins and viceice October 25, 2023 17:10
@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

When I run this locally, it appears to do a full /versions sync/download every run. i.e. if I run it against one repo over and over, it's downloading the full /versions every time without any caching. Is that both the current as well as new behavior, or did it change?

@zharinov
Copy link
Collaborator Author

Data from /versions always has been stored in-memory

@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

Is there any "hot" cache per package? e.g. if I have a repo with one ruby dependency, and I run on that repo two times in a row, I wouldn't expect to see any datasource lookups at all - just the package cache.

@zharinov
Copy link
Collaborator Author

It should query /versions endpoint 1 time and package-related endpoint 1 time, if two runs are in the same node process

@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

But why do we query /versions at all if we have a <1 minute "hot" cache of the exact dependency we are looking up?

@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

Remember that the hosted app does one repo per run so will never have /versions cached between jobs, meaning it downloads the entire /versions unnecessarily many times. It would be better if we consult the package cache for each package we look up and return that result without downloading or syncing /versions. But should it be a separate issue/PR?

@zharinov
Copy link
Collaborator Author

Just to clarify, Rubygems in-memory data is stored at the module level, so it isn't being reset between repo runs, like memory cache from util folder does. I.e. it exists during Node process run.

@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

I understand, but in the hosted app we have one run per job, so that doesn't help

@zharinov
Copy link
Collaborator Author

Okay, in this case yes. I didn't like current decision anyway, but it was what we did historically up to this moment.

@rarkins
Copy link
Collaborator

rarkins commented Oct 26, 2023

I'll create a new issue for this part then

@rarkins rarkins added this pull request to the merge queue Oct 30, 2023
Merged via the queue into renovatebot:main with commit bb0a2d3 Oct 30, 2023
34 checks passed
@rarkins rarkins deleted the fix/rubygems-inconsistent-cache branch October 30, 2023 13:45
@renovate-release
Copy link
Collaborator

🎉 This PR is included in version 37.36.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

jon4hz pushed a commit to jon4hz/renovate that referenced this pull request Nov 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rubygems cache: check package cache first
4 participants