Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix package repo uniqueness #922

Merged
merged 2 commits into from
Nov 16, 2023
Merged

Conversation

quba42
Copy link
Collaborator

@quba42 quba42 commented Oct 26, 2023

No description provided.

@quba42 quba42 force-pushed the fix_package_repo_uniqueness branch from 20a64c7 to 45098d9 Compare October 30, 2023 15:38
@quba42 quba42 linked an issue Oct 30, 2023 that may be closed by this pull request
@quba42 quba42 force-pushed the fix_package_repo_uniqueness branch from 45098d9 to 44ca24f Compare October 31, 2023 13:39
@quba42
Copy link
Collaborator Author

quba42 commented Oct 31, 2023

I have now tested that this code works for at least some of the cases that prompted this change.
I will now start working on systematic test coverage. In particular we need to cover the following cases:

  • Sync and/or upload of a package set that contains forbidden duplicates => Must raise ValueError
  • Sync and/or upload of a package set that contains allowed duplicates => Must complete and must not discard any duplicates
  • Sync and/or upload of a package that collides with a already present package => previously added colliding package must be removed from the repo version (along with any PRC that reference it).
  • Sync and/or upload of a package that is a allowed duplicate of a already present package => both packages must remain.


for content_dict in batch:
sha256 = content_dict.pop("sha256")
item_query = models.Q(**content_dict) & ~models.Q(sha256=sha256)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most likely the critical line for performance. This change must not cause a significant performance degradation!

@daviddavis
Copy link
Contributor

daviddavis commented Nov 1, 2023

I'm grateful that you tested out the use case where a package replaces an existing existing package and removes the PackageReleaseComponent. Thank you for that.

It sounds like the issue is that you have two packages with the same NVA but different paths. Is it not possible to solve this problem by adding relative_path to repo_key_fields?

@quba42
Copy link
Collaborator Author

quba42 commented Nov 2, 2023

It sounds like the issue is that you have two packages with the same NVA but different paths. Is it not possible to solve this problem by adding relative_path to repo_key_fields?

The problem is, that a duplicate package (same package, version, architecture fields, but different relative_path) are only allowed, if they have an identical sha256, that is, they really are the same package, stored in two different pool folder locations (for "reasons"). If they have differing sha256, they are not allowed to exist in the same repo twice. The spec is pretty clear on that. So the key difference between what I built here, and the pulpcore's remove_duplicates is the following line:

item_query = models.Q(**content_dict) & ~models.Q(sha256=sha256)  # For pulp_deb vs.:
item_query = Q(**content_dict) # In remove_duplicates from pulpcore

The same issue exists for pulpcore's validate_duplicate_content (used for checking the incoming content for duplicates). Some time ago I dealt with that part of the problem by not using validate_duplicate_content for pulp_deb, because it was preventing synchronization of clearly legal repositories. With hindsight, it would have been better if I had used that occasion to fully understand the problem (rather than trial and error fixing the symptoms). Not using validate_duplicate_content at all solved one problem but caused/masked others. This PR essentially adds a deb specific version of validate_duplicate_content back in. So in that way this PR makes duplicate checking more strict.

@quba42
Copy link
Collaborator Author

quba42 commented Nov 2, 2023

@daviddavis I have been wondering if there is a case to be made, to be stricter for uploads than for sync. For example: It is legal for the pool folder to contain an identical copy of some package in two places, but I would not exactly call it good practice. For the sync we must be able to sync such legal repositories. However, we could choose to protect our upload users from shooting themselves in the foot, by making this forbidden for uploads. Of course, we also support mixing uploaded and synced content in a single Pulp repository so such discrimination could quickly get complicated. For now I am leaning towards not doing anything like this, but I would be interested to hear your thoughts.

@daviddavis
Copy link
Contributor

@quba42 in terms of our use case, we don't allow users to set the relative_path when they upload. So it's not a problem for us I don't think. But I think being stricter for uploads makes sense. And I think postponing that work until it becomes an issue also makes sense.

@quba42 quba42 added .bugfix CHANGES/<issue_number>.bugfix Tests labels Nov 13, 2023
closes pulp#921

As a result of the behaviour fixes, we can also drop the now superfluous
duplicate distribution checking. We are using validate_duplicate_content
to catch incoming duplicates, and we are removing old duplicates, so it
is not possible to run into this error. As a result it is not worth the
performance cost to check for it.
@quba42 quba42 marked this pull request as ready for review November 15, 2023 14:03
@quba42 quba42 merged commit 7ede5be into pulp:main Nov 16, 2023
8 checks passed
Copy link

patchback bot commented Nov 16, 2023

Backport to 3.0: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.0/7ede5bed2dfd9caa5c3722ed837821eda5052106/pr-922

Backported as #947

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

Copy link

patchback bot commented Nov 16, 2023

Backport to 3.1: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.1/7ede5bed2dfd9caa5c3722ed837821eda5052106/pr-922

Backported as #948

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback bot pushed a commit that referenced this pull request Nov 16, 2023
Fix package repo uniqueness

(cherry picked from commit 7ede5be)
patchback bot pushed a commit that referenced this pull request Nov 16, 2023
Fix package repo uniqueness

(cherry picked from commit 7ede5be)
@quba42 quba42 deleted the fix_package_repo_uniqueness branch November 16, 2023 20:16
hstct added a commit that referenced this pull request Nov 17, 2023
…d9caa5c3722ed837821eda5052106/pr-922

[PR #922/7ede5bed backport][3.0] Fix package repo uniqueness
hstct added a commit that referenced this pull request Nov 17, 2023
…d9caa5c3722ed837821eda5052106/pr-922

[PR #922/7ede5bed backport][3.1] Fix package repo uniqueness
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.bugfix CHANGES/<issue_number>.bugfix Tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pulp_deb duplicate package handling is erroneous
3 participants