pull-mirror sync seems unnecessarily slow #18352

Closed
petergardfjall opened this issue Jan 21, 2022 · 2 comments
Labels
performance/speed performance issues with slow downs type/bug
Milestone

Comments

@petergardfjall
Contributor

petergardfjall commented Jan 21, 2022

Gitea Version

6c7084c

Git Version

2.25.1

Operating System

Linux (Ubuntu 20.04)

How are you running Gitea?

Built from source, run locally on the host.

Database

PostgreSQL

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Description

A mirror-sync operation not only runs `git remote update`, but also tries to sync any repo releases with the available repo tags. For large repositories with many tags, this can be a costly operation, both in time and computational resources.

I noticed that when doing a "plain git mirror" (which doesn't include any releases) of a big repo (such as Kubernetes with about 900 tags), the synchronize operation (`SyncReleasesWithTags`) spent a lot of time (about six minutes) listing/syncing tags with releases.

This appears to be caused by repeated calls like the following (one set for each repo tag):

git show-ref --tags -- v8.2.4477
git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f

In particular, `git rev-list --count` can be heavy for large repos with many commits.
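
To make the cost concrete, here is a minimal Go sketch of the per-tag loop described above. It is not Gitea's code: the `runner` type, `syncTagsOneByOne`, and `countInvocations` are hypothetical names, and the stub stands in for spawning a real `git` process. It only counts how many git invocations a sync of about 900 tags would trigger.

```go
package main

import "fmt"

// runner abstracts "execute git with these arguments". A real one would
// use os/exec; this type is hypothetical and exists only for the sketch.
type runner func(args ...string) string

// syncTagsOneByOne mimics the four-command sequence from the trace above.
// As a simplification, the first call is assumed to return just the SHA.
func syncTagsOneByOne(run runner, tags []string) {
	for _, tag := range tags {
		sha := run("show-ref", "--tags", "--", tag)
		run("cat-file", "-t", sha)
		run("cat-file", "-p", sha)
		run("rev-list", "--count", sha)
	}
}

// countInvocations runs the loop against a counting stub and reports how
// many git processes would have been spawned for numTags tags.
func countInvocations(numTags int) int {
	calls := 0
	stub := func(args ...string) string {
		calls++
		return "0000000000000000000000000000000000000000" // placeholder SHA
	}
	tags := make([]string, numTags)
	for i := range tags {
		tags[i] = fmt.Sprintf("v0.0.%d", i)
	}
	syncTagsOneByOne(stub, tags)
	return calls
}

func main() {
	fmt.Println(countInvocations(900)) // prints 3600
}
```

The count grows linearly with the number of tags, and each `rev-list --count` additionally walks the commit history reachable from that tag.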

It seems like there is an opportunity to improve performance and reduce resource use by making this procedure more efficient for pull-mirrors.

@petergardfjall

This comment was marked as outdated.

@zeripath zeripath added type/bug performance/speed performance issues with slow downs labels Jan 30, 2022
@petergardfjall petergardfjall changed the title A mirror-sync on a mirror without releases is unnecessarily slow pull-mirror sync seems unnecessarily slow Mar 18, 2022
6543 pushed a commit that referenced this issue Mar 31, 2022
This addresses #18352

It aims to improve performance (and resource use) of the `SyncReleasesWithTags` operation for pull-mirrors.

For large repositories with many tags, `SyncReleasesWithTags` can be a costly operation (taking several minutes to complete). The reason is two-fold:
    
1. on sync, every upstream repo tag is compared (for changes) against existing local entries in the release table to ensure that they are up-to-date.
    
2. the procedure for getting _each tag_ involves a series of git operations:
    ```bash
    git show-ref --tags -- v8.2.4477
    git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
    git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
    git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
    ```

    of which `git rev-list --count` can be particularly heavy.
    
This PR optimizes performance for pull-mirrors. Since a pull-mirror is always identical to its upstream, we can rebuild the entire release table on every sync, using a single batch `git for-each-ref .. refs/tags` call to retrieve all tags in one go.
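
As a rough illustration of the batch idea (a sketch under assumptions, not the code in this PR), the Go snippet below builds the arguments for one `git for-each-ref` call and rebuilds a release table from its output. The `Release` struct, `batchArgs`, and `rebuildReleases` are hypothetical names, and the table is an in-memory map rather than a database table. The field names and the `%09` (TAB) escape are taken from `git-for-each-ref(1)`.

```go
package main

import (
	"fmt"
	"strings"
)

// Release is a hypothetical, trimmed-down row of the release table.
type Release struct {
	TagName string
	SHA     string
	Type    string // "tag" for annotated tags, "commit" for lightweight ones
}

// batchArgs returns the arguments for a single git invocation that lists
// every tag, one TAB-separated line per tag.
func batchArgs() []string {
	return []string{
		"for-each-ref",
		"--format=%(refname:lstrip=2)%09%(objectname)%09%(objecttype)",
		"refs/tags",
	}
}

// rebuildReleases builds a fresh table from the batch output. The caller
// is expected to swap it in place of the old table.
func rebuildReleases(output string) (map[string]Release, error) {
	table := make(map[string]Release)
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		if line == "" {
			continue
		}
		fields := strings.Split(line, "\t")
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected for-each-ref line: %q", line)
		}
		table[fields[0]] = Release{TagName: fields[0], SHA: fields[1], Type: fields[2]}
	}
	return table, nil
}

func main() {
	fmt.Println("git", strings.Join(batchArgs(), " "))

	// Sample output standing in for what the command above would print.
	sample := "v1.0.0\t29ab6ce9f36660cffaad3c8789e71162e5db5d2f\ttag\n" +
		"v1.0.1\t0123456789abcdef0123456789abcdef01234567\tcommit\n"
	table, err := rebuildReleases(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(table), table["v1.0.0"].Type)
}
```

Because the table is rebuilt from scratch, no per-tag comparison against existing rows is needed, which is what removes the first cost listed above.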
    
For large mirror repos, with hundreds of annotated tags, this brings down the duration of the sync operation from several minutes to a few seconds. A few unscientific examples run on my local machine:

- https://github.com/spring-projects/spring-boot (223 tags)
  - before: `0m28,673s`
  - after: `0m2,244s`
- https://github.com/kubernetes/kubernetes (890 tags)
  - before: `8m00s`
  - after: `0m8,520s`
- https://github.com/vim/vim (13954 tags)
  - before: `14m20,383s`
  - after: `0m35,467s`
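
For a quick sense of scale, the timings above work out to speedups of roughly 13x, 56x, and 24x. The small Go sketch below shows the arithmetic; `parseTime` and `speedup` are helper names invented for this example, and the parser only handles the `XmY,ZZZs` shape used in the list.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTime converts a shell `time`-style value such as "0m28,673s"
// (comma as decimal separator) into seconds.
func parseTime(s string) (float64, error) {
	s = strings.TrimSuffix(strings.ReplaceAll(s, ",", "."), "s")
	parts := strings.SplitN(s, "m", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("missing minutes in %q", s)
	}
	minutes, err := strconv.ParseFloat(parts[0], 64)
	if err != nil {
		return 0, err
	}
	seconds, err := strconv.ParseFloat(parts[1], 64)
	if err != nil {
		return 0, err
	}
	return minutes*60 + seconds, nil
}

// speedup returns how many times faster the "after" run was.
func speedup(before, after string) (float64, error) {
	b, err := parseTime(before)
	if err != nil {
		return 0, err
	}
	a, err := parseTime(after)
	if err != nil {
		return 0, err
	}
	return b / a, nil
}

func main() {
	runs := []struct{ repo, before, after string }{
		{"spring-boot", "0m28,673s", "0m2,244s"},
		{"kubernetes", "8m00s", "0m8,520s"},
		{"vim", "14m20,383s", "0m35,467s"},
	}
	for _, r := range runs {
		s, err := speedup(r.before, r.after)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-12s %.1fx faster\n", r.repo, s)
	}
}
```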

 

I added a `foreachref` package that provides a flexible way to specify which reference fields are of interest (see `git-for-each-ref(1)`) and to produce a parser for the expected output. These could be reused in other places where `for-each-ref` is used. I'll add unit tests for those if the overall PR looks promising.
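
The general shape of such a field specification plus matching parser can be sketched as follows. This is an illustration only and not the API of the `foreachref` package: the `Format` type, its `Flag` and `Parse` methods, and the choice of ASCII unit/record separators as delimiters are all assumptions made for this example. The `%xx` hex escapes in the format string are documented in `git-for-each-ref(1)`.

```go
package main

import (
	"fmt"
	"strings"
)

// Format is a hypothetical description of the fields requested from
// `git for-each-ref`.
type Format struct {
	Fields []string // field names as listed in git-for-each-ref(1)
}

// Delimiters assumed for this sketch: ASCII unit and record separators,
// which should not appear in ordinary ref data.
const (
	fieldSep  = "\x1f"
	recordSep = "\x1e"
)

// Flag renders the --format argument, writing the separators as the
// %xx hex escapes that for-each-ref understands.
func (f Format) Flag() string {
	parts := make([]string, len(f.Fields))
	for i, name := range f.Fields {
		parts[i] = "%(" + name + ")"
	}
	return "--format=" + strings.Join(parts, "%1f") + "%1e"
}

// Parse splits output produced with f.Flag() into one map per
// reference, keyed by field name.
func (f Format) Parse(out string) ([]map[string]string, error) {
	var refs []map[string]string
	for _, rec := range strings.Split(out, recordSep) {
		rec = strings.TrimPrefix(rec, "\n") // git ends each entry with a newline
		if rec == "" {
			continue
		}
		values := strings.Split(rec, fieldSep)
		if len(values) != len(f.Fields) {
			return nil, fmt.Errorf("got %d fields, want %d", len(values), len(f.Fields))
		}
		ref := make(map[string]string, len(values))
		for i, name := range f.Fields {
			ref[name] = values[i]
		}
		refs = append(refs, ref)
	}
	return refs, nil
}

func main() {
	f := Format{Fields: []string{"refname:lstrip=2", "objectname", "objecttype"}}
	fmt.Println(f.Flag())

	// Sample output standing in for a real for-each-ref run.
	out := "v1.0.0" + fieldSep + "29ab6ce9f36660cffaad3c8789e71162e5db5d2f" + fieldSep + "tag" + recordSep + "\n" +
		"v1.0.1" + fieldSep + "0123456789abcdef0123456789abcdef01234567" + fieldSep + "commit" + recordSep + "\n"
	refs, err := f.Parse(out)
	if err != nil {
		panic(err)
	}
	for _, ref := range refs {
		fmt.Println(ref["refname:lstrip=2"], ref["objecttype"])
	}
}
```

Deriving both the flag and the parser from the same field list keeps them in sync, and using explicit separators rather than newlines leaves room for multi-line fields such as `contents`.
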
@petergardfjall
Contributor Author

Closing this issue since #19125 is now merged.

@lunny lunny added this to the 1.17.0 milestone Mar 31, 2022
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022